Subtracting Consecutive Records in Large Dataset

GDTRFB

Hi all,
I have a dataset which lists consecutive snapshot readings grouped by different sources.


Example dataset


PK  Location  Source      Timestamp        ReadingValue_1  ReadingValue_2
1   HouseA    GasMeter    12/1/18 08:00    123.12          146.12
2   HouseA    GasMeter    12/1/18 08:15    124.15          154.23
3   HouseB    GasMeter    11/15/18 09:00   46.35           100.98
4   HouseB    GasMeter    11/15/18 09:15   48.34           102.45
5   HouseB    WaterMeter  1/15/18 19:00    46.35           100.98
6   HouseB    WaterMeter  1/15/18 19:15    58.34           132.45


The PK is an autonumber.


Goal: I would like to subtract consecutive readings from each other.


For example, for HouseA:
ReadingValue_1_Diff = 124.15 - 123.12
ReadingValue_2_Diff = 154.23 - 146.12


This is a large dataset. There are roughly 35,000 readings for each Location/Source combination.


I’ve tried to do this before with subqueries and found it very slow for large datasets, so I’m attempting to do it with a VBA function called from a query, but I’m having some trouble.


My approach is to:

  1. Create a new query to organize the data and call the VBA function
  2. Create a function with a single argument to capture the PK
  3. Create a recordset based on the datasource table
  4. The SQL statement orders the data by location/source and timestamp
  5. Sync up the recordset to the query using rs.Move
  6. Check whether the Location/Source combination matches between the current and next record
  7. Find the next record values using rs.MoveNext
  8. Perform the subtraction calculation
Code:
  Public Function MeterReadings(PKID As Long) As Variant
   
  Dim cnn As ADODB.Connection
  Dim rs As New ADODB.Recordset
   
  Dim sSql As String
  Dim CurrentLocation As String
  Dim CurrentSource As String
  Dim NextLocation As String
  Dim NextSource As String
  Dim CurrentReading As Double
  Dim NextReading As Double
   
  sSql = "SELECT * FROM tbl_Readings PI ORDER BY Location, Source, TimeStamp"
   
  Set cnn = CurrentProject.Connection
  rs.Open sSql, cnn
  'Position the recordset on the record matching the calling query's PK
  rs.Move PKID - 1
   
  If Not rs.EOF Then
   
      CurrentLocation = rs!Location
      CurrentSource = rs!Source
      CurrentReading = rs!ReadingValue_1
   
      rs.MoveNext
      If Not rs.EOF Then
          NextLocation = rs!Location
          NextSource = rs!Source
          NextReading = rs!ReadingValue_1
   
          'Only subtract when both records share the same Location/Source
          If CurrentLocation = NextLocation And CurrentSource = NextSource Then
              MeterReadings = NextReading - CurrentReading
          End If
      End If
  End If
   
  rs.Close
  cnn.Close
   
  End Function


When I call this code from the query it freezes up MS Access and I must force an exit.

Can someone help me with what I’m doing wrong?

Many Thanks
 

Dreamweaver

Your data needs to be normalised into at least 3 tables.
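A minimal sketch of one possible split, with illustrative table and field names (each CREATE TABLE would be run as its own DDL query, since Access executes one DDL statement at a time):

Code:
CREATE TABLE tblLocations (
    LocationID AUTOINCREMENT PRIMARY KEY,
    LocationName TEXT(50)
);

CREATE TABLE tblSources (
    SourceID AUTOINCREMENT PRIMARY KEY,
    SourceName TEXT(50)
);

CREATE TABLE tblReadings (
    PK AUTOINCREMENT PRIMARY KEY,
    LocationID LONG REFERENCES tblLocations (LocationID),
    SourceID LONG REFERENCES tblSources (SourceID),
    ReadingTime DATETIME,
    ReadingValue_1 DOUBLE,
    ReadingValue_2 DOUBLE
);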


I don't really use ADO, but is this right?
rs.Move PKID - 1
Wouldn't it move the pointer too far?
 

theDBguy

Hi,

Not sure if this is anywhere close to what you’re doing, but take a look at my DDiff() function.
 

Eugene-LS

Try:
Code:
sSql = "SELECT * FROM tbl_Readings ORDER BY Location, Source, TimeStamp"
(no "PI" in the SQL string)

or, for a test:
Code:
sSql = "SELECT TOP 100 * FROM tbl_Readings ORDER BY Location, Source, TimeStamp"
 

MajP

These can be very resource-expensive queries. If it were me, I would create a temp table (if you're worried about bloat, do it in an external database). Then I would make a table that simply puts the proper consecutive IDs together. So from your example I'd simply write:

PK1 PK2
1   2
3   4
5   6

Once you create that table (even with 35k records it should not take long), you can do two simple joins: join PK1 to your table as FirstRecord and PK2 to your table as SecondRecord. Then you can directly subtract SecondRecord.ReadingValue_1 - FirstRecord.ReadingValue_1, and do the same for the second reading. A sketch of that query is below.
 

isladogs

Thought I'd do a quick demo of what I was suggesting

It has a query containing 2 copies of the table in which paired records are linked


The SQL is
Code:
SELECT tblMeterReadings.Location, tblMeterReadings.Source, [tblMeterReadings].[ReadingValue1]-[tblMeterReadings_1].[ReadingValue1] AS RV1Diff, [tblMeterReadings].[ReadingValue2]-[tblMeterReadings_1].[ReadingValue2] AS RV2Diff
FROM tblMeterReadings INNER JOIN tblMeterReadings AS tblMeterReadings_1 ON (tblMeterReadings.Source = tblMeterReadings_1.Source) AND (tblMeterReadings.Location = tblMeterReadings_1.Location)
WHERE (((tblMeterReadings.PK)=[tblMeterReadings_1].[PK]+1));

Results:


I've included the Serialize function that I mentioned, but it's not actually used here. It would, however, be needed if the paired readings are not consecutive records. In that case, it would be used to create a 'rank' order sorting by location, source & timestamp (see the similar example in the link from my first reply).

Although I hear what MajP is saying about resource-intensive queries, this is so simple that I don't think it will be a problem using 35,000 records. Suggest you try it and see....:)
 

Attachments

  • queryDesign.PNG (16.7 KB)
  • results.PNG (7.8 KB)
  • GDTRFB.zip (23.8 KB)

MajP

Although I hear what MajP is saying about resource-intensive queries, this is so simple that I don't think it will be a problem using 35,000 records.
I agree with you, but I am pessimistic and assumed you would never be lucky enough to have all 35k records in consecutive order by PK. But if this is some automated reading, it might be.
 

isladogs

I agree with you, but I am pessimistic and assumed you would never be lucky enough to have all 35k records in consecutive order by PK. But if this is some automated reading, it might be.

That's why I referred to the Serialize function in my previous reply. If the records aren't consecutive, that is used to sort them into the desired 'rank' order so they become consecutive. Very quick in practice, with minimal CPU overhead even for large datasets.
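For anyone who prefers to avoid the VBA call, a correlated subquery can produce the same rank in pure Jet SQL. A minimal sketch using the demo table names (this is not the Serialize function, and its performance on a large set would need testing):

Code:
SELECT T1.PK, T1.Location, T1.Source, T1.Timestamp,
       (SELECT COUNT(*)
        FROM tblMeterReadings AS T2
        WHERE T2.Location = T1.Location
          AND T2.Source = T1.Source
          AND T2.Timestamp <= T1.Timestamp) AS ReadingRank
FROM tblMeterReadings AS T1
ORDER BY T1.Location, T1.Source, T1.Timestamp;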
 

MajP

If you want to try it in code, this function is orders of magnitude more efficient than yours. For every record you only need to return two records: the current timestamp for that location and source, and the previous timestamp if it exists. You are returning 35k records, and every field, 35k times. That is 1,225,000,000 records instead of 75,000. Then you have to search for the correct PK. In my solution, if two records are returned, they are the current record and the previous record. You also only need to return a single field.
Code:
Public Function MeterReadings(Location As Variant, Source As Variant, TimeStamp As Variant) As Variant
  'Do not use ADO when DAO is faster in a local db
  Dim RS As DAO.Recordset
  Dim sSql As String
  Dim CurrentReading As Double
  If IsNull(Source) Or IsNull(Location) Or IsNull(TimeStamp) Then Exit Function
  'Do NOT return 35,000 records. Return 2.
  'Do not return every field.
  'Sort descending so the first row is the current reading and the second row is the previous one
  sSql = "SELECT TOP 2 ReadingValue1 FROM tblMeterReadings WHERE Location = '" & Location & "' AND Source = '" & Source & "'"
  sSql = sSql & " AND TimeStamp <= #" & TimeStamp & "# ORDER BY TimeStamp DESC"
  'Debug.Print sSql   'uncomment to inspect the SQL
  Set RS = CurrentDb.OpenRecordset(sSql)
     
  If Not RS.EOF And Not RS.BOF Then
    RS.MoveLast    'populate RecordCount
    RS.MoveFirst
    If RS.RecordCount = 2 Then
      CurrentReading = RS!ReadingValue1    'the current (later) reading
      RS.MoveNext                          'now on the previous reading
      MeterReadings = CurrentReading - RS!ReadingValue1
    End If
  End If
  RS.Close
End Function

Code:
SELECT tblmeterreadings.pk, 
       tblmeterreadings.location, 
       tblmeterreadings.source, 
       tblmeterreadings.TIMESTAMP, 
       tblmeterreadings.readingvalue1, 
       Meterreadings([location], [source], [timestamp]) AS Diff 
FROM   tblmeterreadings 
ORDER  BY tblmeterreadings.location, 
          tblmeterreadings.source, 
          tblmeterreadings.TIMESTAMP;
With European regional settings you may need to wrap the timestamp in Format() to force MM/DD/YYYY hh:nn:ss when building the SQL string.

If you want to try a subquery
Code:
SELECT Latter.pk, Latter.location, Latter.source, Latter.TIMESTAMP, Prior.TIMESTAMP, Latter.readingvalue1, Prior.readingvalue1, Latter.readingvalue2, Prior.readingvalue2, [latter].[readingvalue1] - [prior].[readingvalue1] AS DiffReading1, [latter].[readingvalue2] - [prior].[readingvalue2] AS DiffReading2
FROM tblmeterreadings AS Latter INNER JOIN tblmeterreadings AS [Prior] ON ( Latter.source = Prior.source ) AND ( Latter.location = Prior.location )
WHERE (( ( Prior.TIMESTAMP ) = (SELECT TOP 1 [tblmeterreadings].[timestamp] 
                                 FROM   [tblmeterreadings] 
                                 WHERE  [latter].[timestamp] > 
                                        [tblmeterreadings].[timestamp] 
                                        AND [tblmeterreadings].[location] = 
                                            [latter].location 
                                        AND [latter].[source] = 
                                            [tblmeterreadings].[source] 
                                 ORDER  BY [tblmeterreadings].TIMESTAMP) ))
ORDER BY Latter.location, Latter.source, Latter.TIMESTAMP;
 

isladogs

@GDTRFB
If & when you reply to this thread, it would be interesting to compare the speed of the solutions that MajP & I are suggesting on your full dataset of 35000 records.

If the data isn't confidential, could you zip & upload your dataset for comparison?

@MajP
75000? Typo?
 

MajP

Sorry, 70k. Since this is a function, it returns a recordset for each record. There are 35k records, and 35k calls to the function. The OP's function returns 35k records for each of the 35k calls. The Serialize function returns 35k records each time it is called. I return 2 records for each call to the function. I believe that is 2*35k = 70k.
 

MajP

If you do the speed test on a large set, I would be curious about the subquery vs the functions that create a recordset. Both should be slow, but which is slower?
 

GDTRFB

Hi everyone,


Thanks for all the recommendations. A little more background on my situation. The dataset comes from a data warehouse in a denormalized format ordered chronologically by timestamp.



Each source/location and value combination includes roughly 35k records. In my initial dataset this corresponds to >246k records total (so far).



Then, I import the data into MS Access and assign an autonumber PK.



Isladogs, thank you for your query suggestion. I tried the self-join query option and it seems to be working pretty well. What a straightforward solution; it keeps reminding me of the power of SQL. I have the ability to "set up" the PK values so they are sequential, so this was an easy solution to implement.

Execution time for 246k records = 10 sec





I have not tried the "serial ranking" option yet because I was able to control the PK via my import process, but in the future I might not always have this ability, so I'm going to try the "serial ranking" option this weekend. Isladogs, unfortunately, I cannot share the dataset due to confidentiality, but I will implement the solutions and report back execution times.



MajP, thank you for your VBA suggestion. I want to figure out how to solve this problem using VBA so I can explore reworking some older solutions I came up with using subqueries, which as you mentioned can become painfully slow. I will try this out over the weekend and report back.


Thanks again everyone.
 

MajP

Understand that Serialize is a generic function, but as written it is very inefficient for this problem. You could tailor it to be far more efficient for a specific large set. To find the absolute position of the 10th record in sequence, you only need to pull the timestamps less than or equal to the current timestamp for that source and that location. So you only need to return 10 records and count them, instead of pulling 35k and searching for that PK.
To find the absolute position of the first ordered record you need to return 1 record,
for the second ordered record, 2 records,
... and for the 34,999th, 34,999 records. A sketch of that tailored lookup is below.
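A minimal sketch of that tailored lookup as a parameter query, using the demo table names (the parameters stand in for the current record's values):

Code:
PARAMETERS pLocation TEXT(255), pSource TEXT(255), pTimestamp DATETIME;
SELECT COUNT(*) AS AbsPosition
FROM tblMeterReadings
WHERE Location = pLocation
  AND Source = pSource
  AND TimeStamp <= pTimestamp;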
 

GDTRFB

MajP, absolutely. I really liked your idea of limiting the recordset to only return what I need to do the math.
 

isladogs

@GDTRFB
I strongly recommend you index all fields that are going to be used in your query.
Whichever method you use, this should make your task much faster.
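For example, a composite index covering the fields used for joining and sorting can be created with a DDL query (the index name is just illustrative):

Code:
CREATE INDEX idxLocationSourceTime
ON tblMeterReadings (Location, Source, TimeStamp);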

Personally I think all the solutions have merit.
Looping through recordsets & subqueries are generally slow compared to queries.

If your records are consecutive, I would still expect a 'simple' query to be faster.
If you need to rank first using Serialize, that may well no longer be the case.
Getting the rank order using Serialize could be slow, but if the fields used to determine the rank are indexed it can be surprisingly fast.
I just tested a linked SQL table containing 42,000 records using a composite key of 2 indexed text fields for ranking. It took approx 2.5 seconds to rank these.

UPDATE:
I've attached an updated version including the use of the serialize function.
This has 2 queries:

qryRankOrder:
SELECT Serialize("qryRankOrder","MeterID",[MeterID]) AS Rank, tblMeterReadings.Location, tblMeterReadings.Source, tblMeterReadings.Timestamp, CDbl([Timestamp]) & " _" & [Location] & "_" & [Source] AS MeterID, tblMeterReadings.ReadingValue1, tblMeterReadings.ReadingValue2
FROM tblMeterReadings
ORDER BY tblMeterReadings.Location, tblMeterReadings.Source, tblMeterReadings.Timestamp;

The composite MeterID field is used for the rank order. As it doesn't reference the PK field at all, you can try it with your existing dataset if you wish.

The second query uses two copies of qryRankOrder in a similar way to the original solution I supplied

qryMeterReadings_v2:
SELECT qryRankOrder.Location, qryRankOrder.Source, [qryRankOrder].[ReadingValue1]-[qryRankOrder_1].[ReadingValue1] AS RV1Diff, [qryRankOrder].[ReadingValue2]-[qryRankOrder_1].[ReadingValue2] AS RV2Diff
FROM qryRankOrder INNER JOIN qryRankOrder AS qryRankOrder_1 ON (qryRankOrder.Source = qryRankOrder_1.Source) AND (qryRankOrder.Location = qryRankOrder_1.Location)
WHERE (((qryRankOrder.Rank)=[qryRankOrder_1].[Rank]+1));

I really don't know what the comparative performance will be for the different methods, so I look forward to your feedback when you've had time to test.
Don't forget to index!
 

Attachments

  • GDTRFB_v2.zip (24.7 KB)
