Running SQL statement from within a function called from a query (1 Viewer)

GPGeorge

Grover Park George
Local time
Today, 04:21
Joined
Nov 25, 2004
Messages
1,775
Here's a thought... I see you are from the UK. Your postal system has certain "locale" codes - you might call them something different, of course. We in the USA call the postal area codes "ZIP codes". If you pre-sort everything and compute distances only for things within the same postal codes, that will for a pre-filter that should help reduce the size of your data set to something manageable.

Further, if you can use your postal data to determine which postal codes are adjacent to each other, that should take care of reducing the address set considerably. With 3000 client locations, they are SURELY spread out over several postal areas, so that should give you a tremendous yet practical filter. You would probably only want to work on addresses for postal codes that are the same or that are geographically adjacent.

After all, (to give you a USA example), our postal code for a part of the city of Los Angeles is 90210. New Orleans is 70121. I live in Metairie, which is 70006. If we look up distances, 90210 is a couple of thousand miles from 70121 and 70006 - but 70121 and 70006 are less than 10 miles from each other. So when computing distances, you could skip looking at 90210 and 70121 - but would continue to work on 70006 vs 70121 because they are much closer. If you decide that the postal area centers are more than, say, 50 miles apart, you could probably skip the computation and not bother with storing those distances.

Does this approach make sense?
Sometimes reading these discussions, wild ideas pop into my head. Would it be possible to integrate somehow with Google maps for some part of this task?
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Here's a thought... I see you are from the UK. Your postal system has certain "locale" codes - you might call them something different, of course. We in the USA call the postal area codes "ZIP codes". If you pre-sort everything and compute distances only for things within the same postal codes, that will for a pre-filter that should help reduce the size of your data set to something manageable.

Further, if you can use your postal data to determine which postal codes are adjacent to each other, that should take care of reducing the address set considerably. With 3000 client locations, they are SURELY spread out over several postal areas, so that should give you a tremendous yet practical filter. You would probably only want to work on addresses for postal codes that are the same or that are geographically adjacent.

After all, (to give you a USA example), our postal code for a part of the city of Los Angeles is 90210. New Orleans is 70121. I live in Metairie, which is 70006. If we look up distances, 90210 is a couple of thousand miles from 70121 and 70006 - but 70121 and 70006 are less than 10 miles from each other. So when computing distances, you could skip looking at 90210 and 70121 - but would continue to work on 70006 vs 70121 because they are much closer. If you decide that the postal area centers are more than, say, 50 miles apart, you could probably skip the computation and not bother with storing those distances.

Does this approach make sense?
Hi

There's a lot of sense there - I guess I have been using a bit of a 'dumb' way of going about this, and if I could treate adjacent postcodes (as we call them in the UK!) differently to ones that are further out, I could indeed cut down on the number of calculations needed.

There's no point me looping though thousands of codes that are nowhere near the address I want, if the ideas is to get the closest ones.

You've given me some food for thought there.... definitely worth pursuing.
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Sometimes reading these discussions, wild ideas pop into my head. Would it be possible to integrate somehow with Google maps for some part of this task?

I think the calculation I have which compares longitude/latitude is pretty fast for Access to run, so I don't think we could speed it up by using Google Maps.

If I needed a 'by road' distance instead of a 'as the crow flies' distance then I'd need some sort of mapping calculation, but fortunately I haven't been asked for that.

If I needed the logitude/latitude I might have to use google maps as well, but luckily I have those.

What I might use it for is seeing if I can do something with your idea of the proximity of postcode areas (as a way of eliminating the need to go through every postcode in the country). I know for example that TR is Cornwall (South West of UK), so it's no use whatsoever doing distance calculations among the ZE postcodes (far North).
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 06:21
Joined
Feb 28, 2001
Messages
26,996
I know for example that TR is Cornwall (South West of UK), so it's no use whatsoever doing distance calculations among the ZE postcodes (far North).

Exactly the point of the post codes suggestion.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 07:21
Joined
Feb 19, 2002
Messages
42,970
I don't understand why you are running an update query inside an update query. Be aware that a function running inside an update query that accesses data, is accessing data in the table, NOT in the workspace of a query. So if the update query is updating data the function is retrieving, the function will retrieve the original value, not the updated value.
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
I don't understand why you are running an update query inside an update query. Be aware that a function running inside an update query that accesses data, is accessing data in the table, NOT in the workspace of a query. So if the update query is updating data the function is retrieving, the function will retrieve the original value, not the updated value.
Hi Pat

That's what I was asking about originally - I didn't understand why it wouldn't work. I'm now back to using recordsets as a couple of people have suggested.
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Well, I just implemented Auntiejack56's suggestion of aborting the loop after the restaurant latitude (which is in order in the restaurant array) gets too far away from the address latitude.

I found that a difference of 1 in the latitude is 69 miles, which is more than enough for a restaurant to be considered local!

This has taken me down to a time of 23 minutes for 1 million records, so a vast improvement!

This is the same sort of logic as Doc_Man's idea of postcode near each other - I was thinking of putting a table together listing each postcode area - CB, NG, PO etc - alongside it's nearest neighbours. That way the code would only have to check the nearest ones, but just exiting the loop when the latitude gets too far away seems a much simpler approach and it's clearly very effective.
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 07:21
Joined
May 21, 2018
Messages
8,463
@Blueskies
I do not know if you are still doing this, but if interested I coded the nearest neighbor algorithm using a K-d tree. FYI, this was a huge pain there are so many places a little mistake can change the results, and very hard to debug and see your problems. However, once the tree is loaded it only takes less than a second to find the closest neighbor for a given address. (with 25k addresses)

I caution you on some of the strategies provided by @The_Doc_Man and @Auntiejack56 . Searching only in a grid will fail often on boundary conditions.
This can be seen below. If red is the target address there are 3 much closer restaurants in postcode 2,3,4 then in its own postcode 1.
postcode.jpg


The files you sent only had 25K unique addresses. I think the precision on the lat long was reduced. If you want I can test the whole thing and see what this really takes. Once I load the KD-tree (I do not know how long that takes, maybe twice as long as loading the array). After that to find the closes for each 3k restaurant should take a couple of minutes I would assume.

For me this would be real world. If I searched only in the large very narrow and wide county I live in the closest restaurant would be much further then the closest in the next county.
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Hi MajP

I'm actually waiting on the full data from a colleague before I do any more with this - as you've said, there was some duplication in the data, so I knew I could only use it for testing the code for speed, which has been a lot better than I was getting originally thanks to those who contributed to this post.

I was going to go back to the project when I got the real data, so if it's OK with you I'll upload the it for you to try out when I get it and we can compare results.

Thanks for the effort you've obviously put into this - it's great to find people on the forum who get interested in these things and are prepared to spend time investigating them.

I'll post as soon as I get the data - you've reminded me to chase it up!
 

Mike Krailo

Well-known member
Local time
Today, 07:21
Joined
Mar 28, 2020
Messages
1,030
Fascinating thread, can't wait to see the final iteration of this code with all the address data.
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 07:21
Joined
May 21, 2018
Messages
8,463
@Blueskies,
If you can explain your process of logging the results, that would help. What do you start with, what queries or recordsets you run, and what is logged? If you are calculating the distance, finding the nearest, and logging results in 23 minutes, that is probably a very good approach.

I coded the solution from post 18, which in theory should be pretty fast, but in application it is not. Using an array as @Auntiejack56 posted you loop each restaurant for each of the 5M addresses. So loop 3K times per address. With the K-D algorithm you loop about 30 times for each record so it should be about 100 times faster. With the straight array you have to calculate every distance, with the KD-Tree you calculate only a portion of the records reducing time further. So I can post the database, but not much benefit unless I can figure out what is slowing things down. So I think the algorithm is good, just need a better implementation. I will keep looking at it.

As for your addresses they are mostly duplicates. Example. My guess is you are rounding the lat long to too few places to make them unique.
53.60,-2.40,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
53.60,-2.40,0,0
52.75,-1.20,0,0
53.60,-2.40,0,0
52.75,-1.21,0,0
52.75,-1.20,0,0
52.75,-1.20,0,0
53.60,-2.40,0,0
52.75,-1.21,0,0
52.75,-1.21,0,0
53.60,-2.40,0,0
53.60,-2.40,0,0
53.60,-2.40,0,0
52.75,-1.21,0,0
52.75,-1.21,0,0
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 07:21
Joined
May 21, 2018
Messages
8,463
Here is my demo, but the problem is it is not fast. I search my code to see if I am doing something dumb like a debug.print in the loop, but I cannot see anything. I demoed 2 things.
1. Like @Auntiejack56 suggested. I read all 3000 restaurants into an array. Then I can loop through the 5M records (I only used 25K for the demo) and loop the array 3K times and store the smallest distance. So this requires 3000 x 5M steps.
2. As per the link in 18 I created a KD-Tree. This is a binary tree with the difference is that you swap every level of the tree on what value you compare to determine if you add the node to the left or right. You first compare by lat, then long, then lat, then long.... Like in step 1 you loop the 5M records but with the KD tree instead of looping 3000 restaurants you only compare about 30. (approximately O log N)

The demo allows you to try different numbers of addresses so you can look at times to complete. I would try about 100 to start. It gets lengthy after that. It took 6:52 for the array and 21 seconds for the KDTree to process 2500 addresses. This performance difference is what I would have expected. I actually expected it to be about 100 times faster. However, both methods IMO should be much faster, and I think there is something very inefficient in my application, I just cannot see it. I think there is something slow in the distance calculation, because I did some timing and it does not seem to have do with writing to the table or even looping the records. 21 seconds for 2,500 records seems OK until you do the math and that is like 13hours to run 5M

So I am definitely curious to see the implementation where you can do a million records in 23 minutes.
So I doubt I provided any workable solution, but something to think about.
NN.jpg
 

Attachments

  • NearestNeighbor V2.accdb
    1.8 MB · Views: 115

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Hi MajP

I've got the revised data now, without duplicates - see attached. I added the duplicates to build up the numbers from the smaller data sets I was given. It was important to see how fast the process was and if it could cope with the larger numbers.

I've got 1928 unique restaurants and 20 million unique addresses here to play with.

I haven't gone over your accdb yet - thanks for posting that, but I'll do some tidying up on what I've been using and post it shortly.

Cheers
 

Attachments

  • New Data.zip
    5.4 MB · Views: 96

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
OK, here's the module I've been using (I haven't got to the stage of setting up as a form yet!).

I've got this in a 'front end' database with the Restaurant and Address Lists linked from a 'back end'. I've been running 5 million records at a time through this.

Address table is named DATA1 and the Restaurants is 'Restaurants with dups' (included duplicates to get the numbers up as discussed). The 'goFindClosestNosh' function (love that name!) is courtesy of Auntiejack56 with a few small tweaks. I'm afraid I forget where the 'get_distance' function came from - possibly even this forum!

It reports progress to the debug window.

Code:
Option Compare Database

Sub START_HERE()

    Dim start_time As Date
    Dim end_time As Date
    Dim mins As Integer

    start_time = Now
    
    Make_new_summary_table
    
    Debug.Print ""
    
    Debug.Print "Start at " & Format(Now, "HH:mm:ss")
          
    goFindClosestNosh "DATA 1", "Restaurants with dups", start_time, 1000000
  
        end_time = Now
    
    mins = DateDiff("n", start_time, end_time)
    
    Debug.Print ""
    Debug.Print "Complete. Time taken:" & vbNewLine & vbNewLine & Format((mins \ 60), "00") & " hours, " & Format(mins Mod 60, "00") & " minutes"
    
      

End Sub



Function goFindClosestNosh(add_tab As String, rest_tab As String, start_time As Date, record_limit)


    DAO.DBEngine.SetOption dbMaxLocksPerFile, 50000

    Dim db As DAO.Database, aa As DAO.recordset, rs As DAO.recordset, strSQL As String
    Dim n As Long, minDist As Double, calcDist As Double, saveID As Long
    Dim arrRests
    Dim hourly_rate As Long
    
    Dim add_lon As Double
    Dim add_lat As Double
    Dim rest_lon As Double
    Dim rest_lat As Double
    
    Dim Progress_Amount As Long
    
    Set db = CurrentDb
    
    strSQL = "SELECT r.ID, r.Lat, r.Lon FROM [" & rest_tab & "] as r ORDER BY r.Lat;"
    Set aa = db.OpenRecordset(strSQL, dbOpenDynaset)
    arrRests = aa.GetRows(9999)
    
    strSQL = "SELECT t.ID, t.Lat, t.Lon, t.RestaurantID, t.RestaurantDistance from [" & add_tab & "] as t;"
    Set rs = db.OpenRecordset(strSQL, dbOpenDynaset)
    
    rs.MoveLast
    
    rec_count = rs.RecordCount
    
    rs.MoveFirst
      
    If rec_count < 10000 Then
        progress_report_interval = Int(rec_count / 100)
    Else
        progress_report_interval = 1000
    End If
    
    Progress_Amount = 1
    
    
    Do Until rs.EOF
        minDist = 10000
        
        add_lat = rs("Lat")
        add_lon = rs("Lon")
            
        For n = 0 To UBound(arrRests, 2)
                    
            rest_lat = arrRests(1, n)
            rest_lon = arrRests(2, n)
            
            If rest_lat - add_lat > 1 Then
                Exit For
            End If
                                  
            calcDist = distance(add_lat, add_lon, rest_lat, rest_lon)
            If calcDist < minDist Then
                minDist = calcDist
                saveID = arrRests(0, n)
            End If
        Next
        
        rs.Edit
        rs!RestaurantID = saveID
        rs!RestaurantDistance = minDist
        rs.Update
        
        
        If Progress_Amount Mod progress_report_interval = 0 Then
                      
            hourly_rate = get_hourly_rate(start_time, Now, Progress_Amount)
            
            Debug.Print "Progress: " & Progress_Amount & " at " & Format(Now, "HH:mm:ss") & " Rate: " & hourly_rate & " records per hour"
            DoEvents
            
        End If
        
        Progress_Amount = Progress_Amount + 1
        
        If Progress_Amount > record_limit Then Exit Do
        
        rs.MoveNext
      
    
    Loop
    
    rs.Close
    db.Close
    
End Function

Function get_distance(lat1 As Double, lon1 As Double, lat2 As Double, lon2 As Double) As Double

    Dim pi As Double
    
    pi = 3.14159265358979
    
    get_distance = (Sin((lat2 * pi) / 180)) * (Sin((lat1 * _
    pi) / 180)) + (Cos((lat2 * pi) / 180)) * _
    ((Cos((lat1 * pi) / 180))) * _
    (Cos((lon1 - lon2) * (pi / 180)))
    
    On Error GoTo errhandler
    
    get_distance = 6371 * (Atn(-distance / Sqr(-distance * distance + 1)) + 2 * Atn(1)) ' this is km
    
    get_distance = distance / 1.609344 ' this is miles
    
    get_distance = Abs(distance)
    
    Exit Function
    
errhandler:
    
    get_distance = 0


End Function
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 07:21
Joined
May 21, 2018
Messages
8,463
Your restaurant table is fine and not really an issue impacting the solution. The address table however is still mostly dupes. For test purposes that does not matter either. Once the file open this is the top rows.
Code:
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.15,-2.09,0,0
57.14,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.15,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0
57.14,-2.09,0,0

The get distance function for lat, long distance on a globe (accounts for curvature) is called the Haversine formula. It will work. You can simplify this to a flat projection. The reason is for close points the curvature of the earth will have little to no impact and all far points are rejected anyways.
Code:
   x = (radLongEnd - radLongStart) * Cos((radLatStart + radLatEnd) / 2)
   y = radLatEnd - radLatStart
   d = Sqr(x ^ 2 + y ^ 2) * radiusOfEarth

I do not believe as written your distance function works anyways. I do not see any conversion to radians which is required.

What you have coded is basically what I did in my demo part 1. However, I would have thought my implementation was more efficient since I am doing an update query instead of leaving that large rs open.
Are you really saying you can process 1 million records in 23 minutes? Like I said, I am doing basically the exact same thing and not getting anywhere remotely close to times like that. If that is really possible, then I should be able to do that about 30 times faster with the KD tree. Even though my code using the array is basically the same as you are doing, I think something is causing it to be slow possibly in my distance calculations. So if I can figure it out then I should be able to implement the KD tree faster since it calls the distance calculations fewer times.

I am wondering if this can be done faster with just an update query called repeated times. Imagine there was only 10 addresses and 3000 restaurants. In pure SQL to include the distance formula you could write a query to find the nearest restaurant and distance, and use that as an update. Then add a datetime stamp to your address table. Then figure out how big you can scale up that query. My assumption is that it will quickly choke, but say it works well for 5k records. If it works fine for 5k addresses then I wonder if repeatedly calling that query selecting the next 5k addresses that have not been processed yet would be quicker.

Is your backend SQL server? If so I would try to put it on there and see what is possible. There are a lot more options.
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Hi

Wow, you are clearly more knowledgeable about this than I could ever hope to be!

To answer your query re the duplication - 2 issues here:

1. It looks like some of the detail was lost in the export from Access. I originally got the file as a csv and imported it with these fields as long integers, so the top few rows look like the below. When I exported to text, it lost a few of the decimal places, resulting in the file you've seen.



PAF_URNLatitudeLongitude
49​
57.1493​
-2.097807​
130​
57.14993​
-2.09761​
132​
57.14993​
-2.09761​
133​
57.14993​
-2.09761​
134​
57.14993​
-2.09761​
135​
57.14993​
-2.09761​
136​
57.14993​
-2.09761​
140​
57.15003​
-2.097346​
141​
57.14993​
-2.09761​
143​
57.15003​
-2.097346​
144​
57.15003​
-2.097346​
145​
57.15003​
-2.097346​
146​
57.15003​
-2.097346​
147​
57.15003​
-2.097346​
149​
57.15003​
-2.097346​
150​
57.15003​
-2.097346​
151​
57.14993​
-2.09761​
152​
57.14993​
-2.09761​
153​
57.14993​
-2.09761​
155​
57.15052​
-2.096835​

2. The other reason for the duplication, even with the extra detail to the lon/lats shown above, is probably because some of these addresses will be flats in a larger builing or something like that. I didn't give the PAF_URN number in the file I uploaded, thinking it wouldn't be needed, but my colleague did provide it and you can see this changes and shows the addresses are different and 'legit', even if the co-ords are the same.

This of course raises the possibility of just doing one distance calculation on addresses with the same lat/lon, so a time saving to be had there.

I'll try to export the data properly ASAP and post it here.

Re the backend: I'm just using another Access database. I've been meaning to get around to setting up a MySQL server for a few weeks now (I haven't had a need to use any form of SQL server for the last 7 years or so), so probably a good time to get on with it.

As for the time running the code takes, I'll set another run going shortly and make sure. Hopefully I haven't made that time up somehow!
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
OK, I've opened the 20 million record text file and can see the details is there.

I've split it into files of 5 million lines each to make it easier to import into Access (due to the header being counted as a line, the final file is 1 record!)

It looks like the file is too big now to upload in the forum though, so I've put it here:

wetransfer
 

Blueskies

Registered User.
Local time
Today, 11:21
Joined
Oct 23, 2009
Messages
69
Hi

I just ran another test an it did 1,000,000 records in 25 minutes. I've picked a random result to test:


LatLonRestaurantIdRestaurantDistance
51.59114​
-2.925567​
1146​
2.43659047751239​

According to Google maps, this address is Playford Cres, Newport NP19 9HY.

The restaurant Id 1146 looks like it's actually the ID of the table, not the 'Restaurant Id' field, so I'll have to correct that! - however it has these co-ords:


LatLon
51.6192238​
-2.9599093​

The restaurant address is Ponthir Rd, Caerleon, Newport NP18 3XL

I've got a route distance between the 2 points of 3.2 miles from Google maps, so if you take into account the fact that the lat/lon function is giving the 2.4 mile distance as the crow flies, that's probably checks out OK.
 

Users who are viewing this thread

Top Bottom