"Fat Fingering" Query

DQuick · May 10, 2013

Hey guys,
I have been trying to find a solution to this for about an hour and haven't been able to come up with much. :banghead:

Task:
Compare two fields that should have the same number in each.
Return results that are off by a digit or two.

Example:
Tracking # - Tracking # 2
123459789 - 123456789

I already have queries that compare both tables and returns results that do not match. This one needs to be ones that almost match.

Thanks for the help!

Galaxiom · May 10, 2013

You could do a loop through the numbers character by character but that would be relatively slow because it is running on top of a Cartesian product.

How about subtracting one number from the other and counting the occurrence of zeros in the result? Include the leading zeros by formatting back to the original number of digits. It is not a perfect idea because the "borrowing" would potentially mess up the more significant digits but I think it is worth further thought.

Maybe try subtraction in both directions and pick the one with the most zeros.

DQuick · May 10, 2013

That was a bit over my head...

lol

Still researching....

Galaxiom · May 10, 2013

If you subtract one number for the other the result will be zeros in every place except for those where the digits are different. Format the result back to the original length

123459789 - 123456786 = 000003003

Results with a lot of zeros compared to the length indicate a small number of differences and as such are likely targets for being fat-fingered.

It would be best done in a function. I actually have a use for just such a function at work so I will write it next week and post back.

Galaxiom · May 13, 2013

Did a bit of research.

Try this:
http://bytes.com/topic/access/insights/909002-levenshtein-approximate-string-matching

DQuick · May 13, 2013

Excellent!
Ill see if I can somehow adapt that to work the way I need it to.

Would I take that code and create a module? (sorry I am still pretty new to VBA)

Galaxiom · May 14, 2013

Put the code in a Standard Module. It will then be available as a function that will return a score. The lower the score the more similar the two strings it has compared.

Use it in a query that compares the two numbers as a derived field.

In the query designer the derived field will be:

MatchRating: Levenshtein([field1], [field2])

Galaxiom · May 14, 2013

There are some tyopographical errors in the code provided on that page. The syntax is for VBScrpt too

Here is the corrected code for VBA.

Code:

Function Levenshtein(ByVal str1 As String, ByVal str2 As String) As Integer
 
Dim arrLev() As Integer
Dim intLen1 As Integer
Dim intLen2 As Integer
Dim i As Integer
Dim j As Integer
Dim arrStr1() As Integer
Dim arrStr2() As Integer
Dim intMini As Integer
 
    intLen1 = Len(str1)
    ReDim arrStr1(intLen1 + 1)
    intLen2 = Len(str2)
    ReDim arrStr2(intLen2 + 1)
    ReDim arrLev(intLen1 + 1, intLen2 + 1)
 
    arrLev(0, 0) = 0
    For i = 1 To intLen1
        arrLev(i, 0) = i
        arrStr1(i) = Mid(str1, i, 1)
    Next
 
    For j = 1 To intLen2
        arrLev(0, j) = j
        arrStr2(j) = Mid(str2, j, 1)
    Next
 
    For j = 1 To intLen2
        For i = 1 To intLen1
            If arrStr1(i) = arrStr2(j) Then
                arrLev(i, j) = arrLev(i - 1, j - 1)
            Else
                intMini = arrLev(i - 1, j) 'deletion
                If intMini > arrLev(i, j - 1) Then intMini = arrLev(i, j - 1) 'insertion
                If intMini > arrLev(i - 1, j - 1) Then intMini = arrLev(i - 1, j - 1) 'deletion
 
                arrLev(i, j) = intMini + 1
            End If
        Next
    Next
 
    Levenshtein = arrLev(intLen1, intLen2)
End Function

Galaxiom · May 14, 2013

I have been trying to get the Damerau-Levenshtein function (further down the linked page) working. It has the advantage of heavier weighting on transposed characters by counting them as one change instead of two.

Unfortunately it has bigger problems. I can't work out what intSize argument is suppoed to indicate and I get subscript out of range errors.

Once I get my head around how it works I will post the corrected code and add some comments to explain how it works.

These are very useful functions.

Galaxiom · May 14, 2013

Found another Damerau-Levenshtein function. This one is more sophisticated because it can weight the scores on different types of mismatches.

http://stackoverflow.com/questions/13693149/weighted-damerau-levenshtein-in-vba

This original used some Excel only functions so needed some modification for Access. I have added the MinOf and MaxOf functions. (Names changed to avoid confusion with the Min and Max function in SQL.)

I also changed the weigthting variables to constants and moved the Dim out of the loops where they would otherwise just waste time.

I have not looked at the code logic but it seems to work very well.

Code:

Public Function WeightedDL(ByVal source As String, ByVal target As String) As Double
 
Const deleteCost = 1
Const insertCost = 1.1
Const replaceCost = 1.1
Const swapCost = 1.2
 
Dim i As Integer
Dim j As Integer
Dim k As Integer
 
Dim deleteDistance As Double
Dim insertDistance As Double
Dim matchDistance As Double
Dim maxSourceLetterMatchIndex As Integer
Dim table() As Double
Dim sourceIndexByCharacter() As Variant
Dim candidateSwapIndex As Integer
Dim jSwap As Integer
Dim swapDistance As Double
Dim iSwap As Integer
Dim preSwapCost As Double
 
    If Len(source) = 0 Then
        WeightedDL = Len(target) * insertCost
        Exit Function
    End If
 
    If Len(target) = 0 Then
        WeightedDL = Len(source) * deleteCost
        Exit Function
    End If
 
    ReDim table(Len(source), Len(target))
    ReDim sourceIndexByCharacter(0 To 1, 0 To Len(source) - 1)
 
    If Left(source, 1) <> Left(target, 1) Then
        table(0, 0) = MinOf(replaceCost, (deleteCost + insertCost))
    End If
 
    sourceIndexByCharacter(0, 0) = Left(source, 1)
    sourceIndexByCharacter(1, 0) = 0
 
    For i = 1 To Len(source) - 1
 
        deleteDistance = table(i - 1, 0) + deleteCost
        insertDistance = ((i + 1) * deleteCost) + insertCost
 
        If Mid(source, i + 1, 1) = Left(target, 1) Then
            matchDistance = (i * deleteCost) + 0
        Else
            matchDistance = (i * deleteCost) + replaceCost
        End If
 
        table(i, 0) = MinOf(MinOf(deleteDistance, insertDistance), matchDistance)
    Next
 
    For j = 1 To Len(target) - 1
 
        deleteDistance = table(0, j - 1) + insertCost
        insertDistance = ((j + 1) * insertCost) + deleteCost
 
        If Left(source, 1) = Mid(target, j + 1, 1) Then
            matchDistance = (j * insertCost) + 0
        Else
            matchDistance = (j * insertCost) + replaceCost
        End If
 
        table(0, j) = MinOf(MinOf(deleteDistance, insertDistance), matchDistance)
    Next
 
    For i = 1 To Len(source) - 1
 
        If Mid(source, i + 1, 1) = Left(target, 1) Then
            maxSourceLetterMatchIndex = 0
        Else
            maxSourceLetterMatchIndex = -1
        End If
 
        For j = 1 To Len(target) - 1
            candidateSwapIndex = -1
 
            For k = 0 To UBound(sourceIndexByCharacter, 2)
 
                If sourceIndexByCharacter(0, k) = Mid(target, j + 1, 1) Then candidateSwapIndex = sourceIndexByCharacter(1, k)
            Next
 
            jSwap = maxSourceLetterMatchIndex
            deleteDistance = table(i - 1, j) + deleteCost
            insertDistance = table(i, j - 1) + insertCost
            matchDistance = table(i - 1, j - 1)
 
            If Mid(source, i + 1, 1) <> Mid(target, j + 1, 1) Then
                matchDistance = matchDistance + replaceCost
            Else
                maxSourceLetterMatchIndex = j
            End If
 
            If candidateSwapIndex <> -1 And jSwap <> -1 Then
                iSwap = candidateSwapIndex
 
                If iSwap = 0 And jSwap = 0 Then
                    preSwapCost = 0
                Else
                    preSwapCost = table(MaxOf(0, iSwap - 1), MaxOf(0, jSwap - 1))
                End If
 
                swapDistance = preSwapCost + ((i - iSwap - 1) * deleteCost) + ((j - jSwap - 1) * insertCost) + swapCost
 
            Else
                swapDistance = 500
            End If
 
            table(i, j) = MinOf(MinOf(MinOf(deleteDistance, insertDistance), matchDistance), swapDistance)
 
        Next
 
        sourceIndexByCharacter(0, i) = Mid(source, i + 1, 1)
        sourceIndexByCharacter(1, i) = i
    Next
 
    WeightedDL = table(Len(source) - 1, Len(target) - 1)
 
End Function
 
Public Function MinOf(ByVal Value1 As Double, ByVal Value2 As Double) As Double
 
    If Value1 > Value2 Then
        MinOf = Value2
    Else
        MinOf = Value1
    End If
 
End Function
 
Public Function MaxOf(ByVal Value1 As Double, ByVal Value2 As Double) As Double
 
    If Value1 < Value2 Then
        MaxOf = Value2
    Else
        MaxOf = Value1
    End If
 
End Function

Beetle · May 14, 2013

Thanks Galaxiom. This might be a good one for the Code Repository.

Galaxiom · May 14, 2013

Beetle said:
Thanks Galaxiom. This might be a good one for the Code Repository.

I would take a closer look at it first to see if can be optimised. I have only changed the obvious.

The poster converted the concept from some java code they found and said they were not well experienced in coding. None-the-less they have done well to convert it.

I would also like to add some documentation comments.

Would be good to get a bit of operational experience too. I am about to start using it on our archivng database to try and pick up incorrectly indexed documents.

I also thought of weighting in favour of keys that are adjacent on the keyboard. However the potential increase in processing time might outweigh the advantages.

Galaxiom · May 15, 2013

Now that I have started using the function I can report it is fantastically useful.

Our archive database records name (as Surname Initials) and associated member number to be entered independently. We get a lot of name variants depending on the number of initials, maiden name, documents in joint names etc. The variants are easily resolved looking back from the number but problems when neither the number nor name match due to variants in the name and typographical errors in the number.

I had previously done queries that matched the whole name field and compared the associated number. I would look through the results and I could spot the likely ones. The algorithm didn't find any more.

However I had also written a query that matched on surname alone and compared the number. This produced 360K results making it quite impractical to resolve manually.

Now I can rank the difference in the numbers with the algorithm I have turned up many genuine errors.

Once I work through this I will try a join on the low difference partial match of the numbers and measure the difference between the name fields.

I have long intended to come up with this kind of solution. Thanks to the push from DQuick's question. Actually I am glad I didn't start sooner since the WeightedDL VBA function was only posted last December.

Every DB designer should have this function in their toolbox.

Lightwave · Nov 17, 2016

Ace G - will try and get my head around this

Cowboy_BeBa · May 22, 2017

can anyone explain how i can put the function (from post #10) to use?

have two tables (tblMasterList and tblDatalists), all have exactly identical fields (did this on purpose when i was compiling tblDataLists) to make sure it would be easier to merge the two lists when the time came (tblMasterList has approx 30,000 records and needs to be updated with data from tblDataLists)

now some of my records in tblDataLists have an ID number (schID) that will link with the data in tblMasterList, so that part is easy, however most of the records in tblDataLists do not have an ID and there are very few names (one of the variables in both tables "fullNames") that match exactly (many of the records were mistyped when the data was collected).

Am hoping i can use this function to find similar names with which to match the records.

I consider myself pretty good at VB coding but cannot seem to wrap my head around this code

Galaxiom · May 22, 2017

Basically you do a cartisan product (AKA Cross Join). That is the two tables with no join. Then feed the function with fields for each table that are to be compared. Then order by the function result to get the closest matches.

Unfortunately this means it will be processing a total of records equal to the product of the number of records in the two tables. 30,000 * ? . Anything you can do to pre filter and reduce the number of records for comparison will help.

"Fat Fingering" Query

DQuick

Registered User.

Galaxiom

Super Moderator

DQuick

Registered User.

Galaxiom

Super Moderator

Galaxiom

Super Moderator

DQuick

Registered User.

Galaxiom

Super Moderator

Galaxiom

Super Moderator

Galaxiom

Super Moderator

Galaxiom

Super Moderator

Beetle

Duly Registered Boozer

Galaxiom

Super Moderator

Galaxiom

Super Moderator

Lightwave

Ad astra

Cowboy_BeBa

Registered User.

Galaxiom

Super Moderator

Similar threads

Users who are viewing this thread