use Contains rather than Like

megatronixs

Registered User.
Local time
Today, 17:38
Joined
Aug 17, 2012
Messages
719
Hi all,

I needed to create a query that will check in 2 different tables for the first name and surname if there is a match. The trouble is that there could be a typo and a first name or surname would be missing or having to much letters. for example McCanna vs McCana or with the first name John vs Joh.
I can imagine that if I would compare John McCanna in a table where Joh McCanna or John McCanna, I would have troubles getting results.
The below query, does not get me the data, only if they really match.
any way to make it get fixed?
Code:
SELECT DISTINCT [Fund Data].TITLE, [Fund Data].FORENAME, [Fund Data].SURNAME, [Fund Data].PRIMARY_DOB, tbl_exit_notification.Title, tbl_exit_notification.[First Name], tbl_exit_notification.[Surname_Business Name], tbl_exit_notification.[Date of Birth], tbl_exit_notification.[Client ID]
FROM [Fund Data] INNER JOIN tbl_exit_notification ON ([Fund Data].SURNAME = tbl_exit_notification.[Surname_Business Name]) AND ([Fund Data].PRIMARY_DOB = tbl_exit_notification.[Date of Birth])
WHERE ((([Fund Data].SURNAME) Like "*" & [tbl_exit_notification].[Surname_Business Name] & "*") OR ((tbl_exit_notification.[Surname_Business Name]) Like "*" & [Fund Data].[SURNAME] & "*"));

Greetings.
 
Not using straight SQL. That kind of analysis requires a complex program that deals with regular expressions more extensive that I believe is reasonable for standard SQL.

The only way I've seen this done at a low level is to create an "allowable match" table and create a query to match the names in your "real" database tables with the patterns you place in the "acceptable" table.

The overview would be that you would create a table for first names and another for last names. So take the simple example of John. Your "allowed" table might have tuples:

Code:
Main, Alternate
<John>,<John>  - you allow an exact match
<John>,<Jon>
<John>,<Joh>

Then you would write a query that did a

"...WHERE FirstName LIKE *" & Alternate & "* ... "

and would give you the Main name field This is sort of like a reverse JOIN but using WHERE clauses. Once could imagine you would have to compile a list of "allowed" matches for every name.

This would not be trivial but it has a shot.
 
@OP,

Why do you have the person's name in more than one table? And is this for a "One off" conversion or is this going to be used more often?
 
Hi all,

@Mark
we get every month from 2 different systems excel files. We need to compare them both to see if we have in one excel file the same person listed as in the other excel.
of course, there could be a typo in one of the files and we would not be able to catch this.
When I use on the table straight the filter and use CONTAINS, it will get me the same surname if it would be a typo (that is if I would type the surname inside the filter field)

I was reading somewhere about the fact that Like * McManna * and Like *McManna* would give different results (being the last the one that will not fail).
The thing is that I use the expression to feed the query:
Code:
((([Fund Data].SURNAME) Like "*" & [tbl_exit_notification].[Surname_Business Name] & "*")
Maybe this can be adjusted?
Greetings.
 
The best way to test for typographical errors is the Damereau-Levensthein Distance.

See this thread for a function and an example of how to use it.
 
I've used the double metaphone method (similar to Soundex). Basically used for looking for potential duplicates in large datasets, identifying companies in the same group (one case I found 21 different ways of spelling McDonalds out of 50 of different addresses) and for conference checkin where delegates state their name, sometimes not very clearly,the spelling isn't obvious or the delegate is foreign
 
we get every month from 2 different systems excel files. We need to compare them both to see if we have in one excel file the same person listed as in the other excel.

As you are comparing data sets coming from different sources, may I suggest doing a flip on your criteria? I would start by the value LEAST likely to be incorrect (hopefully), their date of birth. It is in both tables. For each date, I would show a list side by side to see if you can spot duplicates with different spelling of surname/first name.

As this is done monthly I am hoping this ins't tens of thousands of records. Still, checking by name can easily miss when one person goes by different names (First name / Last with one company, First Initial MIDDLE NAME, Last at the other) with your different list generators.

I would ALSO do the same type of checking the other's have recommended, but I would by preference do this AFTER catching as many as I can by date of birth.
 
Here's another technique you can try, which just counts the letters in the two strings to be compared, returns 100% if the two counts are the same, and returns values diminishing to zero as the characters and counts diverge.
Code:
[SIZE="1"]Private Sub TestIt123()
    Debug.Print "1) " & CompareAlphaVectorStrings("miles goodwin", "myles godwine"), "Not bad"
    Debug.Print "2) " & CompareAlphaVectorStrings("GeorgeBush", "HeBugsGore"), "Fails for anagrams--false positive."
    Debug.Print "3) " & CompareAlphaVectorStrings("MacDonald", "Donaldson"), "Not bad"
    Debug.Print "4) " & CompareAlphaVectorStrings("Jones", "Smith"), "expected to be a poor match"
End Sub

Function GetCharacterArray(Text As String) As String()
[COLOR="Green"]'   This function returns a string array, one character per array element,
'   of each character in the parameter Text.[/COLOR]
    Dim i As Integer
    ReDim tmp(Len(Text) - 1) As String
    
    For i = 0 To Len(Text) - 1
        tmp(i) = Mid$(Text, i + 1, 1)
    Next
    GetCharacterArray = tmp
End Function

Function GetAlphaVector(Text As String) As Integer()
[COLOR="green"]'   This function returns a 26 element integer array, one element for each
'   letter in the alphabet.  Each indexed location stores a count of the letters
'   in the parameter Text, so if we passed in the word "Aardvark", then
'   GetAlphaVector(0) would equal 3, representing the three letters "a".
'   This function is not case sensitive.[/COLOR]
    Dim tmp(25) As Integer
    Dim var
    Dim i As Integer
    
    For Each var In GetCharacterArray(Text)
        i = Asc(UCase(var)) - Asc("A")
        If i >= 0 And i <= 25 Then tmp(i) = tmp(i) + 1
    Next
    GetAlphaVector = tmp
End Function

Function CompareAlphaVectorStrings(s1 As String, s2 As String) As Single
[COLOR="green"]'   This function compares two AlphaVectors and returns a number between
'   zero and one representing how well the two sets of characters match.
'   This algorith matches anagrams 100%, so "GeorgeBush" and "HeBugsGore"
'   will match because they contain exactly the same letters.[/COLOR]
    Dim i As Integer
    Dim sum As Single
    Dim avg As Single
    Dim v1, v2
    
    v1 = GetAlphaVector(s1)
    v2 = GetAlphaVector(s2)
    
    For i = 0 To 25
[COLOR="green"]        ' see if both vectors contain this char[/COLOR]
        If v1(i) > 0 And v2(i) > 0 Then
[COLOR="green"]            'if so, we try to deal with how many, using an Avg()
            'doing it this way, the word "Test", compared to itself, will score 4
            'rather than 3 (ignoring the duplicate "T")[/COLOR]
            sum = sum + (v1(i) + v2(i)) / 2
        End If
    Next
[COLOR="green"]    'now we divide the matching chars by the average length of the two,
    'for an estimate of percent match[/COLOR]
    avg = (Len(s1) + Len(s2)) / 2
    If avg > 0 Then CompareAlphaVectorStrings = Round(sum / avg, 2)
End Function[/SIZE]
I developed this out some kind of artificial intelligence text I was reading at the time, but I never really used it for string matching, but for some data sets it might yield superior results. Because it doesn't consider the position of characters in the string, longer strings will tend to share more characters, and you'll get more false positives. Anyway, thought it was worth posting...
Mark
 
Hi all,

I made some progress. The only thing that now is making it not work for 100% (only 90%) is one tiny part. It has to do with a Chinese name. on one table it is as "Meiguang" and on the other table as "Mei". both have the same date of birth.
The below SQL finds all that I need, but not this one Mei or Meiguang.

Any ideas? I guess I need to check on the 3 first characters too, correct?
Code:
SELECT DISTINCT [Fund Data].TITLE, [Fund Data].FORENAME, [Fund Data].SURNAME, [Fund Data].PRIMARY_DOB, tbl_exit_notification.Title, tbl_exit_notification.[First Name], tbl_exit_notification.[Surname_Business Name], tbl_exit_notification.[Date of Birth], tbl_exit_notification.[Client ID]
FROM tbl_exit_notification INNER JOIN [Fund Data] ON tbl_exit_notification.[Date of Birth] = [Fund Data].PRIMARY_DOB
WHERE ((([Fund Data].SURNAME) Like "*" & [tbl_exit_notification].[Surname_Business Name] & "*")) OR ((([Fund Data].FORENAME) Like "*" & [tbl_exit_notification].[First Name] & "*")) OR ((([Fund Data].FORENAME)=[tbl_exit_notification].[First Name])) OR (((tbl_exit_notification.[First Name]) Like "*" & [Fund Data].[FORENAME] & "*")) OR (((tbl_exit_notification.[First Name])=[Fund Data].[FORENAME]));

Greetings.
 
You do realize you will never be 100% with these types of data scrubs. If Meiguang goes by Mei in one location, any chance they go by a "local" name elsewhere?

One of my coworkers is named "Wai-Leung", but he goes by "Sam". The only way you would find him correctly would be by surname and date of birth. Likewise I used to work for a fellow who would go by either "David", "D. Keith", or "Keith".
 
If you know that the Fund Data is always going to have the same or less number of chars, you could do something like this in your WHERE clause. Otherwise, you could try your suggestion and hard code a certain number of chars.

This will get the number of chars in SurName and use that to limit the number of chars in SurNameBusName, so in the example it would say to get the left 3 chars of SurNameBusName and see if it matches SurName.

Code:
OR (((FundData.SurName)=Left([tblexitnotif].[SurNameBusName],Len([funddata].[SurName]))))
 
Following on from previous comments, add a unique ID field to your table and set to not allow duplicates
For example, this could be the first two letters of forename followed by first letter of surname.
For example Tom Jones = TOJ & Tina James = TIJ
Occasionally you many need to modify to prevent repetition e.g. Tim Johnson could be TMJ

Populate the field for all users THEN make it the primary key field
Also add it a a FK field to other tables based on the same people

Then you'll never need to do this again
 
megatronix,

What sort of volumes and frequency of such matching do you do/need?

Do you deal with an AccountNumber or CustomerNumber, or are you moving is such a direction?

I recall an application involving ~50000+ companies with information coming from several sources to be vetted.
The best we did was comparing fields such as

-Legal Name
-Operating Name
-Location Address
-Mailing Address
-Business Phone
-email

and used some weighting factors to get a number. It still required considerable manual "eyeballs" to vet the data.

As others have said - you're not going to get 100%.
 

Users who are viewing this thread

Back
Top Bottom