Query to find a match when words in a different order (1 Viewer)

AndrewS

Registered User.
Local time
Today, 19:00
Joined
Feb 21, 2017
Messages
30
I'm trying to implement duplicate checking for importing data from an external system: basically querying the imported data against existing data using Allen Browne's Soundex and (well, or) simple like, but there's one thing that avoids detection by either: if the words are there but in a different order.

E.g. London University and University of London
E.g. Ambridge Organics and Organics of Ambridge

Anyone got any ideas how to set criteria to match London University with University of London in a query?

The one thing that's occurred to me (as I started typing this post!) is to spit the data into separate words (ignoring ofs, thes, ands), and then build the query, so I'd end up with something like (pseudo code)
Code:
SELECT tbl1.field1, tbl1.field2, tbl2.field1, tbl2.field2
FROM tbl1, tbl2
WHERE soundex(tbl1.field2) = soundex(tble2.field2) 
OR tbl1.field2 LIKE tbl2.field2
OR (tbl1.field2 LIKE *word1*
      AND
      tbl1.field2 LIKE *word2*
      AND
      tbl1.field2 LIKE *word3*

I've not tried that yet, but I'll still post this now, in case anyone has a more elegant solution.

Thanks!
 

jdraw

Super Moderator
Staff member
Local time
Today, 14:00
Joined
Jan 23, 2006
Messages
13,918
Yes, your approach(while typing) is akin to general concept.
You will see articles involving keywords, synonyms, stems and noise words.

If words are in a different order, you can capture the "keywords and sort them".
Noise words are the "and, of, the, Ltd, in, for,..etc..". These are very common terms.

You may get more focused responses/links if you describe your situation with volumes, final usage(s) .
Good luck with your project.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 14:00
Joined
Feb 19, 2002
Messages
33,682
I don't want to discourage you but Google and other search engines devote millions of dollars and programming hours to tweaking algorithms to do what you are asking for. How important is finding the "duplicates"? What is the downside (risk) of not finding them? I would try to find something already developed rather than spending a lot of time on something that will be weak at best.
 

AndrewS

Registered User.
Local time
Today, 19:00
Joined
Feb 21, 2017
Messages
30
Thanks both,

I'll progress down this route.

Pat, Ha! Yes, you're right of course. This is only a relatively small system though - essentially a basic CRM system, to allow tracking of interaction with businesses and their employees. This particular process I'm working on is for importing attendance lists on workshops from Eventbrite - typically no more than 20 per event, and will display possible matches for the user to identify as matches or not.
 

Users who are viewing this thread

Top Bottom