Using the Simil function for best match

here4real

Registered User.
Local time
Yesterday, 19:22
Joined
May 1, 2013
Messages
87
Simil is downloadable from the reference library and is a GREAT resource for doing a fuzzy match.

My question is: how can I use it to take a field on a record and have it return ONLY 1 record from another table that is the best match? Simil will compare 2 strings, but I want to compare one string to a field in a table.

Thanks.
 

jdraw

Super Moderator
Staff member
Local time
Yesterday, 19:22
Joined
Jan 23, 2006
Messages
15,379
Did you try it?
 

here4real

Yes.

Let me give a little background.

I have a table of good addresses. We can call that table Good.

I have another table of addresses. For each one, I want to do a fuzzy search against Good and return the best match, as long as that match scores at least 95%. Simil in its simplest form doesn't scan a second table to find the best match; it just scores string1 against string2.
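A rough sketch of that "best match above a threshold" idea, written in Python rather than Access. It assumes `difflib.SequenceMatcher` as a stand-in for Simil, so the scores will not necessarily equal what Simil returns. The names `similarity` and `best_match` and the sample addresses are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in for Simil (assumption): a 0..1 score for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(candidate, good_addresses, threshold=0.95):
    """Score candidate against every good address and keep only the top one.

    Returns (address, score), or None when nothing reaches the threshold.
    """
    best, best_score = None, 0.0
    for address in good_addresses:
        score = similarity(candidate, address)
        if score > best_score:
            best, best_score = address, score
    if best is None or best_score < threshold:
        return None
    return best, best_score


# Hypothetical sample data standing in for the Good table.
good = ["15 Main Street", "200 Oak Avenue", "7 Elm Court"]
result = best_match("200 Oak Avenu", good)
```

The point is only that the two-string comparison gets wrapped in a loop (or, in Access, a query) over the second table, and the threshold is applied once the top score is known.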
 

jdraw

And that is how most such functions (Levenshtein distance, etc.) work: they score one string against another.

From experience I can tell you that address matching/validation can be a real PITA.
I honestly don't think you would be happy with Soundex or Levenshtein.
In the end you will likely need someone's eyeballs to do a final check.

I recall someone on another forum looking at fuzzy matching. You can see the dialog for reference.

Good luck with the project. I'd like to hear/see what you do in the end.
 

here4real

Okay. I broke it down into several steps.

1) I do a join for each record against all addresses. In my case, there is a level above address (there is a limited set of addresses based on a different field) so the number of generated records isn't so onerous. I did this as a MakeTable query. Besides the Simil score, I also pull off the number of the address (whatever exists before the first " " or "-") and save that as well.

2) A query against that table that groups by person and returns the highest Simil score for each.

3) A query that joins that table against the query in step 2 matching person, address and score. This is so I can bring in additional data located in the Addresses table. I also check that the number of the address is the same and that the Simil score is over .95. The reason for the number address is because 15 Main Street and 18 Main Street have very high Simil scores but obviously are different. By pulling off the number portion of the addresses I ensure that the high Simil score is in fact the same address.
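The three steps above could be sketched roughly like this in Python. This is not the actual Access queries: `difflib.SequenceMatcher` again stands in for Simil, and the names `match_people`, `address_number`, the `group` field (the "level above address") and the sample rows are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for Simil (assumption): a 0..1 score for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def address_number(address):
    """Return whatever precedes the first space or hyphen."""
    for i, ch in enumerate(address):
        if ch in " -":
            return address[:i]
    return address


def match_people(people, good, threshold=0.95):
    """people: (person, group, address) rows; good: (group, address) rows."""
    # Step 1: score each person's address against every good address
    # in the same group, keeping both address numbers alongside the score.
    scored = [
        (person, g_addr, similarity(addr, g_addr),
         address_number(addr), address_number(g_addr))
        for person, group, addr in people
        for g_group, g_addr in good
        if g_group == group
    ]
    # Step 2: keep the highest-scoring row per person (first one wins on ties).
    best = {}
    for row in scored:
        person, score = row[0], row[2]
        if person not in best or score > best[person][2]:
            best[person] = row
    # Step 3: accept only if the address numbers agree and the score is over the threshold.
    return {
        person: (g_addr, score)
        for person, (_, g_addr, score, num, g_num) in best.items()
        if num == g_num and score > threshold
    }


# Hypothetical sample rows.
people = [
    ("Ann", "Springfield", "15 Main Street Northwest"),
    ("Bob", "Springfield", "18 Main Street Northwest"),
]
good = [
    ("Springfield", "15 Main Street Northwest"),
    ("Shelbyville", "18 Main Street Northwest"),
]
matches = match_people(people, good)
```

With this data, Bob's best candidate in his group scores above the threshold but starts with a different number, so the number check in step 3 drops him. One difference from a SQL join on person and score: tied top scores would produce several rows in the query, whereas this sketch keeps only the first.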

 

here4real

Still tweaking it. Since I am matching on the address number portion, I decreased the Simil score threshold to .65, which caught a LOT more.
 

jdraw

Does it do what you need/want?
Do you still have to do a lot of manual intervention?
If you can get something to isolate groups or patterns, you might be able to focus some code/logic to handle each or several patterns.

Did you look at the other post I mentioned earlier? There are links within it, and there are people with much more math and linguistic talent than mine who do a lot of processing of genome data. It may be too complex for your needs, but some parts of it might be useful.

Good luck.
 
