View Full Version : Find duplicates based in a few characters


mkelly
09-23-2008, 09:03 AM
I have two tables I deleted the duplicates from each then did a union query based on name.
However the two tables named things different like vermont college vs vermont college inc.
Can I do a find duplicates based on sat the first 5 characters?
If so how.

Thanks in advance

georgedwilkinson
09-23-2008, 09:15 AM
This is a difficult problem. There are a couple of tools that can help but there are few perfect solutions.

I personally use a Jaro-Winkler algorithm for exactly your problem. It doesn't give a black/white answer...more shades of grey. It takes a human to evaluate the answer but it definitely culls the results down to a managable level.

You can also look at Soundex(). Same problem, you won't get an exact answer and will have a lot of mis-matches.

You can also compare using left(), which is in the spirit of your proposed solution. It can still give you an inaccurate reading, however.

mkelly
09-23-2008, 09:20 AM
Where would I use the left() command??

SELECT First([union duplicates].name) AS [name Field], Count([union duplicates].name) AS NumberOfDups
FROM [union duplicates]
GROUP BY [union duplicates].name
HAVING (((Count([union duplicates].name))>1));

Thanks

georgedwilkinson
09-23-2008, 09:29 AM
You're not going to be able to do it that way. How would SQL know which of the partial matches to use as [union duplicates].name?

mkelly
09-23-2008, 09:33 AM
If I don't have Jaro or soundex is there any way to do it?

georgedwilkinson
09-23-2008, 09:40 AM
I can provide you with both (just PM me with your personal e-mail address). You'll have to use a different approach, though. They don't work with aggregate functions. Same with left(). This is an insanely difficult problem to overcome using a single standard query.