Duplicate Records Search

chelm

Duplicate/Similar Records Search

What I need to create is a query that will find duplicates, with a bit of a twist. The field to search on is a string field, so one record may say "A Science Journal", a second "Science Journal", a third "An Science Journal", and a fourth "The Science Journal".

I'd like to find a way to run a loop with a Like or contains statement in a query to help narrow down the results that need checking. Currently there are about 40K titles to search, and I need a way to seriously reduce the number that have to be checked.

Any help is greatly appreciated.

Edit:
So, for example, can this query be modified to use a Like statement?
Code:
SELECT tblCitationTitles.[CitationTitle], tblCitationTitles.[CitationTitleID]
FROM tblCitationTitles
WHERE tblCitationTitles.[CitationTitle] In
    (SELECT [CitationTitle] FROM [tblCitationTitles] AS Tmp
     GROUP BY [CitationTitle] HAVING Count(*) > 1)
ORDER BY tblCitationTitles.[CitationTitle];
I've tried modifying it with a Like, but I get an error stating that the subquery can return at most one record.
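For reference, the closest I can picture is replacing the subquery with a self-join that pairs titles where one contains the other - untested as written, and I expect a full cross-compare of 40K titles to be slow:
Code:
SELECT a.[CitationTitle] AS Title1, a.[CitationTitleID] AS ID1,
       b.[CitationTitle] AS Title2, b.[CitationTitleID] AS ID2
FROM tblCitationTitles AS a, tblCitationTitles AS b
WHERE a.[CitationTitleID] < b.[CitationTitleID]
  AND (a.[CitationTitle] Like "*" & b.[CitationTitle] & "*"
       OR b.[CitationTitle] Like "*" & a.[CitationTitle] & "*");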
 
You need to use a distance algorithm to do this.

My favorite is Jaro-Winkler. It was written by employees at the US Census Bureau. I believe there is an implementation of Jaro-Winkler built into SQL Server 2005 and the newer versions of Oracle, though I ended up hacking one together from different versions in different languages found around the Internet.

The Jaro algorithm compares two strings character by character and gives a likelihood that they are equal. The score is based on how many characters the strings have in common (within a limited distance of each other) and how many of those common characters are transposed; the Winkler variant adds a bonus for a matching prefix.

I've found that a Jaro-Winkler score of .8 or greater is incredibly accurate at finding company names that are similar; I use a score of .75 for city names. After I had compared all of our 125K sites to each other and got a smaller sub-set, we hired a human to validate the results.

Please be aware that distance algorithms are incredibly CPU intensive. I eventually had to do a straight equality compare on one of my fields to get the results to come out fast enough to keep a human busy validating them. I was also unable to use the computer doing the work for anything else while it ran.

Try a Google search on Jaro-Winkler and you can probably pick up some source code.
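For instance, the straight equality pre-filter I mentioned looks something like this (the table and field names here are invented for the example, and JaroWinkler stands for whatever VBA implementation of the algorithm you end up with) - only pairs that already agree exactly on City ever reach the expensive comparison:
Code:
SELECT a.SiteID AS ID1, a.CompanyName AS Name1,
       b.SiteID AS ID2, b.CompanyName AS Name2
FROM Sites AS a INNER JOIN Sites AS b ON a.City = b.City
WHERE a.SiteID < b.SiteID
  AND JaroWinkler(a.CompanyName, b.CompanyName) >= 0.8;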
 
So then there isn't any way to do this within Access? Or can this algorithm be implemented in VBA?
 
If you're facing a performance problem, perhaps you can address it another way instead of reducing the checks? I mean, have you tried eliminating the IN keyword, which is usually slow?

I haven't tried this, but maybe you could put the results of that subquery in a temp table and then use a LEFT JOIN (with an Is NULL in the WHERE clause) against the temp table as the means of doing the check? Temp tables are usually faster than subqueries.
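Something along these lines, perhaps (untested, and tmpDupTitles is just a name I made up). One caveat: to reproduce the IN check you'd actually want an INNER JOIN against the temp table - the LEFT JOIN / Is Null pattern gives you the rows that are NOT in it:
Code:
SELECT CitationTitle INTO tmpDupTitles
FROM tblCitationTitles
GROUP BY CitationTitle
HAVING Count(*) > 1;

SELECT t.CitationTitle, t.CitationTitleID
FROM tblCitationTitles AS t INNER JOIN tmpDupTitles AS d
    ON t.CitationTitle = d.CitationTitle
ORDER BY t.CitationTitle;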
 

Ok, I just tried something similar on one of my tables. I was able to eliminate the IN keyword and, in so doing, got good performance. I didn't do it in terms of a Delete, but rather a Select. What I do is: (1) select the ID of each unique record from the main table into a temp table called UniqueLines (I should have called it UniqueIDs); (2) then do another select which inner joins these unique IDs back to the original table, again putting the results into a temp table. This second temp table is now the new main table and should be free of dups (hope I did this right). In this experiment I used zip codes (a Text field), meaning any record that uses the same zip code as another record is a dup record. Here are the two queries (this assumes the main table has an AutoNumber field; in my case it's called ContactID):

Code:
SELECT Min(ContactID) AS ID INTO UniqueLines
FROM Contacts
GROUP BY PrimaryZip
HAVING Count(PrimaryZip) > 1;

SELECT C.* INTO TableWithoutDups
FROM Contacts AS C INNER JOIN UniqueLines AS U ON C.ContactID = U.ID;
 
By the way, my table has 40,000 records.
 

Chelm, you can write a function in VBA to do a distance compare. I got the code, as I said, from a Google search and modified it to suit my needs.
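Here's the general shape of it in VBA - reconstructed from memory rather than lifted from my production code, so treat it as a sketch and test it against known string pairs before trusting the scores:
Code:
Public Function JaroWinkler(ByVal s1 As String, ByVal s2 As String) As Double
    Dim len1 As Long, len2 As Long
    len1 = Len(s1): len2 = Len(s2)
    If len1 = 0 Or len2 = 0 Then Exit Function  ' score 0 for empty strings

    ' Characters count as a match if equal and within this range of each other
    Dim searchRange As Long
    searchRange = Int(IIf(len1 > len2, len1, len2) / 2) - 1
    If searchRange < 0 Then searchRange = 0

    Dim match1() As Boolean, match2() As Boolean
    ReDim match1(1 To len1)
    ReDim match2(1 To len2)

    Dim i As Long, j As Long, m As Long
    For i = 1 To len1
        For j = IIf(i - searchRange < 1, 1, i - searchRange) To _
                IIf(i + searchRange > len2, len2, i + searchRange)
            If Not match2(j) Then
                If Mid$(s1, i, 1) = Mid$(s2, j, 1) Then
                    match1(i) = True
                    match2(j) = True
                    m = m + 1
                    Exit For
                End If
            End If
        Next j
    Next i
    If m = 0 Then Exit Function

    ' Half-transpositions: matched characters that appear in a different order
    Dim t As Double, k As Long
    k = 1
    For i = 1 To len1
        If match1(i) Then
            Do While Not match2(k)
                k = k + 1
            Loop
            If Mid$(s1, i, 1) <> Mid$(s2, k, 1) Then t = t + 0.5
            k = k + 1
        End If
    Next i

    Dim jaro As Double
    jaro = (m / len1 + m / len2 + (m - t) / m) / 3

    ' Winkler bonus: boost the score for a common prefix (up to 4 characters)
    Dim prefix As Long
    Do While prefix < 4 And prefix < len1 And prefix < len2
        If Mid$(s1, prefix + 1, 1) <> Mid$(s2, prefix + 1, 1) Then Exit Do
        prefix = prefix + 1
    Loop

    JaroWinkler = jaro + prefix * 0.1 * (1 - jaro)
End Function
Once it's in a standard module you can call it straight from a query, e.g. JaroWinkler([Title1], [Title2]) >= 0.8, but as I said above, pre-filter the candidate pairs first or it will run forever.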
 
Oh, darn, I think the HAVING clause should be removed from my first query - with it, any zip that appears only once never makes it into UniqueLines, so those records would be missing from the final table.
 
Hi, I had a spare few hours and enough curiosity, so I've attached how I would do it, which is to create a new table that stores each word of each title in a separate row, so that a query can be used to compare how many words match between different titles.
Common words ("and", "the", etc.) and punctuation are removed before the query is run.
The query qryShowSimilar displays those titles whose words match 60% of another title's words, with a word-count difference of no more than 2 - as an example.
Less sophisticated than Jaro-Winkler, but it does a similar job.
If you want to try it, I would attach your titles table to this database and modify the code/queries to suit your field names.
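The heart of it is a self-join on the word table. If that table were called tblTitleWords with fields TitleID and Word (the attachment uses its own names), the match count would come from something like:
Code:
SELECT a.TitleID AS Title1, b.TitleID AS Title2, Count(*) AS MatchingWords
FROM tblTitleWords AS a INNER JOIN tblTitleWords AS b ON a.Word = b.Word
WHERE a.TitleID < b.TitleID
GROUP BY a.TitleID, b.TitleID;
qryShowSimilar then compares MatchingWords against each title's total word count to apply the 60% rule.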
 
