Broad question about de-duping names & address

dewsbury · Sep 2, 2008

Hi,

I know there are various software packages to assist in de-duplicating and cleansing names addresses.

Aside from this has any body got ideas about how to do this for about 100,000 names & addresses in pretty inconsistent state.

Obviously I need more than just a direct comparison of one field with another.

All comments welcome.

MagicMan · Sep 2, 2008

To do a valid duplicate remove of names, you need 2 items, Postal Code/Zip code that norrows the possible duplicate to one city block and birth date to confirm duplicate person. Be carefull, my son is Dave Edward Harrison Brown, and I am Dave Edward Brown and we both live at the same address with the same phone number, and we both buy a lot, don't drop either one of us, or you lose!
You can sort by postal code, birth date, then fuzzy match on name, or present to a form the hits for further analysis. A family with triples means, you need to find differences on first given name, or if first given name matches, second given name to remove. Donna Grace Brown, Donna Kayla Brown, Donna Susan Brown live here and are triplets.
Oh, by the way. When I had a contract with a big org, I had an appartment in City A (worked 3 days a week there), another apartment in City B (Head Office 2 days a week), and my Home address, and my Cottage address, and my summer retreat address. So postal code does not always find duplicates!

georgedwilkinson · Sep 2, 2008

You might look into using a Jaro-Winkler algorithm to do a fuzzy match. This algorithm was developed by two eggheads at the US Census Bureau for just this purpose. Obvioulsy, a human should review the results.

Also, one of our long-time forum members, Mike375, works with mailing lists and has some practical hands-on experience. I don't understand his methods but it is obvious he gets the results his business needs.

MagicMan · Sep 2, 2008

georgedwilkinson said:
he gets the results his business needs.

George is right on, the process is extremely complex, but may be simple depending on the results that you need for your specific application.

dewsbury · Sep 2, 2008

Thanks for your responses.

I am working for a large financial institution who want to do a mailshot to clients.
We do not have post codes in Ireland- which makes a big difference.

I will research the algorithm mentioned.

I was also considering looking for the word "county" and "co." and validating the following data against a county in Ireland.
I was thinking too about removing all spaces , commas, apostrophes.
Also removing words such as "the" "and" "of" ??
I also think that removing vowels might give some interesting results.

dewsbury · Sep 2, 2008

As suggested by the above post I am checking out the Jaro Winkler algorithm. I read about it on Wiki and it seems to be worth a try.

Does anybody know where I could download the algorithm or similar?

Thanks.

MagicMan · Sep 2, 2008

If it is soley a promotion, then that makes it simple. Exact matches on the address get removed. If it is an offer to a person for a credit card for example, then you are back to finding the criteria that satifies the institutions requirements (age, name, address all unique). You are on the correct path, find criteria that creates a smaller set of data. On a large database, this should be done in multiple passes. 1st pass is to create a new table where the fuzzy last name (1st 3 characters of the last name and the 1st char of the first name) and county match. That should create a work table that is about 20% the size of the original table. 2nd pass is to remove extract matches from the work table to your confirmed duplicates table, deleting them from your work table. 3rd pass is to create a new work table containing a new field with county (I am assuming the county will always be correctly spelled) strung together with the last name (vowels removed), 1st name initial and (if you have) birth date.
Sort and extract dupicated to your confirmed duplicates table and delete from the work table. If you are down to 1000 records or less, it is time to create a form to present the remaining data. A user can check mark the records presented that appear to be duplicates. These get extracted to your confirmed duplicates file.
Last step, use the confirmed duplicates file in a join when extracting the mailing list to eliminate those confirmed duplicates.

Jaro Winkler code thread can be found here:
http://forums.devshed.com/software-design-43/jaro-winkler-matching-algorithm-62100.html

georgedwilkinson · Sep 2, 2008

dewsbury said:
As suggested by the above post I am checking out the Jaro Winkler algorithm. I read about it on Wiki and it seems to be worth a try.

Does anybody know where I could download the algorithm or similar?

Thanks.

PM me with your e-mail address and I'll send you mine. Not sure about the copyright so I can't post it here.

Broad question about de-duping names & address

dewsbury

Registered User.

MagicMan

Registered User.

georgedwilkinson

AWF VIP

MagicMan

Registered User.

dewsbury

Registered User.

dewsbury

Registered User.

MagicMan

Registered User.

georgedwilkinson

AWF VIP

Similar threads

Users who are viewing this thread