Delete Duplicates Query

ChampionDuy

Registered User.
Local time
Today, 08:47
Joined
Mar 14, 2002
Messages
94
What is the best SQL statement to Delete duplicates in a table?
 
Records or fields?

I don't know of an easy way to delete duplicate records but there are basically two methods, the easy way and the hard way.

1. The easy way

Your lucky and the records are genuine duplicates, including id numbers.

Create a query select all fields and click the group by button
Make a new table from it
Change it from a make new table to a delete query and put some critera in ("is not null" or somthing in a field that is populated all throughout the query)
Delete the records
Then simple append the new table back to the original one.

***Always make a back up of your tables/db****


2. The Hard Way

Your duplicates are not identical

(I work alot with address dat so i'll use this an example)
Firstly you need to clean the data as much as possible without comprmising the integrity. This usually means getting rid of all punctuation and unwanted spaces etc firstly Trim(column name) all your columns this gets rid of preceding and trailing spaces.
search for erroneous syntax and remove it from the data.
instr(column, start position) this searches for a particular character in astring of data
left(column, number of charcters) this takes a specified number of characters fro the left
right(column, number of charcters) and this for the right.
len(column)

so in an update query you would update a column (to remove any "-") to

left(col1,(instr(col1,-)-1)) & right(col1,(instr(col1,-)))

You would have to repeat this for every erroneus character.

(I used to use this method alot so built a sort of template that would do alot of the cleaning of common field automatically)

Then you need to make a matchkey which is basically a compbination of fields that if they are all similar then it's likey to be a dupe. so left(add1, 5) & left(postcode, 6) & left(surname, 5)

Then you make a comparison similar to example 1 to look for dupes, except you only select the matchkey field, also in the query create an expresion "count(*)" in another field and instead of the "group by" drop box select "Expresion" this will tell you how many dupes of each you have.

If it doesn't matter which ID number you have then create a new table of the dupe matchkeys (grouped), add a field for id numbers, append the id numbers based on the matchkey, create query with the id's and all the other fields from the orignal table (fromthis point it's the same as example 1) make another table delete the dupes append the new table back.

there easy!!


It's a real pain in the arse but it has to be done, and it's better doing it sooner rather than later, imagine if you have dupes but with valid data in different fields so you can't pick one over the other.

Feel free to post back if you've got any questions, I know I may not have explained everything very well.
 

Users who are viewing this thread

Back
Top Bottom