Removing duplicates

ozdemirozgur

Registered User.
Local time
Today, 22:55
Joined
Apr 29, 2011
Messages
47
Hi,

I have a table of employees. It has 5 fields which are name, fname, dob, dateofplace, numberofdays (holidays taken since beginning of the year.) Table has more than 50,000 records and most of them are dublicates.

Dublicates are all the fields except the numberofdays. I would like to remove the dublicates when Name, Fname, DOB, DateOfPlace, are the same but with the maximum value of NumberOfHolidays. Please can someone tell me what is the easiest way of getting this done with minimal code.

example,

John, Green, 02/05/1977, London, 23 - Delete
John, Green, 02/05/1977, London, 9 - Delete
John, Green, 02/05/1977, London, 27 - Keep
John, Green, 02/05/1977, London, 12 - Delete

James Lour, 01/01/1977, Paris, 25 - Keep
James Lour, 01/01/1977, Paris, 12 - Delete
James Lour, 01/01/1977, Paris, 19 - Delete






thanks, Ozgur
 
Re: Removing dublicates

I think this should do it:


Code:
Dim rs1 As DAO.Recordset
Set rs1 = CurrentDb.OpenRecordset("SELECT DISTINCT Name, FName, DOB, DateOfPlace FROM Employees", dbOpenSnapshot)
With rs1
    If .RecordCount > O Then
        Dim rs2 As DAO.Recordset
        Dim i, j As Long
        .MoveFirst
        Do While Not .EOF
            Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE Name = " & Chr(34) & .Fields("Name") & Chr(34) & " AND FName = " & Chr(34) & .Fields("FName") & Chr(34) & " AND DOB = #" & .Fields("DOB") & "# AND DateOfPlace = " & Chr(34) & .Fields("DateOfPlace") & Chr(34) & " ORDER BY NumberOfHolidays", dbOpenDynaset)
            If rs2.RecordCount > 1 Then
                i = rs2.RecordCount - 1
                rs2.MoveFirst
                For j = 1 to i
                    rs2.Delete
                Next j
            End If
            rs2.Close
            .MoveNext
        Loop
    End If
    .Close
End With

Of course, test it on a copy of the database until you're certain it's working
 
Hi,

thanks for your reply, but I got a type mismatch error on the line

Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE Name = " & Chr(34) & .Fields

thanks, ozgur
 
Obviously that's not the full line.

Here is the line in full (with line continuation to make it easier to read):

Code:
Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE" _
        & " Name = "  & Chr(34) & .Fields("Name") & Chr(34) _
        & " AND FName = " & Chr(34) & .Fields("FName") & Chr(34) _
        & " AND DOB = #"  & .Fields("DOB") & "#" _
        & " AND DateOfPlace = "  & Chr(34) & .Fields("DateOfPlace") & Chr(34) _
        & " ORDER BY NumberOfHolidays", dbOpenDynaset)
If there's a chance some fields might be null then it would have to grow more to this:
Code:
Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE" _
        & IIf(IsNull(.Fields("Name"))," Name Is Null", " Name = "  & Chr(34) & .Fields("Name") & Chr(34)) _
        & " AND " & IIf(IsNull(.Fields("FName")), "FName Is Null", "FName = " & Chr(34) & .Fields("FName") & Chr(34)) _
        & " AND " & IIf(IsNull(.Fields("DOB")),"DOB Is Null","DOB = #"  & .Fields("DOB") & "#") _
        & " AND " & IIf(IsNull(.Fields("DateOfPlace")),"DateOfPlace Is Null", "DateOfPlace = "  & Chr(34) & .Fields("DateOfPlace") & Chr(34)) _
        & " ORDER BY NumberOfHolidays", dbOpenDynaset)

I hope you understand how the code should work:

rs1 is a list of distinct people in the table (distinct by all four fields)
It loops through each record in rs1:
For each it opens all records for that person (rs2) ordered by the number of holidays
If there are more than one record for that person - it deletes all but the last

The field DateOfPlace bothers me, should it be PlaceOfBirth?
Or should it actually be a date field?
I took it to correspond with the column London, Paris in the sample data you gave.
 
Last edited:
Hi,

I understood some parts of the code but some of them are difficult for me -as a beginner. I have attached the file and will be glad if you can have a look. I am getting a different error now.

thanks
ozgut
 

Attachments

Yeah, there were a couple of errors.

Here is the code, works on mine:

Code:
Dim rs1 As DAO.Recordset
Set rs1 = CurrentDb.OpenRecordset("SELECT DISTINCT Name, FName, DOB, DateOfPlace FROM Employees", dbOpenSnapshot)
With rs1
    If .RecordCount > 0 Then
        Dim rs2 As DAO.Recordset
        Dim i, j As Long
        .MoveFirst
        Do While Not .EOF
            Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE" _
                    & IIf(IsNull(.Fields("Name")), " Name Is Null", " Name = " & Chr(34) & .Fields("Name") & Chr(34)) _
                    & " AND " & IIf(IsNull(.Fields("FName")), "FName Is Null", "FName = " & Chr(34) & .Fields("FName") & Chr(34)) _
                    & " AND " & IIf(IsNull(.Fields("DOB")), "DOB Is Null", "DOB = #" & Format(.Fields("DOB"), "mm/dd/yyyy") & "#") _
                    & " AND " & IIf(IsNull(.Fields("DateOfPlace")), "DateOfPlace Is Null", "DateOfPlace = " & Chr(34) & .Fields("DateOfPlace") & Chr(34)) _
                    & " ORDER BY NumberOfHolidays", dbOpenDynaset)
            If rs2.RecordCount > 1 Then
                i = rs2.RecordCount - 1
                rs2.MoveFirst
                For j = 1 To i
                    rs2.Delete
                    rs2.MoveNext
                Next j
            End If
            rs2.Close
            .MoveNext
        Loop
    End If
    .Close
End With

.RecordCount > 0 was .RecordCount > O (strange that)
I had neglected the rs2.MoveNext line (vital :s)
And the date needs formatting US-style, which I keep forgetting about
 
Hi,

thansk a lot. It is still running, I do not know how long it might take. meanwhile, if I want to add another field -which is a field with typoe double and format 0.000000. Please ca nyou tell me how i should add a new line ?

thanks, ozgur
 
How do you mean add a new field? A new field to identify unique people?
 
Hi,

I have a the lsit of fields to find the dublicates such as name, fname, dob, dateofplace. I would like have another field (age) in addition to the 4 fields above again to find the dublicates. and it is a field with type double and format 0.00000.

thanks, ozgur
 
OK here is the code with that addition (changes in bold)

Code:
Dim rs1 As DAO.Recordset
Set rs1 = CurrentDb.OpenRecordset("SELECT DISTINCT Name, FName, DOB, DateOfPlace[B], Age[/B] FROM Employees", dbOpenSnapshot)
With rs1
    If .RecordCount > 0 Then
        Dim rs2 As DAO.Recordset
        Dim i, j As Long
        .MoveFirst
        Do While Not .EOF
            Set rs2 = CurrentDb.OpenRecordset("SELECT * FROM Employees WHERE" _
                    & IIf(IsNull(.Fields("Name")), " Name Is Null", " Name = " & Chr(34) & .Fields("Name") & Chr(34)) _
                    & " AND " & IIf(IsNull(.Fields("FName")), "FName Is Null", "FName = " & Chr(34) & .Fields("FName") & Chr(34)) _
                    & " AND " & IIf(IsNull(.Fields("DOB")), "DOB Is Null", "DOB = #" & Format(.Fields("DOB"), "mm/dd/yyyy") & "#") _
                    & " AND " & IIf(IsNull(.Fields("DateOfPlace")), "DateOfPlace Is Null", "DateOfPlace = " & Chr(34) & .Fields("DateOfPlace") & Chr(34)) _
[B]                    & " AND " & IIf(IsNull(.Fields("Age")), "Age Is Null", "Age = " & .Fields("Age")) _
[/B]                    & " ORDER BY NumberOfHolidays", dbOpenDynaset)
            If rs2.RecordCount > 1 Then
                i = rs2.RecordCount - 1
                rs2.MoveFirst
                For j = 1 To i
                    rs2.Delete
                    rs2.MoveNext
                Next j
            End If
            rs2.Close
            .MoveNext
        Loop
    End If
    .Close
End With

I should say it's better to calculate someone's age from their date of birth than store it. A figure for someone's age goes out of date within a year ;)
 
By the way, is the first version still running? (I'd expect 50,000 records to take a few minutes at least but not too many)

If it's finished did it work properly?
 
Hi,

I o not know why but it did not clean all of them. Because the size of data is huge, I did not manage to find why it did not work and when it is not working. any idea?

thanks, ozgur
 
Without seeing the data - no :(

Maybe copy the table, delete all but the first few hundred records and run it on that to see what's happening.

Obviously, better that it deletes less than it should than more.

The one thing I would have to check: Does Name, FName, DOB and DateOfPlace uniquely identify each person? I ask because I would've thought first three do. So when you ask about adding another field I get concerned that it's going to miss duplicates even more:

Too many fields and it will miss duplicates and not delete them
Too few fields and it will falsely identify duplicates and delete them (very bad)

To reiterate: I would expect Name, FName and DOB to be the right fields to identify a person in the table. Are you sure it needs DateOfPlace too to do that?
 

Users who are viewing this thread

Back
Top Bottom