Delete duplicates without primary key

YariLei

Hi,

I've run into a big problem with Access 2010: I can't seem to remove duplicates from a table containing millions of entries midway through a series of queries.

The table is formed like this:
Date - Serial Number - Reading 1 - Reading 2 - Reading 3
30.7.2013 - 1122334455 - 12345 - 987654 - 654321
30.7.2013 - 1122334466 - 12345 - 987654 - 654321
29.7.2013 - 1122334455 - 12210 - 987654 - 643210
29.7.2013 - 1122334455 - 12210 - 987654 - 643210
29.7.2013 - 1122334466 - 12344 - 986013 - 453213

As you can see, some rows are exact duplicates of each other and some differ slightly. I want to remove only the rows that are exact duplicates, keeping one copy of each. Running "Find Duplicates" in Access gives me about 250,000 rows of such data.

I've tried a few options already:
- I cannot use the date as a primary key, because each date has several serial numbers, and I cannot use the serial number, because each serial number appears on several dates. Using a reading value as the primary key isn't an option either.
- Microsoft's help says I should mark every duplicate value with an x and then run a delete query to get rid of the x-marked rows. For 250,000 duplicates, that's a bit too much manual 'x-ing'.
- If I build a delete query on top of the Find Duplicates query, it removes all 250,000 entries from the table, instead of just the 125,000 that have a duplicate elsewhere in the table.
- If I make a query that identifies the duplicate data and returns just one row per duplicate, the delete query still deletes both entries from the original table.
- I could make a query that returns only unique rows by using Totals as criteria (for instance, using First on the Date column; something like the query sketched after this list). However, this still leaves the duplicates in the original table. Note that this database is already 800 MB and new data is imported once a week, for the next decade or so. I cannot have the table collect duplicates every week and just leave them there.
- If I make a macro that first creates the unique-values table and then deletes the old table, what happens next week when I try to import new readings? I would need a new macro every time I import, because the table names change. Or is there a way to first extract the unique rows and then replace the original table with the new one? With an 800 MB database that would put me dangerously close to the 2 GB size limit, and I couldn't use it as part of a macro, because the database would need a Compact & Repair every time the original table is deleted, midway through a longer series of queries.
- Removing the duplicates from the original import isn't an option either, as the import is an Excel sheet covering the past three months that overwrites the previous one, once a week.
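For reference, the kind of unique-values query I mean is roughly the one below, written as VBA instead of in the query designer ('Total' and 'Total_Unique' are just placeholder names for my tables, and DISTINCT does the same job for exact duplicates as the Totals/First trick):

Sub MakeUniqueCopy()
    ' Make-table query: keeps exactly one copy of every row,
    ' so exact duplicates collapse into a single record.
    CurrentDb.Execute _
        "SELECT DISTINCT [Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3] " & _
        "INTO Total_Unique FROM Total", dbFailOnError
End Sub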

As you can see, it's quite a pickle getting the duplicates out of the original table. This is just a small part of a very long macro that takes about 15 minutes to complete, and because of the duplicates the database is getting far too large.

Any ideas? :confused:

Cheers!

-YariLei
 
Unfortunately, only ideas, not solutions.
Create a new table that is an exact copy of the original table's STRUCTURE.
Create an index on the fields so that duplicates are not allowed (use the Indexes window).
Copy the data from the original table into this new one (CTRL+A, CTRL+C, CTRL+V).
Pay attention to the popup window that will appear.
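In SQL, those steps could be sketched roughly like this (only a sketch; I'm assuming your big table is called Total and the new copy Total_Clean, so adjust the names):

Sub MakeCleanCopy()
    Dim db As DAO.Database
    Set db = CurrentDb
    ' Copy only the STRUCTURE of the original table (WHERE 1=0 selects no rows)
    db.Execute "SELECT * INTO Total_Clean FROM Total WHERE 1=0", dbFailOnError
    ' Unique index across all five fields: any pasted or appended row that
    ' already exists in Total_Clean is rejected as a key violation
    db.Execute "CREATE UNIQUE INDEX uqNoDups ON Total_Clean " & _
               "([Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3])", dbFailOnError
End Sub

When you paste or append the data afterwards, the confirmation popup should report how many rows could not be added because of key violations; those are the duplicates.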

If this works, we'll try together to find a way to automate the task.

To get better help, create a new DB, import that table, keep some simple sample data in it, convert it to 2003 format and upload that DB.
 
And WELCOME to AWF!
 
Hi,

Thanks for the tip. That's a good way of getting a clean table with unique data only (pretty much the same as using 'First' as the criteria when creating a new table). I'm just worried about what to do with the original table afterwards. I could make a VBA macro to delete the old table and point the following queries in the series at the new one. However, I would still need to do a Compact & Repair right after, as the original somehow lingers in some invisible 'memory', causing the whole database to grow while the rest of the queries run after this procedure.
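The table swap itself would only be a couple of lines of VBA, something along these lines (Total and Total_Clean being placeholder names for the old table and the deduplicated copy):

Sub SwapInCleanTable()
    ' Drop the old, duplicate-laden table and give the clean copy its name,
    ' so the rest of the query chain keeps pointing at "Total".
    DoCmd.DeleteObject acTable, "Total"
    DoCmd.Rename "Total", acTable, "Total_Clean"
End Sub

but that still leaves the bloat behind until a Compact & Repair is run.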

Another issue is the next weekly import. I can't append the new data to the original table if I've deleted it, so leaving the original in place is the only option, but then new duplicates will be generated every week until the three months have passed. Perhaps I could try what you proposed and use the index to make it impossible to import duplicate data into the table...

-YariLei
 
Well, that didn't work. The database is too large for Access to handle: with Duplicates set to not allowed in the target table's index, the append query for the new import is too heavy for Access. It tries to compare every field in every row, gets about a third of the way through the query and then stops making any progress, as comparing all the fields in a million-row table exceeds the calculating power of my laptop...
 
Here is something that might help you.
I think you can completely remove the old table now.
Rename the newly created one with the same name as the old one. Run the append query. This should append only the records that don't create duplicates (at least that is what I think; try it!).

Also, I think you haven't split your DB. You should do that. That way your DB will not keep growing when you run queries.
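If the indexed table turns out to be too heavy, the same "append only what isn't already there" idea can also be written without relying on the index, as an unmatched-records (outer join) append. A rough sketch, assuming the weekly import is linked in as NewReadings (all names are placeholders):

Sub AppendOnlyNewRows()
    Dim sql As String
    ' LEFT JOIN the import against Total on every field; rows with no match
    ' (t.[Serial Number] Is Null) are the ones that do not exist in Total yet.
    sql = "INSERT INTO Total ([Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3]) " & _
          "SELECT DISTINCT n.[Date], n.[Serial Number], n.[Reading 1], n.[Reading 2], n.[Reading 3] " & _
          "FROM NewReadings AS n LEFT JOIN Total AS t ON " & _
          "(n.[Date] = t.[Date]) AND (n.[Serial Number] = t.[Serial Number]) AND " & _
          "(n.[Reading 1] = t.[Reading 1]) AND (n.[Reading 2] = t.[Reading 2]) AND " & _
          "(n.[Reading 3] = t.[Reading 3]) " & _
          "WHERE t.[Serial Number] Is Null"
    CurrentDb.Execute sql, dbFailOnError
End Sub

The DISTINCT also stops the import file itself from contributing two copies of the same row in one run.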
 
Thanks, I got it to work, but for some reason no new values are being added to the indexed table. When I tried to add non-duplicate values, the index let only about 440 entries into the new table instead of the 1.2 million. The append query also rejects all entries due to a type conversion failure. Removing the index so that duplicates are allowed makes the problem go away...
 
As I already said, give us a sample of your DB and of the table from which you are trying to add records.
I ask you again: is your DB split into a FE and a BE?
 
Here you go, a sample of the queries is attached. This is about a tenth of the query series, actually the first 7 of the 76 queries that continue after this. Currently the DB is indeed a single file, so I'll need to separate them. But even if I split it into one Access file for the queries and another for storing the data, I still need a way to avoid storing any duplicates.

In the attached sample, consider esap new readings and xsap new readings to be tables linked from Excel. They contain different kinds of data, but both are imported and transformed into similar data, possibly causing further duplicates. The sources are updated once a week, and the source file always contains three months' worth of data. In the sample, the sequence for adding the data is a1-b1-b2-b3 and so on.

Cheers!

YariLei
 

Unfortunately, indexing and primary keys resulted in empty tables. The only thing that actually worked was to make a small loop for the tables. First, it adds the new data to Total, duplicates and all. Then it creates a list of unique rows (using Max or First on Reading 1 while the other columns are Group By). Finally, a simple query rewrites the original Total table with the unique data, so new data can be appended later and the loop run again to clear the duplicates. This is done in a separate database, to keep the one holding the queries as small as possible for the queries that follow.
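Roughly, the loop boils down to something like the sketch below (placeholder names again; my real version uses Max/First with Group By instead of DISTINCT and runs against the separate storage database):

Sub DedupeTotal()
    Dim db As DAO.Database
    Set db = CurrentDb
    ' 1. Build a list of unique rows (assumes Total_Unique does not exist yet)
    db.Execute "SELECT DISTINCT [Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3] " & _
               "INTO Total_Unique FROM Total", dbFailOnError
    ' 2. Rewrite the original table from the unique list, so the later queries
    '    and next week's import keep referring to the same table name
    db.Execute "DELETE * FROM Total", dbFailOnError
    db.Execute "INSERT INTO Total ([Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3]) " & _
               "SELECT [Date], [Serial Number], [Reading 1], [Reading 2], [Reading 3] FROM Total_Unique", dbFailOnError
    ' 3. Drop the helper table again
    db.Execute "DROP TABLE Total_Unique", dbFailOnError
End Sub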

Thanks a ton for your help with this! Splitting the database was a good tip for a project as large as this.

-Yari
 
