Deduplication Query help

pmonaghan · Jul 17, 2012

Hi,

I need help deleting duplicate records from a table.
The table contains details about email, such as sender, recepient, subject, sent date etc etc. I would like to delete records where the sender, subject and from fields match.
I have an ms access generated unique ID field.
The duplicates there because my central table has been amalgomated from several different sub tables. I queried a search term on the central table get the current table I am working on. There are just under 1000 records.
If possible I would like to delete certain data sets before others but this is not essential.

In summary, if all of the three from, subject and sender selected fields match, I would like to delete the duplicate records. I would like to delete on a priority basis based on what table the record comes from (shown in another field).

Can anyone help with this?

I am pretty new to access so go easy on me!

Pete

Sketchin · Jul 17, 2012

Try the Query wizard and use the "Find duplicates" option

plog · Jul 17, 2012

Can you provide some example data? Provide the table name, the pertinent field names, sample data for each field and then show what rows should remain after the deduplication method is applied.

pmonaghan · Jul 18, 2012

Thanks Sketchin, I have tried using the wizard but it isn't really doing what I want it to do.

Plog- Here is a link to an example table. Sorry its in spreadsheet form instead of in an access db but I thought this would be the best way to send it.

The query I want to run would leave only the records from data set 1 (or 2 if one is not present). The records left after the dedup would be 1, 4 and 7.

Thanks in advance for you help and time!

pmonaghan · Jul 18, 2012

This post is so I have 10 posts so I can post my link.

pmonaghan · Jul 18, 2012

Not quite... this is 10.

Vassago · Jul 18, 2012

You can attach documents to the post directly without needing an outside link. Posting for the sake of just posting is generally frowned upon.

AccessOfEvil · Jul 18, 2012

Assume your table is tblEmail with these columns:

ID, DataSet, Folder, From, To, Subject, Body

To group the records by From/To/Subject, selecting only the lowest/first DataSet for that set of duplicates, you can create a query qryImportantEmails as follows:

SELECT Min(tblEmails.Dataset) AS MinOfDataset, tblEmails.From, tblEmails.To, tblEmails.Subject
FROM tblEmails
GROUP BY tblEmails.From, tblEmails.To, tblEmails.Subject;

Next, create qryDuplicateEmails as follows:

SELECT tblEmails.*
FROM tblEmails INNER JOIN qryImportantEmails ON (tblEmails.From = qryImportantEmails.From) AND (tblEmails.To = qryImportantEmails.To) AND (tblEmails.Subject = qryImportantEmails.Subject)
WHERE (((qryImportantEmails.MinOfDataset)<>[tblEmails].[Dataset]));

- basically just all the emails where the DataSet values don't match the "keeper" DataSet values.

Finally, just delete the above using qryDeleteDuplicateEmails as:

DELETE *
FROM tblEmails
WHERE tblEmails.ID
IN
(SELECT ID FROM qryDuplicateEmails);

This assumes ID is unique for every email.

This is a bit cumbersome, but it has the advantage that you can look at both your valuable and not valuable emails before executing that last step.

It is always a good idea to do a SELECT query of whatever you are planning to delete first to kind of "preview" it, just in case you thought you are getting rid of 10 rows but actually it shows 1000 rows...you can then step back and rethink your criteria before losing anything valuable.

pmonaghan · Jul 18, 2012

Thanks for this AccessofEvil.

I should probably give you the right field names now I am back at the right computer.

ID, File, CorrectedFolder, From, To, CC, BCC, Attachment, Subject, Body, Sent, Received

The table is called RAT.

I changed the fields in the query as my criteria for a unique email are From, Subject and Sent being the same.

I got the first qryImportantEmails to work with a bit of tweaking, the SQL looks like this and it seems to give the right results:

SELECT Min(RAT.File) AS MinOfFile, RAT.From AS Expr1, RAT.Subject AS Expr2, RAT.Sent AS Expr3
FROM RAT
GROUP BY RAT.From, RAT.Subject, RAT.Sent;

On the second query qryDuplicateEmails I get a message saying 'enter a parameter value' when I try and view the sheet and one saying 'Access can't represent the join expression RAT.From = qryImportantEmailsRAT.From in Design view'.

My SQL:
SELECT RAT.*
FROM RAT INNER JOIN qryImportantEmailsRAT ON (RAT.From = qryImportantEmailsRAT.From) AND (RAT.Subject = qryImportantEmailsRAT.Subject) AND (RAT.Sent = qryImportantEmailsRAT.Sent)
WHERE (((qryImportantEmailsRAT.MinOfFile)<>[RAT].[File]));

What have I done wrong?

Thanks for you help!!

AccessOfEvil · Jul 18, 2012

In your qryImportant Emails, you are renaming the columns (not sure how this happened), so that if you look at the results, "From", "Subject", and "Sent" have been replaced by "Expr1", "Expr2", and "Expr3".

I think this is why the next query is complaining - it can't find "From" (or "Subject" or "Sent").

The solution is to delete the "As Expr" pieces from the SQL. If you are looking at the query in the "normal" way (vs just SQL), change where it says "Expr1: From" under Field to just "From". I wonder if using potential keywords such as From and To as column names caused Access to want to rename them.

pmonaghan · Jul 18, 2012

Ok- that runs now, however when I run the delete query I still have duplcates in the table- around 120 to be precise. I assume you are going to ask for the table I am working on at this point.. which I can't really share for confidentiality reasons. Is there anything you can see thats wrong with the SQL at this stage?

Thanks again!

pmonaghan · Jul 18, 2012

Also, I'm not sure if this makes a difference to the output but the qryImportantEmails qry isn't exactly as we have described above.
As you may have suspected, the File field has a file name in it instead of a dataset number as I simplified previously.

The files I have are:

Personal Folder
Private Folder
Public Folder

If this isn't a problem I still need to edit the first query to assign priorities to these folders as opposed to treating them as minof selections.

AccessOfEvil · Jul 18, 2012

As far as the remaining 120 duplicates, it's hard to say without looking at the data (which I understand you can't share), but one guess is that you need one or two more fields to make the data truly unique.

In other words, you thought that From, Subject, and Sent are enough to uniquely identify an email, but maybe you need To in there as well (I'm not sure if Sent is the date/time when the email was sent?).

You could pore over the remaining duplicates and see what patterns you see - where the data are similar, and how you could further differentiate between them.

For using text (Public, Private, Personal) vs. numbers, I would recommend using a lookup table with priorities of folders (call it tblFolderPriorty for example). The table would have [FolderName] as a Text field and [FolderPriorty] as a Number field. Then you could assign priorities as you wish to the folders (and any other future folders could be added into that table).

You would have to do an extra JOIN in the query, something like the following:

SELECT Min(tblFolderPriority.FolderPriority) AS MinOfPriority, RAT.From, RAT.To, RAT.Subject
FROM RAT
INNER JOIN tblFolderPriority
ON tblFolderPriority.Folder = RAT.Folder
GROUP BY RAT.From, RAT.To, RAT.Subject;

pmonaghan · Jul 18, 2012

Its a little strange.. the From, Subject and Sent fields all match in the duplicates but they haven't been deduped out.. I'm going to scramble the sensitive data on a selection of my data so I can show you an example.

I'll also have a look at the Priority Folder query you wrote.

Thanks so much for your continued support.

P

pmonaghan · Jul 18, 2012

Ok, I simply can't share this data. By the time I've changed the senstive parts theres nothing left to look at. Theres no point.

Just picking up one something you wrote..

I want the duplicate query to treat a record as a duplicate if it has the same sender, subject and sent time (the sent time is in dd/mm/yyyy hh:mm:ss format and is the date and time that the email was sent). This is on the basis that a person cannot sent more than one email at exactly the same time and the inclusion of the subject is a further control. Is the current query taking other fields into account?

To go over my last point again, the duplicates that remain all have the same From, Subject and Sent fields.

Sorry if this is a little basic- as I said, I'm new.

AccessOfEvil · Jul 18, 2012

Hmm...without really seeing the data, I'm just going to make progressively stupider guesses.

One thing to verify is that none of the remaining duplicates share the same folder name. In other words, the original idea of "save the lowest/first folder of each set of duplicates on From/Subject/Sent" is actually not a sufficient criterion.

pmonaghan · Jul 18, 2012

So I'm rewriting my second initial query to have another go but its asking me for a parameter value when I try to run the qry.

My tblImportantFolders has two fields: [FileName] and [FilePriority]
My table is called kst_tbl

The SQL for the qry looks like this:
SELECT kst_tbl.*
FROM kst_tbl INNER JOIN qryImportantEmails ON (kst_tbl.From = qryImportantEmails.From) AND (kst_tbl.Subject = qryImportantEmails.Subject) AND (kst_tbl.Sent = qryImportantEmails.Sent)
WHERE ((([qryImportantEmails].[MinOfFileName])<>[kst_tbl].[File]));

Where have I gone wrong?

pmonaghan · Jul 18, 2012

I think its because I don't have the file field in my qryImportantEmails query...

WHERE ((([qryImportantEmails].[MinOfFileName])<>[kst_tbl].[File]));

How do I fix this?

pmonaghan · Jul 18, 2012

Also- you're not located anywhere near Surrey are you!?

AccessOfEvil · Jul 19, 2012

a rather long swim from there I'm afraid...I think it maybe should be "MinOfFile" rather than "MinOfFileName"?

Deduplication Query help

Registered User.

Registered User.

Banishment Pending

Registered User.

Registered User.

Registered User.

Former Staff Turned AWF Retiree

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Registered User.

Similar threads

Users who are viewing this thread