Duplicates, removing by date?

MikeDuffield · Oct 24, 2012

Hi guys & girls,

I'm having a little difficulty finding an answer to this one - I assume it's a query of some kind that I need, but I'm no expert on them.

Here's the scenario:

I import a big Excel file into a new table, it has no duplicate rows, but it does have duplicate values in column 6. What I need the database to do is look for duplicate values in column 6, find the duplicate that has the soonest date in column 10 and then delete the others.

It sounds complicated, but surely it can't be too difficult?

Cleaning up the source data is unfortunately not an option, unless we do it manually - on over 66,000 lines that's not practical!

Thanks in advance for any help.

Mike.

plog · Oct 24, 2012

Can you post some sample starting data and then what you expect to end up with based on that sample data?

gemma-the-husky · Oct 24, 2012

i do not think you can do this with a query directly.

i think you would need code to take a sorted query, and then delete the duplicates with code

sort of this pseudocode

Code:

loop until end of file
{
     read record
     if same date as last record then
         delete the record
     else
          reset any counters to reflect new date
     end if
}

MikeDuffield · Oct 24, 2012

I can't post the exact data, due to the nature of the business, however I can post an example of what I'm trying to get.

How our system works is that we may have 3 (or more) deliveries of a product to make. Now, to make these 3 deliveries we don't make the product in 3 different batches, we make 1 batch for the total quantity. Unfortunately the base data from our system doesn't show that, it splits it into actual delivery lines as follows:

Product Due Qty Batch
Prod4 1/1/2012 10 30
Prod4 1/6/2012 10 30
Prod4 1/6/2012 10 30

In that data there will be other products also, but to keep it simple I'm just focusing on "Prod4". If we take the data above as an example, I can see that I need to make 30pcs of the product - luckily the base data tells me that on every line of column 4, so I don't need to total it up myself. All I need the database to do is look at all lines (Which in reality is around 12,000 at the moment) and, for example, show me just one "Prod4" with the earliest due date.

Once it's done, I should have a few hundred different part numbers, all with their earliest due dates and a total of how many are needed.

I hope that's explained clearly enough, it's quite tricky doing it over the internet!

Thanks for your time.

Mike.

MikeDuffield · Oct 25, 2012

Can anyone else help??

Thanks

jdraw · Oct 25, 2012

Try

Code:

SELECT Prods.Product,Prods.Due,Prods.qty
FROM Prods
where  (((Prods.Due)=(SELECT min(Due) from Prods)));

I'm using Prods as the name of the Table.

If you have multiple Products and need to get the minimum due date by Product, the try

Code:

SELECT Product, Min(Due), Qty
FROM (SELECT Prods.Product, Prods.Due, Prods.Qty
            FROM Prods
            GROUP BY Prods.Product, Prods.Due, Prods.Qty)
GROUP BY Product, Qty;

I added a Prod5 to the table

Code:

ID	Product	Due	Qty	Batch
1	Prod4	1/1/2012	10	30
2	Prod4	1/6/2012	10	30
3	Prod4	1/6/2012	10	30
4	Prod5	1/7/2012	6	30
5	Prod5	1/23/2012	6	30

Query result

Code:

Product	Expr001	Qty
Prod4	1/1/2012	10
Prod5	1/7/2012	6

Good luck with your project

EDIT: Just saw plog's comment after I had added the multiple products and multiple dates. I saw the issue after I posted. plog is correct, but I think my adjustment for multiples handles it.

plog · Oct 25, 2012

Not to crap on jd's solution, but it only works if all your products have the same first Due date.

Mike, you provided the first part of what I wanted (starting sample data), but not the second part (what you expect the query to return based on that sample data). So let me guess: I think you would want this data returned based on that sample data:

Product, Due, Qty, Batch
Prod4, 1/1/2012, 10, 30

If that is correct, then you will need a sub-query and a main query to do this. Since you didn't provide a table name, I used 'YourTableNameHere' as the table's name. Replace all instances of that in the below code with your actual code's name.

Code:

SELECT YourTableNameHere.Product, Min(YourTableNameHere.Due) AS FirstDue
FROM YourTableNameHere
GROUP BY YourTableNameHere.Product;

Save the above query naming it FirstDue_sub. It finds the minimumt date of each Product. The below SQL will return the record you want (assuming I was correct in what you want).

Code:

  SELECT YourTableNameHere.Product, YourTableNameHere.Due, YourTableNameHere.Qty, YourTableNameHere.Batch
FROM YourTableNameHere INNER JOIN FirstDue_sub ON (YourTableNameHere.Due = FirstDue_sub.FirstDue) AND (YourTableNameHere.Product = FirstDue_sub.Product);

MikeDuffield · Oct 25, 2012

Thanks for the replies all.

Plog,

Your assumption on the result I want is correct, apologies for not posting that initially.

I modified your queries to show actual table and field names but it's not getting rid of duplicates. When I checked the query results against the main data table, they both had 25 records (For the part number I checked) - I was expecting the query to return just 1 record. The due date on all 25 records was the same, would that cause the problem?

MikeDuffield · Oct 25, 2012

to add to my post above, upon checking others it seems like the date is the problem because I've witnessed this:

In the master data table a product had 6 records. Your query resulted in 4 records, all with the same due date (Which of the 6 records, was the earliest date). Pretty much what I want, but I only want 1 record left per product.

Thanks again for your time.

jdraw · Oct 25, 2012

Try the multiples code from post #6

MikeDuffield · Oct 25, 2012

So, after modifying the queries suggested by Plog I came up with the following.

This is the sub query:

Code:

SELECT LiveData.Ref, Min(LiveData.[Mfg End Date]) AS FirstDue
FROM LiveData
GROUP BY LiveData.Ref;

Which returns 13,344 records, none of which are duplicates. Desired result.

This is the other query:

Code:

SELECT LiveData.Ref, LiveData.[Mfg End Date], LiveData.[WO Balance], LiveData.Qty, LiveData.[Part Number], LiveData.[Item Desc], LiveData.Cls1, LiveData.[Current Op], LiveData.[Prod Locn], LiveData.[WO Status], LiveData.[WO Print Status], LiveData.Comments, LiveData.History
FROM LiveData INNER JOIN FirstDue_sub ON (LiveData.Ref = FirstDue_sub.Ref) AND (LiveData.[Mfg End Date] = FirstDue_sub.FirstDue);

Which returns 23,475 records, including duplicates.

I assume I've done something wrong which is easy to fix, so would prefer to modify the above rather than try changing it completely.

Thanks for your patience - as you can see SQL isn't my strong point. I am OK with other aspects of Access but SQL still confuses the hell out of me!

plog · Oct 25, 2012

One of my two favorite words with data: 'duplicate' (the other is 'unique'). Neither is usually used in referencing data as the dictionary defines them. Let's make sure what 'duplicate' is in your mind.

Your SQL is going to show 13 fields of data. If these 'duplicates' are truly duplicates, in that every one of those 13 fields has the exact same data, you can use a GROUP BY clause to eliminate them. This would be that code:

Code:

SELECT LiveData.Ref, LiveData.[Mfg End Date], LiveData.[WO Balance], LiveData.Qty, LiveData.[Part Number], LiveData.[Item Desc], LiveData.Cls1, LiveData.[Current Op], LiveData.[Prod Locn], LiveData.[WO Status], LiveData.[WO Print Status], LiveData.Comments, LiveData.History
FROM LiveData INNER JOIN FirstDue_sub ON (LiveData.Ref = FirstDue_sub.Ref) AND (LiveData.[Mfg End Date] = FirstDue_sub.FirstDue)
GROUP BY LiveData.Ref, LiveData.[Mfg End Date], LiveData.[WO Balance], LiveData.Qty, LiveData.[Part Number], LiveData.[Item Desc], LiveData.Cls1, LiveData.[Current Op], LiveData.[Prod Locn], LiveData.[WO Status], LiveData.[WO Print Status], LiveData.Comments, LiveData.History;

If however, you have a different definition of 'duplicate' where one or more of those fields can be different among records (for example the Comments field in one 'duplicate' says 'Ship Overnight' and another says 'Two Day Delivery' then the above method will not work.

If that is the case, you need a way to define which one of those different 'duplicates' needs to be returned.

MikeDuffield · Oct 29, 2012

I understand, thanks for the explanation.

With regards to what I class as a duplicate, you're right when you say "you have a different definition of 'duplicate' where one or more of those fields can be different among records". In this instance values in all fields other than Qty will be the same. Some fields will be blank initially, as they are populated by database users - comments and history for instance.

Because of this, I would be happy for the 'duplicate' with the highest value in field Qty and the lowest value in Mfg End Date to be returned.

Does this help?

MikeDuffield · Oct 29, 2012

Actually, Qty can be duplicated also - but not in every case :-/

For instance, on one part number I have four records with the same Mfg End Date, of which three of those records have the same value in Qty.

Spanner in the works??

plog · Oct 29, 2012

Time for more sample data. Give me some examples and then what you want returned based on that sample data.

MikeDuffield · Oct 29, 2012

Right, on importing the data this is what I get, including 'duplicates':

Cls1 Ref Part Number Item Desc Qty Mfg End Date Prod Locn Status WO Balance
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 06/02/2013 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 05/05/2014 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 05/05/2014 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 05/05/2014 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 05/05/2014 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 05/05/2014 zone1 503.000 98772530S

If I then run the query you helped me with, it chops it down to the reference number with the earliest Mfg End Date, which leaves me with:

Cls1 Ref Part Number Item Desc Qty Mfg End Date Prod Locn Status WO Balance
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S

So, your query returns two records - whilst this example is actually a duplicate there are some that I consider to be duplicate but the value in Qty field may vary.

Problem I face, I think, is that if I modify the query to show lowest Mfg End Date (As it is now) AND the highest value in Qty, those two records above would still remain as they have the same value in Qty.

I feel I'm not explaining this very well, it would be an awful lot easier if it were face to face - apologies if my descriptions are not great.

Thanks,
Mike.

plog · Oct 29, 2012

Yeah, not a good example because all the fields values are the same. Using the GROUP BY query I posted 2 posts back would solve this for the example you posted.

Need a better example.

MikeDuffield · Oct 30, 2012

Right, well then this example is better:

Cls1 Ref Part Number Item Desc Qty Mfg End Date Prod Locn Status WO Balance
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 2.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 30.000 01/02/2012 zone1 503.000 98772530S
247 23110004 985209-14 123456 10.000 01/02/2012 zone1 503.000 98772530S

So, you already helped me with reducing duplicates down to what I have above, then with the GROUP BY query you posted would reduce this down to three records, all I need then is a query to keep only the record with the highest value in field Qty.

Does that sound right?

plog · Oct 30, 2012

Does that completely define your duplicates though? Right now, in English, this is what you want:

Records occuring on the minimum [Mfg End Date] with the Maximum value of [Qty] on that date.

This means if any of those other fields you listed is different it will cause duplicates. Suppose you have 2 records on 1/2/2012 both with a Qty of 7. If one has a Status of 503 and another has a status of 504 it will show twice. If one has a Locn of zone1 and another has zone3 it will show twice.

If you have records with the same End Date and Qty value but different anything else you will still get 'duplicates'. Let's make sure we define what you want so that it doesn't produce them.

MikeDuffield · Oct 30, 2012

Yes you're correct and I've checked the data to make sure:

Zones would never change as a reference number is always made in the same zone.

Status relates to reference number and as such would never change, as the reference number is either work in progress or isn't, this would be the same on all records for that reference number.

Cls1 relates to the part number, it defines what type of product it is and would never change.

WO Balance is the total needed for that reference number, so would be the same on every record.

Part Number and Description would also never change.

The only fields that may differ are Mfg End Date and Qty

Is there an easy way to modify the SQL I already have (Or is it easier to make a whole new query?)

Thanks for your time Plog, I'd be pretty stuck without your assistance!

Mike

Duplicates, removing by date?

Registered User.

Banishment Pending

Super Moderator

Registered User.

Registered User.

Super Moderator

Banishment Pending

Registered User.

Registered User.

Super Moderator

Registered User.

Banishment Pending

Registered User.

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Banishment Pending

Registered User.

Similar threads

Users who are viewing this thread