Removing duplicate text from random records

breezett93

New member
Local time
Today, 05:03
Joined
Jun 7, 2021
Messages
20
I've got a table where some of the cells in one column have nearly identical information entered twice. Here are some examples:
Duplicate.PNG
Duplicate2.PNG


I need a query to go through the whole table, catalog what data is in the cell, and delete duplicate text. So in my examples, the extra 11/16 49 97, 11/16 61 97, and Durafl would all be deleted.


Thanks!
 

plog

Banishment Pending
Local time
Today, 05:03
Joined
May 11, 2011
Messages
11,613
Please be explicit with your example data. We need 2 sets of data:

A. Starting data from your table(s). Include table and field names as well as enough data to cover all cases.

B. Expected results of A. Show us what data you expect to end up with after processing the data in A.

Again, 2 sets of data--beginning and expected results. Don't just give me starting data and then use words to explain--actually show me what data you expect to end up with.
 

breezett93

New member
Local time
Today, 05:03
Joined
Jun 7, 2021
Messages
20
Please be explicit with your example data. We need 2 sets of data:

A. Starting data from your table(s). Include table and field names as well as enough data to cover all cases.

B. Expected results of A. Show us what data you expect to end up with after processing the data in A.

Again, 2 sets of data--beginning and expected results. Don't just give me starting data and then use words to explain--actually show me what data you expect to end up with.
A. Table name = ItemTbl Field Name in question = Desc

B. There are over 2,200 records in this table, and potentially hundreds need correction. I can export the entire Desc column if you really want it. I hand-picked several examples to give an idea of how the data currently looks.
 

plog

Banishment Pending
Local time
Today, 05:03
Joined
May 11, 2011
Messages
11,613
Nope, not words. Data.

Let's call the data in your initial post A. Now give me B based on that.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 06:03
Joined
Feb 19, 2002
Messages
42,981
Humans are excellent computers. Our ability to recognize patterns is astounding. However, computers are dumb as rocks. They cannot do ANYTHING you cannot specifically define. For example:
1. The first 10 characters duplicated later in the same field
2. Any 6 characters duplicated later in the field.

The first rule is not hard to implement: you can use Left() and InStr(), and no looping is required. However, the second one requires a loop, and queries don't do loops, so you need to call a function. The function has to select the first 6 characters and, starting from character 7 and moving one character to the right at each iteration, compare the selected characters with the six characters starting at position x.

The problem is you can't define how many duplicate characters we are looking for or where in the field they might be. So you can create a function that takes a number: it starts at position 1, selects x characters, and uses InStr() starting at x+1 to see if that pattern repeats; then it loops, starting at position 2, selecting x characters, and using InStr() starting at x+2, and so on. Then the outer loop changes the number of characters from 6 to 7, and the inner loop starts by selecting the first 7 characters and searching positions 8 through the end with InStr().
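The nested loops described above can be sketched outside Access. This is a hypothetical Python illustration of the idea (not the eventual VBA): `min_len` plays the role of the 6-character window, and `str.find` stands in for InStr().

```python
def find_repeats(text, min_len=6):
    """Report runs of at least min_len characters that occur again
    later in the same string, as (substring, first_pos, repeat_pos).
    The outer loop varies the window size, the inner loop slides the
    window, and str.find plays the role InStr() would play in VBA."""
    hits = []
    n = len(text)
    for length in range(min_len, n):       # outer loop: window size
        for start in range(n - length):    # inner loop: window position
            window = text[start:start + length]
            again = text.find(window, start + 1)
            if again != -1:
                hits.append((window, start, again))
    return hits
```

On the sample value "11/16 49 97 11/16 49 97" this reports, among shorter hits, the full 11-character repeat starting at positions 0 and 12.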
 

breezett93

New member
Local time
Today, 05:03
Joined
Jun 7, 2021
Messages
20
Nope, not words. Data.

Let's call the data in your initial post A. Now give me B based on that.
11/16 49 97 11/16 49 97 -> 11/16 49 97
11/16 61 97 11/16 61 97 -> 11/16 61 97
1 Duraflake 49 97 Durafl -> 1 Duraflake 49 97


I will PM you a link to the table, as links aren't allowed in posts.
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:03
Joined
May 21, 2018
Messages
8,463
There are over 2,200 records in this table, and potentially hundreds need correction. I can export the entire Desc column if you really want it. I hand-picked several examples to give an idea of how the data currently looks.
This would help a lot. Just export it to Excel and post it. Like Pat said, it is unlikely you can find a rule. But I just worked through a similar thread. The trick is not to try to fully automate it, which is impossible, because without artificial intelligence you cannot define a rule. Instead, you want to build some interfaces and queries to make cleanup easy.
See the discussion. Download the tool and see how easy it is to search for different patterns. You would likely need something like that.
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:03
Joined
May 21, 2018
Messages
8,463
Without seeing the data, it would be impossible to even guess at an approach to ease the burden. But bottom line: there is no code that could ever be written to do this automatically.
Like in that other thread, there is no way to determine what is considered a duplicate unless you have an idea of what the real patterns and data are.
What if, instead of
11/16 49 97 11/16 49 97
it was
11/16 49 97 11/16 49 98
Is the 11/16 49 9 part a duplicate? Who knows.
Or if
1 Duraflake 49 97 Durafl
was
1 Duraflake 49 2 Duraflake 49
Is that a duplicate, or does the second Duraflake have some meaning?

Send the data and we will see if we can provide some tools and queries to help identify duplicates.
 

isladogs

MVP / VIP
Local time
Today, 10:03
Joined
Jan 14, 2017
Messages
18,186
You should be able to add links in your next post, as you have now completed 10 posts.
 

breezett93

New member
Local time
Today, 05:03
Joined
Jun 7, 2021
Messages
20
Without seeing the data, it would be impossible to even guess at an approach to ease the burden. But bottom line: there is no code that could ever be written to do this automatically.
Like in that other thread, there is no way to determine what is considered a duplicate unless you have an idea of what the real patterns and data are.
What if, instead of
11/16 49 97 11/16 49 97
it was
11/16 49 97 11/16 49 98
Is the 11/16 49 9 part a duplicate? Who knows.
Or if
1 Duraflake 49 97 Durafl
was
1 Duraflake 49 2 Duraflake 49
Is that a duplicate, or does the second Duraflake have some meaning?

Send the data and we will see if we can provide some tools and queries to help identify duplicates.
My understanding is that there should only be one instance of a word in the cell. That's coming from the users actively working with the data. So maybe we start there, since numbers are harder to track.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 06:03
Joined
Feb 19, 2002
Messages
42,981
What constitutes a word? If "words" are separated by spaces, use the Split() function to get each non-space piece into an element of an array. Loop through the array and determine whether all the characters are letters. If so, append the "word" to array 2. Then create a loop that takes each element in array 2 and compares it to each of the other elements to its right. So if you end up with 8 items, the loop runs 7 times; the eighth item has nothing to its "right".
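As a rough illustration of that Split()-and-compare idea, here is a Python sketch (not Access VBA; the helper name is made up):

```python
def drop_repeated_words(desc):
    """Keep only the first occurrence of each all-letter 'word', where
    words are the space-separated pieces (as Split() would produce).
    Pieces containing digits or punctuation pass through untouched."""
    seen = set()
    kept = []
    for piece in desc.split(" "):
        if piece.isalpha():              # letters only: dedup candidate
            if piece.lower() in seen:
                continue                 # later occurrence: drop it
            seen.add(piece.lower())
        kept.append(piece)
    return " ".join(kept)
```

Note the limitation: this only catches exact word repeats, so a truncated second copy like "Durafl" in "1 Duraflake 49 97 Durafl" slips through, and repeated numbers are left alone entirely.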

If the link is to a file to be downloaded, I suggest that you use the site tools instead and use the attach-files option to load it here. People will be more likely to download the file.

So far, I've given you several concrete suggestions for finding dups; why not at least try to implement one of them? This one is the easiest.
 
Last edited:

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:03
Joined
May 21, 2018
Messages
8,463
So I can get you a 95% solution. You would then need a nice interface built to make the cleanup easy. You want to find the greatest repeated substring and then sort by it. Here are my results. As you can see, the longer the substring, the more likely it is a duplicate.
Query1

ID    Desc                        LCS           Length
1092  Carton 28 76 Carton 28 7    Carton 28 7   11
2661  Glass 12 12 Glass 12 12     Glass 12 12   11
2377  Recon 50 124 Recon 50 12    Recon 50 12   11
627   11/16 61 97 11/16 61 97     11/16 61 97   11
1090  Carton 28 52 Carton 28 5    Carton 28 5   11
609   11/16 49 121 11/16 49 12    11/16 49 12   11
1095  Carton 34 64 Carton 34 6    Carton 34 6   11
1097  Carton 34 76 Carton 34 7    Carton 34 7   11
604   11/16 49 97 11/16 49 97     11/16 49 97   11
409   11/16 49 97 11/16 49 97     11/16 49 97   11
1091  Carton 28 64 Carton 28 6    Carton 28 6   11
410   11/16 61 97 11/16 61 97     11/16 61 97   11
433   11/16 61 145 11/16 61 14    11/16 61 14   11
931   11/16 61 121 11/16 61 12    11/16 61 12   11
981   1 1/8 49 97 1 1/8 49 97     1 1/8 49 97   11
194   1 11/16 35 82 1 11/16 35    1 11/16 35    10
185   1 11/16 35 95 1 11/16 35    1 11/16 35    10
594   5/8 49 145 5/8 49 145       5/8 49 145    10
530   1/4 49 145 1/4 49 145       1/4 49 145    10
737   1/16 37 86 1/16 37 86       1/16 37 86    10
2760  Oak,White-PS Oak,White      Oak,White     9
740   1/16 36.5 109 1/16 36.5     1/16 36.5     9
126   BP585808X 0 0 BP585808X     BP585808X     9
552   3/8 61 97 3/8 61 97         3/8 61 97     9
2623  Cherry-PS Cherry-PS         Cherry-PS     9
523   1/4 49 97 1/4 49 97         1/4 49 97     9
373   3/8 49 97 3/8 49 97         3/8 49 97     9
388   1/2 49 97 1/2 49 97         1/2 49 97     9
2849  Walnut-QD Walnut-QD         Walnut-QD     9
2848  Walnut-PS Walnut-PS         Walnut-PS     9
507   1/8 49 97 1/8 49 97         1/8 49 97     9
2769  Olive Ash-PS Olive Ash      Olive Ash     9
2762  Oak,White-RF Oak,White      Oak,White     9
543   3/8 49 97 3/8 49 97         3/8 49 97     9
513   1/8 61 97 1/8 61 97         1/8 61 97     9
2321  Okume,RT 50 99 2 Okume,RT   Okume,RT      8
130   BP585910Y 10 60 BP585910    BP585910      8
2590  Beech,Stmd-PS Beech,St      Beech,St      8
149   BUZZISKIN 12 12 BUZZISKI    BUZZISKI      8
131   BP586010X 50 150 BP58601    BP58601       7
2322  Okume,RT 50 123 2 Okume,R   Okume,R       7
129   BP585910X 20 120 BP58591    BP58591       7
54    46.0689 Maple 46.0689       46.0689       7
67    46.0691 Maple 46.0691       46.0691       7
66    46.0692 Maple 46.0692       46.0692       7
2733  Maple,White-PS Maple,W      Maple,W       7
60    46.0694 Maple 46.0694       46.0694       7
56    46.0690 Maple 46.0690       46.0690       7
68    46.0693 Maple 46.0693       46.0693       7
62    46.0696 Maple 46.0696       46.0696       7
446   1 Duraflake 49 97 Durafl    Durafl        6
128   BP585810Y 200 400 BP5858    BP5858        6
2421  Walnut-PS 37 85.5 Walnut    Walnut        6
127   BP585810X 100 300 BP5858    BP5858        6
122   BP5850K10NA 25 125 BP585    BP585         5
121   BP5850K10BL 25 125 BP585    BP585         5
196   1 3/4 NMB7 1.25 85 1 3/4    1 3/4         5
268   1 3/4 TB10 1.5 128 1 3/4    1 3/4         5
1677  Birch-RT 50 99 SLWH Birch   Birch         5
844   1 Honeycomb 49 97 Honey     Honey         5
899   5/8 FSC NAUF 49 97 FSC N    FSC N         5
106   49.1840 Trolleys 49.18      49.18         5
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:03
Joined
May 21, 2018
Messages
8,463
The function is pretty easy.
Code:
Public Function GetGRSS(str As String, Optional MinLength As Long = 1) As String
    'Greatest Repeated SubString: compare every pair of suffixes of str
    'and keep the longest common prefix found between any two of them.
    Dim N As Long
    Dim i As Long
    Dim j As Long
    Dim strI As String
    Dim strJ As String
    Dim MatchedString As String
    N = Len(str)
    For i = 1 To N
      strI = Mid(str, i)
      For j = i + 1 To N
        strJ = Mid(str, j)
        MatchedString = GetMatchedString(strI, strJ)
        If Len(GetGRSS) < Len(MatchedString) Then GetGRSS = MatchedString
      Next j
    Next i
    GetGRSS = Trim(GetGRSS)
    If Len(GetGRSS) <= MinLength Then GetGRSS = ""
End Function

Public Function GetMatchedString(strI As String, strJ As String) As String
  'Longest common prefix of the two suffixes. strI is always the
  'longer one (it starts earlier), so looping to Len(strJ) is safe.
  Dim i As Long
  Dim chrI As String
  Dim chrJ As String
  For i = 1 To Len(strJ)
    chrI = Mid(strI, i, 1)
    chrJ = Mid(strJ, i, 1)
    If chrI = chrJ Then
      GetMatchedString = GetMatchedString & chrI
    Else
      Exit For
    End If
  Next i
End Function

A query could be something like
Code:
SELECT ItemTbl.ID, ItemTbl.Desc, GetGRSS([desc]) AS LCS, Len(GetGrss([Desc])) AS Length
FROM ItemTbl
WHERE (((GetGRSS([desc]))<>""))
ORDER BY Len(GetGrss([Desc])) DESC;

This kind of query is expensive: you are running a complex function that is O(N^3) for each record. You may need a temp table to hold your results.

My interface would then show you the results and allow you to select which ones you want to remove the second occurrence. It still requires user oversight, but super easy to clean up.
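For readers without Access handy, the same algorithm can be sanity-checked with a rough Python port (an illustration of the GetGRSS idea above, not the VBA itself):

```python
def greatest_repeated_substring(s, min_length=1):
    """Port of the GetGRSS idea: for every pair of start positions
    i < j, take the longest common prefix of the two suffixes; the
    longest prefix seen anywhere is the greatest repeated substring.
    Two nested position loops times a linear comparison gives O(N^3)."""
    best = ""
    n = len(s)
    for i in range(n):
        for j in range(i + 1, n):
            k = 0
            while j + k < n and s[i + k] == s[j + k]:
                k += 1                    # extend the common prefix
            if k > len(best):
                best = s[i:i + k]
    best = best.strip()                   # Trim(), as in the VBA
    return "" if len(best) <= min_length else best
```

Run against the sample Desc values, it reproduces the table above: the two fully duplicated strings come back with an 11-character LCS, and the truncated Duraflake row comes back with "Durafl".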
 
Last edited:

MajP

You've got your good things, and you've got mine.
Local time
Today, 06:03
Joined
May 21, 2018
Messages
8,463
Here is a working application. Not the greatest, but it works.
When you open the form, it runs a query and appends the results to the table. Then you can either edit the duplicates by hand or click the button to replace the duplicate. You would have to do some modification to handle other fields, but the idea is there.

Application2.jpg
 

Attachments

  • GreatestRepeatedSubString.accdb
    2.3 MB · Views: 337

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 18:03
Joined
May 7, 2009
Messages
19,175
use a layup, don't slam dunk!

think twice before just deleting anything.
is this your master table?
do you have a transaction table that also refers to this table?
what if you have transactions on the dup item?
they get orphaned, and your report will show a big difference
because these items no longer show up on it.
 

breezett93

New member
Local time
Today, 05:03
Joined
Jun 7, 2021
Messages
20
Here is a working application. Not the greatest, but it works.
When you open the form it runs a query and then appends the results to the table. Then you can edit by hand the duplicates or click the button to replace the duplicate. You would have to do some modification to handle other fields, but the idea is there.

View attachment 94406
Wow! Thank you so much for putting so much effort into this. I look forward to testing it out!
 
