Form to handle duplicates (1 Viewer)

zfar

New member
Local time
Today, 02:40
Joined
Aug 13, 2017
Messages
17
I have a large inventory database.

There is a table of type of object eg:
"4 drawer Filing Cabinet".

There are multiple duplicates of this due to misspelling
"4 drawer Filing Carbinet"
"4 drawer Filling Cabinet"
"4 draw Filing Cabinet"

The primary key of this table has then been used in other places in the database to assign other information.

I would like to have a form where one can identify duplicates such as this in a database.
There would probably be a list box where the user could assign several of them as equivalent.
The user would then declare one of the variations "correct"
The form would then run an update query on all the foreign keys that used the incorrect variations and update to the "correct" primary key.
I then want to delete (or probably in proper database design declare redundant) all incorrect variations.
This would then eventually allow for properly normalised data.

I understand the principles of how this would be done. (Filter datasheets from test boxes, generate update queries reading the primary key off, search the database for uses of the primary key.) However, building this would take me a very long time.

This seems such a fundamental data cleaning task I would have thought there would be multiple templates and examples. However, I haven't found any - maybe it is just my bad searching.

Any help would be appreciated.
 

theDBguy

I’m here to help
Staff member
Local time
Today, 02:40
Joined
Oct 29, 2018
Messages
21,454
Hi. I don't know of an existing template either, but do you already have a logic for listing duplicate candidates? If not, maybe I can point you to a couple of algorithms.
 

plog

Banishment Pending
Local time
Today, 04:40
Joined
May 11, 2011
Messages
11,638
However, building this would take me a very long time

Absolutely, the system you described would take anyone a long time. The real question is why must that duplicate system exist?

Why must it be so formal? A system as polished as you propose assumes that these duplicates will be an ongoing issue, that this system won't just be used once. It just accepts that new duplicates will continue to be entered into the database. Why? Why can't the database be built that prevents duplicates going forward?

If you can achieve that then you don't need such a duplicate correction system. If you can prevent duplicates going forward you just need to clean the data once. If you just need to clean the data once you don't need such a polished system, you just need to fix the duplicates somehow. Worse case, that could be achieved by 2 people-- a database person to run ad hoc queries to help find "duplicates" and a user who knows the data well enough to identify duplicates. If you have one database person who knows the data, they coudl do it all by themselves.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 05:40
Joined
Feb 19, 2002
Messages
43,223
One of the things that causes this type of issue is allowing users to modify combos on the fly. If the item they type doesn't exist, you can use the NotInList event to allow the user to add a new value so he adds his typo.

I prefer to limit the maintenance of lists to users who are designated as admins. These people have better training and know enough to search
 

zfar

New member
Local time
Today, 02:40
Joined
Aug 13, 2017
Messages
17
There is a long and painful story of why this data exist in this format. Let's just say copies on an excel system for tracking equipment inventories sent out to multiple sites with semiliterate individuals completing the data (and refusing assistance from anyone with knowledge). When all the seperate excel sheets are finally collated there are multiple combinations that exist.
The form and idea I am describing are to allow a few informed individuals to correct the system. (which now contains 600 thousand items split across 13000 categories)
Obviously controlling the original data entry would have been ideal - but I am not involved in a manner that allowed that.

If I can create the form and system I describe then the cleaning of the data can be distributed across a number of informed people. (who know the equipment but not databases or even advanced excel)
 

theDBguy

I’m here to help
Staff member
Local time
Today, 02:40
Joined
Oct 29, 2018
Messages
21,454
There is a long and painful story of why this data exist in this format. Let's just say copies on an excel system for tracking equipment inventories sent out to multiple sites with semiliterate individuals completing the data (and refusing assistance from anyone with knowledge). When all the seperate excel sheets are finally collated there are multiple combinations that exist.
The form and idea I am describing are to allow a few informed individuals to correct the system. (which now contains 600 thousand items split across 13000 categories)
Obviously controlling the original data entry would have been ideal - but I am not involved in a manner that allowed that.

If I can create the form and system I describe then the cleaning of the data can be distributed across a number of informed people. (who know the equipment but not databases or even advanced excel)
Hi. In case you're still interested, the two algorithms I mentioned earlier were Soundex and Levenshtein Distance.
 

zfar

New member
Local time
Today, 02:40
Joined
Aug 13, 2017
Messages
17
Hi. In case you're still interested, the two algorithms I mentioned earlier were Soundex and Levenshtein Distance.
Thanks - I will look at those - I suspect Soundex will be the most appropriate
 

moke123

AWF VIP
Local time
Today, 05:40
Joined
Jan 11, 2013
Messages
3,912
Following DBG's suggestion, here's a sample search using both soundex and levenschtein distance.
 

Attachments

  • searchdb.zip
    105.8 KB · Views: 109

moke123

AWF VIP
Local time
Today, 05:40
Joined
Jan 11, 2013
Messages
3,912
There are multiple duplicates of this due to misspelling
"4 drawer Filing Carbinet"
"4 drawer Filling Cabinet"
"4 draw Filing Cabinet"
looking at this sample data, the levenschtein distance may be a good choice. With LD you can adjust the "distance" (# of incorrectly placed characters) It may be helpful to start at 1 or 2 and increase it in subsequent runs, or vis versa.
 

zfar

New member
Local time
Today, 02:40
Joined
Aug 13, 2017
Messages
17
looking at this sample data, the levenschtein distance may be a good choice. With LD you can adjust the "distance" (# of incorrectly placed characters) It may be helpful to start at 1 or 2 and increase it in subsequent runs, or vis versa.
Very useful - this will definitely form part of my solution
 

MajP

You've got your good things, and you've got mine.
Local time
Today, 05:40
Joined
May 21, 2018
Messages
8,525
@MajP provided some examples based on both Soundex and Levenschtein
FYI after doing that I concluded that neither of these are a worthwhile solutions. There are almost no cases where you cannot find it much faster and more likely doing a wildcard search; "*some text*". There could be some real extreme cases where it sounds the same but spelled very different, but these are extremely few and far between. You are wasting your time with Soundex and Levenschtein as I show in that thread. The find as you type listbox using a wildcard search is much faster and more likely to give a usable solution.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 05:40
Joined
Feb 19, 2002
Messages
43,223
I would start with using a group by query and creating a new lookup table with the contents of the field in question along with a count field. Then add a new column called ChangeTo.

Next step is to add the ObjectTypeID field to the original table. Run an update query that joins the existing text to the lookup table on the text field and updates the ObjectTypeID with the generated TypeID field. Once you have validated the results, delete the old text field. Change your forms to use a combo with the new lookup table as its RowSource.

Next step, try to match items. Given your three items, The ChangeToID will be set to 2 because that is the one that is correct.

Every time you find a group and want to update them, run an update query that joins the ObjectTypID to the lookup table and then replaces ObjectTypeID with ChangeTo.

Now your list is shorter. Just keep working through the list until you get rid of the dupes. 600,000 is a lot of records but you probably have only a few hundred different fields and not all will be duplicates.

This method allows you to work over time to clean up the list. Make sure that the combo is set to LimitToList. If someone needs to add a new item to the list, have them do it through a maintenance form and restrict access to that form to people who can spell:)
 

zfar

New member
Local time
Today, 02:40
Joined
Aug 13, 2017
Messages
17
I would start with using a group by query and creating a new lookup table with the contents of the field in question along with a count field. Then add a new column called ChangeTo.

Next step is to add the ObjectTypeID field to the original table. Run an update query that joins the existing text to the lookup table on the text field and updates the ObjectTypeID with the generated TypeID field. Once you have validated the results, delete the old text field. Change your forms to use a combo with the new lookup table as its RowSource.

Next step, try to match items. Given your three items, The ChangeToID will be set to 2 because that is the one that is correct.

Every time you find a group and want to update them, run an update query that joins the ObjectTypID to the lookup table and then replaces ObjectTypeID with ChangeTo.

Now your list is shorter. Just keep working through the list until you get rid of the dupes. 600,000 is a lot of records but you probably have only a few hundred different fields and not all will be duplicates.

This method allows you to work over time to clean up the list. Make sure that the combo is set to LimitToList. If someone needs to add a new item to the list, have them do it through a maintenance form and restrict access to that form to people who can spell:)
Yes, this is exactly the outline that I think I need to use. Has there ever been a template created for this that you know of.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 05:40
Joined
Feb 19, 2002
Messages
43,223
Not that I am aware of. These tasks always seem to be unique in practice even though you can generically describe a process.
 

Users who are viewing this thread

Top Bottom