Form to handle duplicates (1 Viewer)

zfar · Jun 24, 2020

I have a large inventory database.

There is a table of type of object eg:
"4 drawer Filing Cabinet".

There are multiple duplicates of this due to misspelling
"4 drawer Filing Carbinet"
"4 drawer Filling Cabinet"
"4 draw Filing Cabinet"

The primary key of this table has then been used in other places in the database to assign other information.

I would like to have a form where one can identify duplicates such as this in a database.
There would probably be a list box where the user could assign several of them as equivalent.
The user would then declare one of the variations "correct"
The form would then run an update query on all the foreign keys that used the incorrect variations and update to the "correct" primary key.
I then want to delete (or probably in proper database design declare redundant) all incorrect variations.
This would then eventually allow for properly normalised data.

I understand the principles of how this would be done. (Filter datasheets from test boxes, generate update queries reading the primary key off, search the database for uses of the primary key.) However, building this would take me a very long time.

This seems such a fundamental data cleaning task I would have thought there would be multiple templates and examples. However, I haven't found any - maybe it is just my bad searching.

Any help would be appreciated.

theDBguy · Jun 24, 2020

Hi. I don't know of an existing template either, but do you already have a logic for listing duplicate candidates? If not, maybe I can point you to a couple of algorithms.

plog · Jun 24, 2020

However, building this would take me a very long time

Absolutely, the system you described would take anyone a long time. The real question is why must that duplicate system exist?

Why must it be so formal? A system as polished as you propose assumes that these duplicates will be an ongoing issue, that this system won't just be used once. It just accepts that new duplicates will continue to be entered into the database. Why? Why can't the database be built that prevents duplicates going forward?

If you can achieve that then you don't need such a duplicate correction system. If you can prevent duplicates going forward you just need to clean the data once. If you just need to clean the data once you don't need such a polished system, you just need to fix the duplicates somehow. Worse case, that could be achieved by 2 people-- a database person to run ad hoc queries to help find "duplicates" and a user who knows the data well enough to identify duplicates. If you have one database person who knows the data, they coudl do it all by themselves.

Pat Hartman · Jun 25, 2020

One of the things that causes this type of issue is allowing users to modify combos on the fly. If the item they type doesn't exist, you can use the NotInList event to allow the user to add a new value so he adds his typo.

I prefer to limit the maintenance of lists to users who are designated as admins. These people have better training and know enough to search

zfar · Jun 27, 2020

There is a long and painful story of why this data exist in this format. Let's just say copies on an excel system for tracking equipment inventories sent out to multiple sites with semiliterate individuals completing the data (and refusing assistance from anyone with knowledge). When all the seperate excel sheets are finally collated there are multiple combinations that exist.
The form and idea I am describing are to allow a few informed individuals to correct the system. (which now contains 600 thousand items split across 13000 categories)
Obviously controlling the original data entry would have been ideal - but I am not involved in a manner that allowed that.

If I can create the form and system I describe then the cleaning of the data can be distributed across a number of informed people. (who know the equipment but not databases or even advanced excel)

theDBguy · Jun 27, 2020

zfar said:
There is a long and painful story of why this data exist in this format. Let's just say copies on an excel system for tracking equipment inventories sent out to multiple sites with semiliterate individuals completing the data (and refusing assistance from anyone with knowledge). When all the seperate excel sheets are finally collated there are multiple combinations that exist.
The form and idea I am describing are to allow a few informed individuals to correct the system. (which now contains 600 thousand items split across 13000 categories)
Obviously controlling the original data entry would have been ideal - but I am not involved in a manner that allowed that.

If I can create the form and system I describe then the cleaning of the data can be distributed across a number of informed people. (who know the equipment but not databases or even advanced excel)

Hi. In case you're still interested, the two algorithms I mentioned earlier were Soundex and Levenshtein Distance.

zfar · Jun 27, 2020

theDBguy said:
Hi. In case you're still interested, the two algorithms I mentioned earlier were Soundex and Levenshtein Distance.

Thanks - I will look at those - I suspect Soundex will be the most appropriate

theDBguy · Jun 27, 2020

zfar said:
Thanks - I will look at those - I suspect Soundex will be the most appropriate

"Sounds good!"

Good luck!

moke123 · Jun 27, 2020

Following DBG's suggestion, here's a sample search using both soundex and levenschtein distance.

moke123 · Jun 27, 2020

zfar said:
There are multiple duplicates of this due to misspelling
"4 drawer Filing Carbinet"
"4 drawer Filling Cabinet"
"4 draw Filing Cabinet"

looking at this sample data, the levenschtein distance may be a good choice. With LD you can adjust the "distance" (# of incorrectly placed characters) It may be helpful to start at 1 or 2 and increase it in subsequent runs, or vis versa.

zfar · Jun 27, 2020

moke123 said:
looking at this sample data, the levenschtein distance may be a good choice. With LD you can adjust the "distance" (# of incorrectly placed characters) It may be helpful to start at 1 or 2 and increase it in subsequent runs, or vis versa.

Very useful - this will definitely form part of my solution

isladogs · Jun 27, 2020

See this thread https://www.access-programmers.co.uk/forums/threads/matching-logic.304471/ where @MajP provided some examples based on both Soundex and Levenschtein

MajP · Jun 28, 2020

@MajP provided some examples based on both Soundex and Levenschtein

FYI after doing that I concluded that neither of these are a worthwhile solutions. There are almost no cases where you cannot find it much faster and more likely doing a wildcard search; "*some text*". There could be some real extreme cases where it sounds the same but spelled very different, but these are extremely few and far between. You are wasting your time with Soundex and Levenschtein as I show in that thread. The find as you type listbox using a wildcard search is much faster and more likely to give a usable solution.

Pat Hartman · Jun 29, 2020

I would start with using a group by query and creating a new lookup table with the contents of the field in question along with a count field. Then add a new column called ChangeTo.

Next step is to add the ObjectTypeID field to the original table. Run an update query that joins the existing text to the lookup table on the text field and updates the ObjectTypeID with the generated TypeID field. Once you have validated the results, delete the old text field. Change your forms to use a combo with the new lookup table as its RowSource.

Next step, try to match items. Given your three items, The ChangeToID will be set to 2 because that is the one that is correct.

Every time you find a group and want to update them, run an update query that joins the ObjectTypID to the lookup table and then replaces ObjectTypeID with ChangeTo.

Now your list is shorter. Just keep working through the list until you get rid of the dupes. 600,000 is a lot of records but you probably have only a few hundred different fields and not all will be duplicates.

This method allows you to work over time to clean up the list. Make sure that the combo is set to LimitToList. If someone needs to add a new item to the list, have them do it through a maintenance form and restrict access to that form to people who can spell

zfar · Jul 11, 2020

Pat Hartman said:
I would start with using a group by query and creating a new lookup table with the contents of the field in question along with a count field. Then add a new column called ChangeTo.

Next step is to add the ObjectTypeID field to the original table. Run an update query that joins the existing text to the lookup table on the text field and updates the ObjectTypeID with the generated TypeID field. Once you have validated the results, delete the old text field. Change your forms to use a combo with the new lookup table as its RowSource.

Next step, try to match items. Given your three items, The ChangeToID will be set to 2 because that is the one that is correct.

Every time you find a group and want to update them, run an update query that joins the ObjectTypID to the lookup table and then replaces ObjectTypeID with ChangeTo.

Now your list is shorter. Just keep working through the list until you get rid of the dupes. 600,000 is a lot of records but you probably have only a few hundred different fields and not all will be duplicates.

This method allows you to work over time to clean up the list. Make sure that the combo is set to LimitToList. If someone needs to add a new item to the list, have them do it through a maintenance form and restrict access to that form to people who can spell

Yes, this is exactly the outline that I think I need to use. Has there ever been a template created for this that you know of.

Pat Hartman · Jul 14, 2020

Not that I am aware of. These tasks always seem to be unique in practice even though you can generically describe a process.

Form to handle duplicates (1 Viewer)

zfar

New member

theDBguy

I’m here to help

plog

Banishment Pending

Pat Hartman

Super Moderator

zfar

New member

theDBguy

I’m here to help

zfar

New member

theDBguy

I’m here to help

moke123

AWF VIP

Attachments

moke123

AWF VIP

zfar

New member

isladogs

MVP / VIP

MajP

You've got your good things, and you've got mine.

Pat Hartman

Super Moderator

zfar

New member

Pat Hartman

Super Moderator

Similar threads

Users who are viewing this thread