I have a person table with completely inconsistent location data, and I need to consolidate/fix it. Looking for ideas and methods.

Bettany

Member
Local time
Today, 03:55
Joined
Apr 13, 2020
Messages
44
I'm building a database that's importing Excel contact data into it, and the data (which consists of many thousands of records) is very inconsistent. In the Excel doc, there's a "coverage" field ("coverage" indicates the area a particular contact covers/manages, so it can be multiple geographic locations) and this coverage field has wildly inconsistent entries: sometimes it's just the country "England", sometimes a state "New York", sometimes a region "Northwest", sometimes it's a combination "New York City, New York." In summary, there's no consistency at all. I believe I have two main options for cleaning this up and normalizing: (please provide thoughts on if there are more):

  1. Create many to many relationships for each type of data I find: people_country, people_city, people_region, people_state, etc.
  2. Denormalize all coverage entries into a "locations" table, and then create a many to many relation between people_locations, with a lookup table that provides metadata for each location
Never encountered something quite like this situation before, so wanted to ask the community for pros and cons on these two methods, or an entirely different method.
 
3. Make a mapping table.

You need a table that has 2 fields:

InputCoverage, OutputCoverage

InputCoverage holds every single value you ever receive in that Coverage field. OutputCoverage is what you want it to ultimately be. For example:

InputCoverage, OutputCoverage
New York, New York
NY, New York
NYC, New York
DC, Washington D.C.
Washington D.C., Washington D.C.
Wash DC, Washington D.C.

You do this by making a MAKE TABLE query from your input file using:

Code:
SELECT InputTable.coverage AS InputCoverage, InputTable.coverage AS OutputCoverage INTO CoverageMapping
FROM InputTable
GROUP BY InputTable.coverage, InputTable.coverage;

Once made open that table, sort by InputCoverage and manually go through them and clean it up by merging various InputCoverage values into standard OutputCoverage values like I demonstrated above. Once done you can then use that table to convert the values you receive to the values you want.
 
3. Make a mapping table.

You need a table that has 2 fields:

InputCoverage, OutputCoverage

InputCoverage holds every single value you ever receive in that Coverage field. OutputCoverage is what you want it to ultimately be. For example:

InputCoverage, OutputCoverage
New York, New York
NY, New York
NYC, New York
DC, Washington D.C.
Washington D.C., Washington D.C.
Wash DC, Washington D.C.

You do this by making a MAKE TABLE query from your input file using:

Code:
SELECT InputTable.coverage AS InputCoverage, InputTable.coverage AS OutputCoverage INTO CoverageMapping
FROM InputTable
GROUP BY InputTable.coverage, InputTable.coverage;

Once made open that table, sort by InputCoverage and manually go through them and clean it up by merging various InputCoverage values into standard OutputCoverage values like I demonstrated above. Once done you can then use that table to convert the values you receive to the values you want.

This is absolutely brilliant! Thank you! Once I've eliminated inconsistencies on how people inputted the same location, how best to tackle the table relationships? I suppose I ultimately need a many to many (mtm) relationship between people and the locations they cover. So I could create City, State, Country, Region, etc tables. And if a person only has a city listed, then I create an entry in the person_city junction table, and so forth and so on for other geographic entities? Or do I ultimately need a person_location table that maps every coverage value (previously entered on the Excel sheet but now corrected)?
 
Last edited:
I'd need a better big picture view of what you are doing to advise properly--what does this organization this database is for do? What does this database help accomplish within that?

So I create City, State, Country, Region, etc tables

I generally don't like over-normalizing location data. I'd only have one location table that lists city/state/country/region in one record, not a seperate table for each entity. Then if you have a many-many relationship between people and locations, I'd just make that location table, a person table and a junction between the two for a total of 3 tables.

In fact, that mapping table could be your full location table. Just include the InputCoverage field and then put all other location fields in it as well.
 
I'm building a database that's importing Excel contact data into it, and the data (which consists of many thousands of records) is very inconsistent. In the Excel doc, there's a "coverage" field ("coverage" indicates the area a particular contact covers/manages, so it can be multiple geographic locations) and this coverage field has wildly inconsistent entries: sometimes it's just the country "England", sometimes a state "New York", sometimes a region "Northwest", sometimes it's a combination "New York City, New York." In summary, there's no consistency at all. I believe I have two main options for cleaning this up and normalizing: (please provide thoughts on if there are more):

  1. Create many to many relationships for each type of data I find: people_country, people_city, people_region, people_state, etc.
  2. Denormalize all coverage entries into a "locations" table, and then create a many to many relation between people_locations, with a lookup table that provides metadata for each location
Never encountered something quite like this situation before, so wanted to ask the community for pros and cons on these two methods, or an entirely different method.
Take a step back first. Could coverage actually be a whole country or a whole state or a whole city or a whole region within a country or a state? Just be careful before you assume coverage can only be fined as one thing. You could have a locations or coverage table with a field named LocationDescription with entries of "England" or "New York, New York" or "The Pacific Northwest". They could be perfectly valid coverage areas.

Also, could each person have multiple locations attached to them? Or does each location or coverage area have multiple people attached to it?

Does this post relate to your other post with respect to questionnaires? I think we need to see your actual table and relationship design before going any further. It sounds like you may be unsure as to how to design your project.
 
Last edited:
When you say you are building the database, is this for your company? You really need to get an understanding of what they are trying to do with the location, and whether it needs fixing at all, which is similar to what @LarryE just said.
 

Users who are viewing this thread

Back
Top Bottom