Isaac
Lifelong Learner
- Local time
- Today, 04:27
- Joined
- Mar 14, 2017
- Messages
- 8,777
I have a small 'reconciliation' project where I am going to need to routinely compare two datasets. I have zero authority, influence or control over the data structure of these two data sets. Worse yet, they don't have any sort of Key values, and my task will be to compare them on a text field which is common to both. I'll call it TextField, from Table1 and Table2.
To complicate things to the point of absurdity, almost, in some cases a related record has one particular value in Table1.Textfield, and a slightly different value in Table2.Textfield. (Of course all this implicates a major redesign of these systems, or a new one to replace them, but I do not have that option right now - my job is only to create a widget for monthly ingestion of the two datasets and some reconciliation reports, pointing out discrepancies on other columns, and using Textfield to link them).
I'm able to create a process now (at the beginning of this adventure!), for the business to tell me "whenever you see Value1 in Table1, it should be considered the same as Value2 in Table2". Approved pairings, so to speak.
I'll need to incorporate this new intelligence to perform joins of some kind in the future. For example, Table1.Textfield might say "Western US Division" and Table2.Textfield might say "US Division of the West". The business will "start me out" by letting me know that those 2 should be paired as if they are the same.
We're talking about pretty small data here. A few hundred records at a time in any one of the 2 tables.
I'm trying to think of the best way to structure this new 'intelligence' on 'pairing' from the business and unsure which would be theoretically best or most scalable/efficient.
Should I maintain a table with just 2 columns.....showing Table1.Textfield and Table2.Textfield values? If so, how in the world would I join anything to this?
Or should I create a table with 3 columns....the approved pair, and a third meaningless key value.....and then create a process whereby, during my once-a-month ingestion of these tables' data, each of the two incoming tables gets represented with their source data, plus a new column: TextFieldNew--and TextFieldNew contains the meaningless key value - thus paving the way for a normal join thereafter?
This task is working with very small data, but my mind wants me to come up with a textbook-sound process anyway.
Feel free to provide any ideas you might do in this scenario?
To complicate things to the point of absurdity, almost, in some cases a related record has one particular value in Table1.Textfield, and a slightly different value in Table2.Textfield. (Of course all this implicates a major redesign of these systems, or a new one to replace them, but I do not have that option right now - my job is only to create a widget for monthly ingestion of the two datasets and some reconciliation reports, pointing out discrepancies on other columns, and using Textfield to link them).
I'm able to create a process now (at the beginning of this adventure!), for the business to tell me "whenever you see Value1 in Table1, it should be considered the same as Value2 in Table2". Approved pairings, so to speak.
I'll need to incorporate this new intelligence to perform joins of some kind in the future. For example, Table1.Textfield might say "Western US Division" and Table2.Textfield might say "US Division of the West". The business will "start me out" by letting me know that those 2 should be paired as if they are the same.
We're talking about pretty small data here. A few hundred records at a time in any one of the 2 tables.
I'm trying to think of the best way to structure this new 'intelligence' on 'pairing' from the business and unsure which would be theoretically best or most scalable/efficient.
Should I maintain a table with just 2 columns.....showing Table1.Textfield and Table2.Textfield values? If so, how in the world would I join anything to this?
Or should I create a table with 3 columns....the approved pair, and a third meaningless key value.....and then create a process whereby, during my once-a-month ingestion of these tables' data, each of the two incoming tables gets represented with their source data, plus a new column: TextFieldNew--and TextFieldNew contains the meaningless key value - thus paving the way for a normal join thereafter?
This task is working with very small data, but my mind wants me to come up with a textbook-sound process anyway.
Feel free to provide any ideas you might do in this scenario?