identifying a potential primary key given unknown data (challenging!)

ronstar
Hi Everyone,

I have a challenging problem (at least it is for me).

I have a table with about 200 columns, which are not logically named. There are 1.6 million records in the table. No column alone can form a PK, as all of them have duplicate data.

I do not know how many columns I need to combine in order to form a PK.

I do know that there is a combination of columns that will be unique and thus usable as a PK.

The good news is that there are many binary columns which can be discarded! ;o)

Does anyone know how to figure out the optimal PK?

Thank you kindly
Ron
 
As with 99% of tables, the optimal primary key is an autonumber. Add a new field of that datatype and you've solved your problem.
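
For example (a sketch; MyTable, ID, and PK_MyTable are placeholder names, and COUNTER is Jet SQL's autonumber type):

Code:
ALTER TABLE MyTable ADD COLUMN ID COUNTER CONSTRAINT PK_MyTable PRIMARY KEY;

Run that from a query in SQL view and the new column numbers every existing row automatically.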
 
Hi,

Unfortunately that won't do it, because there are two tables with similar data and we need to run the process twice in order to find the primary key on each. That will allow us to relate the tables via the keys.

Thanks
 
I don't think plog is quite right. An autonumber key may be useful going forward, but it doesn't help your analysis.

The thing is - this is your data. Surely you MUST know which columns you need to take together to uniquely identify a single record/row. If you do not know this, how can we?
 
Hi,

You are right that I should, but I don't, because there are 1.6 million records of raw data. I know that all columns have duplicates, so I can't use a single column.

Figuring out which combination of columns would produce a unique set of data, allowing me to establish a PK, is virtually impossible to do manually. It is raw data, so I don't really know its significance, so to speak. That is why I am hoping (very much) that someone knows an algorithm that can find the optimal PK.

Thanks again
Ron
 
So the issue isn't what you initially posted. Let me see if I understand the actual issue:

1. You have 2 tables.
2. You want to merge these two tables.
3. The result of this merge should leave you with unique rows.

Are any of my assumptions incorrect? If so, let me know. If not, then what you need to do is:

1. Create a union query bringing all the data of these tables together.
2. Create a query based on that union query bringing in just the data you want in your new unduplicated table.
3. Group on every field of that query.
4. Make a new table using that query's data.

The above process will leave you with a table full of records that are unduplicated among every field. Again, if any of my 3 initial assumptions are wrong let me know.
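
In SQL terms, the steps above might look like this (a sketch, assuming the two tables are named TableA and TableB and share the same field list; only three fields shown). First the union query, saved as, say, qryUnion:

Code:
SELECT Fld1, Fld2, Fld3 FROM TableA
UNION
SELECT Fld1, Fld2, Fld3 FROM TableB;

Then the make-table query over it:

Code:
SELECT qryUnion.* INTO Deduped
FROM qryUnion;

Note that UNION (without ALL) already drops rows that are duplicated across every field, so the grouping in step 3 happens implicitly here.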
 
Sorry, but I struggle to see how you can have a data file yet not understand the content of that file. As I say, if you do not understand the data, how could anyone else?
 
I'm with Dave on this. Take any given record: what do the columns mean to you?
200 columns in a single table is NOT common.

What does this really mean: "I know that all columns have duplicates"? Can you show us an example?

You may have to take each field (column) name and define it, then work on a series of queries to break the data into meaningful tables.

If these columns have no meaning to you, then what is the purpose of the exercise?
 
Both tables have a lot of raw data and values (inputs), plus results of some functions that were run on this data (outputs). The functions that were run on each table are different, and therefore so are the outputs. I want to relate the tables so that I can relate the results.

Therefore merging the tables won't help.

I am not tasked with producing the tables and therefore do not understand the data. My task is only to relate the tables so that the results can be related. Due to the massive size of these tables, the PK has to be efficient (i.e. I cannot just take 100 columns and make a PK from them, since working with the results would then require too many resources).

I hope the motivation is clearer now. I did write that this was challenging... :o)

Unfortunately I cannot show an example of the data, as there is just too much of it and it is confidential. Therefore I would have to make it up, which is not possible.

I need an efficient algorithm to find the smallest PK possible.

Thanks
Ron
 
Sorry, but that doesn't help me. You have a bunch of data and don't know what it represents and are trying to relate two large bunches of same. You're closer to the issue than we are, and you have no idea what it means; so what do you realistically want from us?
 

Hi
What I need is an algorithm that will find the most efficient PK.

Thanks
 
Yes, you can write some code that iterates through all the combinations of fields. But all you will have achieved is finding a unique identifier for that particular dataset. Since this unique identifier has been selected arbitrarily, it would have no relationship whatsoever with any other dataset. It would be a pointless exercise.

Consider this:

Table 1
Code:
Fld1    Fld2     Fld3
A         10       xx
A         11       xy
A         12       xz
B         10       mm

Fld1 & Fld2 will uniquely identify the above data. But so will Fld1 and Fld3. Indeed, so will Fld2 and Fld3.

So which do you use?

Consider Table 2
Code:
Fld1    Fld2     Fld3
A         40       xx
A         41       xy
A         12       ff
A         11       ff

You could use any of the keys mentioned for Table 1 to relate to Table 2, but you will get different results for each.

You have to go to the providers of the data and get them to tell you how your tables are related.

Chris
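
For what it's worth, the per-combination test such code would run is simple: a set of columns is a candidate key exactly when its number of distinct value combinations equals the table's row count. A sketch in Access SQL (BigTable, Fld1, and Fld2 are made-up names):

Code:
SELECT Count(*) AS TotalRows FROM BigTable;

SELECT Count(*) AS DistinctRows
FROM (SELECT DISTINCT Fld1, Fld2 FROM BigTable) AS T;

If the two counts match, {Fld1, Fld2} uniquely identifies every row of that table; if not, the combination is rejected and the search moves on. Chris's caveat still stands: uniqueness on one table says nothing about how it relates to the other.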
 

Assuming that every row is unique (i.e. no completely duplicated rows), then clearly the entire row is a candidate primary key.

Other than that, you are on your own. Given that 200 columns yield 2^200 possible column combinations, it is really up to you to first pinpoint the "key" columns, if you will.
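
A practical way to start pinpointing them is to rank columns by cardinality: a column whose distinct-value count approaches the 1.6 million row count is strong key material, while the binary columns mentioned earlier can be dropped outright. A sketch, run once per candidate column (names made up):

Code:
SELECT Count(*) AS DistinctValues
FROM (SELECT DISTINCT Fld17 FROM BigTable) AS T;

Combining the highest-cardinality columns first, and extending a combination only while it still contains duplicates, keeps the number of tests manageable.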
 
