PKs, FKs, Candidate Keys and teasing apart an already populated master table

tripptofer · Registered User · Joined Aug 12, 2009 · Messages: 20
I've posted here before and thought I made some progress, but it would seem that there is still something very fundamental that I am simply "not getting".

I have tried to use the examples in the Access 2003 Bible and even the results of Analyze Table to see automatically how Access wants to do things. At first I thought I simply couldn't figure out how to get Access to do what I wanted, but I now think it is a more fundamental problem with defining and then setting up candidate/super/foreign keys.

Problem: import grants data from a master spreadsheet into Access, then tease the information apart to form subtables for the disbursements/grants themselves, grantors (who gave it), investigators (who got it) and recipient institutions (where they were at the time), plus city/state/country.

From the data source, there already exists a primary key for the disbursements (RCID). There are also “unique identifiers” from the NIH for grantor, investigator and recipient institution. Unfortunately there are also many synonyms for investigator names and many have multiple “unique IDs”.

I have teased out the unique investigator names, standardized them to the longest unique name and placed multiple “unique IDs” in separate columns.

I have also standardized recipient institution names, but stopped at the school/institute level, as recipient organization IDs are at present only at the school/organization level. Any greater granularity, while interesting, I think should wait until I "get it".

I understand primary keys, but I think I am getting hung up (actually I know I am) on foreign keys, candidate keys and superkeys: how to make them, how to set up the proper relationships, knowing that much of this depends upon structuring the database as a whole from the outset. I feel that I'm getting bogged down in semantics, but then different things are often called different names because they are indeed different, if even subtly so…

I would like to be able to use referential integrity to backfill the many holes that exist, as well as to reconcile the master grants table with the new standardized investigator and recipient names, i.e., UCSD, UC San Diego and University of California at San Diego should all be set to the last example.

If I can get out of my rut and find out how to do one of these then I am hoping that the rest will fall into place…


I will also attach the original list of attributes from the spreadsheet and how I have tried to tease them apart. Putative PKs are in bold and FKs are in italics (is this even right?). The bits in square brackets are just headings and don't exist; in the original sheet like attributes are all over the shop and a good many are null, so this is just me trying to get organized. There's also a bit on the last page of my attempts to make sense of all of the potential relationships so that things are properly structured from the outset.



Eager to get out of this rut as I believe once I do that I'll be off to the races again!


Cheers in advance,

christopher
 

Attachments

I looked at that briefly. I noticed a theme that perhaps should be addressed.

Per your comments, if something was italicized, you thought it might be a PK. In more than one of your lists, I noted a xxxID field and an xxxName field BOTH italicized. This is overkill. If the xxxID will be unique, the xxxName field should not be part of the key. In normalization, you can sometimes find multiple candidate keys (to be the PK). When that happens, you do NOT make them both the PK. You pick one. The other becomes an ordinary attribute and you are done. The xxxID is probably a better choice if it is numeric because the indexes based on the PK are smaller for numbers than for alphabetics. Smaller keys = faster searches.
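To make the "pick one candidate key, demote the other" idea concrete, here is a small sketch in SQLite (via Python) rather than Access; the table and field names are invented for illustration. The numeric ID becomes the PK, while the name keeps a unique index but is otherwise an ordinary attribute:

```python
import sqlite3

# Sketch: grantor_id is chosen as the PK; grantor_name was also a
# candidate key, but is demoted to an ordinary (still unique) attribute.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE grantor (
        grantor_id   INTEGER PRIMARY KEY,   -- chosen candidate key
        grantor_name TEXT NOT NULL UNIQUE   -- demoted candidate key
    )""")
conn.execute("INSERT INTO grantor VALUES (1, 'NIH')")

# The demoted key still guards against duplicates...
try:
    conn.execute("INSERT INTO grantor VALUES (2, 'NIH')")
except sqlite3.IntegrityError:
    print("duplicate name rejected")

# ...but lookups and joins go through the small numeric PK.
print(conn.execute(
    "SELECT grantor_name FROM grantor WHERE grantor_id = 1").fetchone()[0])
```

The same shape works in Access: make the ID the primary key, and give the name field a unique (no-duplicates) index without putting it in the key.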

Superkeys? It's a buzz-word for most Access situations. And the last time we had a lengthy discussion in this forum on the concept, I recall the consensus that superkeys were generally redundant.

Use of FKs? They are just pointers to a corresponding record in some other table. They are copies of the PK of that record. Given the relationship, Access can quickly find the right record - because to be a valid FK pointing into table A in the strict technical sense, it must be the PK of table A - because with a PK, the record can be found very quickly.

Here's a question that I've seen before, so I'll anticipate it. You can in theory find a record based on ANY unique index, so why does the FK in table B pointing to table A have to be the PK of table A? Normalization as it relates to candidate keys is why.

Suppose, for example, you had the case I mentioned above where the name was unique and so was the ID field. You could make a unique index for both. Both would be candidates for PK. But in the end analysis, you must pick one because you want to minimize the size of all PKs to maximize search speed. Now the normalization rule kicks in. When defining the values of each record's fields, the rule about keys is that the value of each non-key field must depend only on the value of the PK and nothing else. This rule "demotes" the candidate key to "ordinary field" status. Once that happens, you really should treat it consistently with other non-key fields, as otherwise you invite confusion.

A second reason is that if you define relationships between two tables, Access demands that you use the PK on the one that is independent; the other one automatically becomes dependent.


I feel that I'm getting bogged down in semantics, but then different things are often called different names because they are indeed different, if even subtly so…

This is a CRUCIAL situation to resolve BEFORE you go much farther. If two things are different because of many strong similarities and one minor difference, perhaps they are the same and the difference merely needs to be recorded as a field that is sometimes empty in other records. In other words, seek to combine similar items when all that is needed to support their combination is a couple of extra fields describing the differences as "attributes" of the entity being described. It is OK to have empty slots sometimes. At least, it is OK when the result allows you to merge two or more previously disparate tables and thus simplify your query structure.
 
Thank you for this Doc Man!

I've now rejigged my overall design (attached, pg 3) taking on board all of your comments and standardizing my notation (and naming, which I hadn't realised was still so sloppy!).

I've tried to split out all repetitive data into their own subtables and then add FKs linking back to the PKs. Generic data also squats in their own subtables.

I think I was adding IDs + names because I wouldn't be able to tell, just from the IDs themselves, what they were in a subtable. Also I was trying to link, say, GRANTOR_ID to City/State/Country, making a 3-attribute FK/candidate key, which is not only awkward (see below) but going about things fundamentally the wrong way!

I think the point you made regarding PKs/FKs/candidate keys is that while you can have a multi-attribute candidate key, if you give each unique combination of, say, city/state/country its own PK and then use that as an FK wherever you need to reference them, it makes the whole process faster down the road (looking for a single unique number as opposed to a unique combination of 3 text attributes)?

I now realise that this is NOT the function of tables, and that I can always go back in and tease out exactly what I want through a query and then add, say grant/grantor/recipient type through a form.

Let me know if you think this structure looks better as I know there's precious little point in moving forward with a duff design.

Again, thanks for the swift nudge out of my rut and for all of your kind assistance...

As an ex-structural biochemist, I appreciate that form begets function. I think I was trying to run before I was ready to stand, hence banging my head on the vertical part of the learning curve!

Cheers again and do let me know if you spot something else, anything else you think I might be not quite getting...

-christopher
 

Attachments

Just to expand a bit on what Doc Man said, in case further definitions help.

A Superkey is a set of attributes that are collectively unique. Taken together the set of values for those attributes will be different on every row.

A Candidate key is a superkey which is also minimal. In other words if you took any one attribute away from it then it wouldn't be unique any more.

Uniqueness of candidate keys is normally enforced through unique indexes and/or constraints in the database.

The design principles of Normalization are concerned with candidate and superkeys only. However, it is common (though very incorrect) for people to discuss normalization in terms of primary keys only - a primary key being just "any one" of the candidate keys.

Most DBMSs enforce the rule that "foreign key" constraints may reference any superkey with a unique index on it and this applies equally to MDB/Accdb databases. Primary keys are therefore entirely optional for referential integrity purposes. Having said that, convention very strongly dictates that every table have one of its candidate keys designated as a primary key. If a table has more than one key that is being enforced then most database designers will naturally choose just one key to be the primary one even though that choice generally has little or no practical significance in the database. Identifying it as a primary key at least acts as a reminder - usually (not always) indicating that it is the only candidate key referenced by foreign keys in other table(s).
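That rule about foreign keys referencing any uniquely indexed key can be demonstrated outside Access; here is a sketch in SQLite (via Python), which enforces the same convention. Table and column names are made up for illustration:

```python
import sqlite3

# Sketch (SQLite, not Access): a foreign key may reference ANY column
# with a unique index, not only the primary key.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Parent table: surrogate PK plus a unique natural key (an NIH-style code).
conn.execute("""
    CREATE TABLE grantor (
        grantor_id INTEGER PRIMARY KEY,
        nih_code   TEXT NOT NULL UNIQUE
    )""")

# Child table references the UNIQUE nih_code, not the PK.
conn.execute("""
    CREATE TABLE grant_row (
        rcid     INTEGER PRIMARY KEY,
        nih_code TEXT NOT NULL REFERENCES grantor (nih_code)
    )""")

conn.execute("INSERT INTO grantor VALUES (1, 'R01')")
conn.execute("INSERT INTO grant_row VALUES (100, 'R01')")  # OK: parent exists

# Referential integrity is still enforced through the non-PK unique key:
try:
    conn.execute("INSERT INTO grant_row VALUES (101, 'XYZ')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```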
 
A comment regarding one of DPortas's definitions:

Given the definition as DPortas stated it, a CANDIDATE KEY is any superkey that cannot be reduced by losing one of its member fields without also losing uniqueness. Obviously, for a synthetic single-field key, this applies as the trivial case. For unique compound keys (multi-field keys), the term also applies, though I personally prefer to just say "unique compound key" in an Access context.

A second comment: I'm not entirely sure that I agree with DPortas on the comments regarding referential integrity (RI). I don't have Ac2k7 to look this up, but in earlier versions, RI depended on the existence of a formally defined relationship (via the Relationships window), and that was only allowed if one of the two participating tables was being related through its PK. I am unsure whether a relationship that does not enforce RI is free to rely on something other than the PK. Where possible, I always define a PK if there is even a slim-to-none chance that I'll do a lookup on it. I omit a PK if and only if I know for an absolute fact that the table in question will never be the one-side target of a many-to-one relationship.

I noticed a couple of things in your updated description that made me pause. One of them has to do with a single-field table called Currency.

First problem: Currency is a reserved word (name of a data type). Don't do that. Access really gets hinky when you use reserved words as names of user-defined objects. Worst part is you never quite know WHEN it is going to bite you on the butt, so you won't even know when to order the bandages and ointment for the bite marks. But the longer you wait to fix that, the deeper the bites will go.

Second problem: There is only one time that you use a single-field table and that is as a validation or selection lookup. If that is its intent, then OK. Like, you could say "US dollars" or "Japanese yen" or "Italian lira" etc. as a unit of payment, and if it isn't in that table, you don't accept that form of payment. If that is similar to what you were doing, then it is good. If you were doing anything else with that table, revisit it.

I've tried to split out all repetitive data into their own subtables and then add FKs linking back to the PKs. Generic data also squats in their own subtables.

This is a generally good idea. However, now comes a caution. If the Currency table was the result of this split-out process, you are possibly over-splitting. And yes, that is possible to do.

Here is a thought to consider. If the repetitive data is somehow also non-uniformly time-variant, you face the dilemma of whether you want to keep that in a separate table or not. The most common examples of this problem occur when someone has to take into account sales tax rates that vary from year to year and from one sales tax item category to another. Would you try to build a time-sensitive tax table or would you just store the current tax rate in the line-item table for each item sold? There is no guaranteed right answer, but there can surely be some wrong ones. I know your situation isn't quite the same, but maybe I illustrated the point for you.
 
After further review, your keywords entry in your grant table should also be split into a child table of keywords with the grant ID, perhaps some sort of discriminator ID, and the keyword value. To have a field in which you try to jam-cram a few keywords just makes your tables that much harder to search.
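The keyword split might look like this sketch (SQLite via Python; the grant, the keywords, and all names are invented for illustration): one row per grant/keyword pair instead of a jam-crammed text field, so a search becomes an exact match.

```python
import sqlite3

# Sketch of splitting a multi-keyword field into a child table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grants (rcid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE grant_keyword (
        rcid    INTEGER NOT NULL REFERENCES grants (rcid),
        keyword TEXT NOT NULL,
        PRIMARY KEY (rcid, keyword)
    );
    INSERT INTO grants VALUES (1, 'Amyloid imaging study');
""")

# Instead of one field holding 'amyloid; PET; biomarker', one row each:
for kw in ("amyloid", "PET", "biomarker"):
    conn.execute("INSERT INTO grant_keyword VALUES (1, ?)", (kw,))

# Searching a keyword is now an exact match, not a substring hunt.
hits = conn.execute(
    "SELECT rcid FROM grant_keyword WHERE keyword = 'PET'").fetchall()
print(hits)
```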

You've also got something suspicious under activity code, because you still are thinking inside a box. Looks like this in your presentation:

Created_by
Created_date
Updated_by
Updated_date

I might have done those four fields this way, just intended as an example:

ActionHistoryTable (as a child associated with the grants table)
RCID,
action code (will equate to create, update, review, suspend, etc.)
action date (when this action occurred.)
Whodunit-code (identifies who performed the action - if that is something you track.)

Then you can get the history of the actions for this grant by doing a SELECT with the appropriate joins and such, grouped by the RCID and within each RCID group, ordered by the action date. Then you have as many action codes as you need to distinguish what can happen with these grants, including review, 2nd-pair-of-eyes-review, update, reconcile, create, suspend, redirect, terminate, etc etc. Just an example of how to look at such things to find commonality where you first think there is none.
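The history query described above might be sketched like this (SQLite via Python; action codes, dates, and initials are invented, and the real thing would join out to code and people tables):

```python
import sqlite3

# Sketch of the ActionHistoryTable idea: one child row per action, and a
# query ordered by grant then date reconstructs each grant's history.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE action_history (
        rcid        INTEGER NOT NULL,   -- FK to the grants table
        action_code TEXT NOT NULL,      -- create, update, review, ...
        action_date TEXT NOT NULL,      -- when this action occurred
        who         TEXT                -- who performed the action
    );
    INSERT INTO action_history VALUES (100, 'create', '2009-01-05', 'cw');
    INSERT INTO action_history VALUES (100, 'update', '2009-03-12', 'cw');
    INSERT INTO action_history VALUES (101, 'create', '2009-02-01', 'dm');
""")

history = conn.execute("""
    SELECT rcid, action_code, action_date
    FROM action_history
    ORDER BY rcid, action_date
""").fetchall()
for row in history:
    print(row)
```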

Don't worry that it is slow to come to you. There will come a moment when the pieces fit together and there will be that blinding flash of light that says, "I understand why that is not normalized." (Then you'll wake up, scribble notes on the notepad by your bed, go back to sleep, and wonder the next morning what "mumblefratz gefrillnicht besopasnosty" means.)

Seriously, don't give up on it. Keep your eyes (and your mind) open to the suggestions of others. That way, as parts of our ideas get to you by osmosis or whatever, you will at least begin to understand this most crucial part among all parts of database design.
 
i took dportas's comments on keys as follows

lets say you have a sales order table with

pk - autonumber
unique key - customer account + order number

then i think dp is saying that you could use either of these keys to enforce referential integrity.
 
This is all gold- thank you all...

I shall take another crack at teasing out the various bits into more sensible groups of buckets, and have already changed CURRENCY to CURRENCY_ID, which is when I realised that the data in that attribute wasn't in an international standard as I'd assumed (it was USD, POUND, EURO instead of USD, GBP, EUR; hence balking at referential integrity).

I already now have subtables for GRANTOR, RECIPIENT and INVESTIGATOR (PI) which I know to be unique, and will spend the day getting my head around JOINs now that I think I've gotten most of my PKs and FKs correctly established. The task now is to tease out those instances where the main grants table is "wrong" and standardize those fields from the unique subtables.

I've been bouncing between theory and practice, SQL and Access, all of which has been utterly new to me since I first posted here in July.

Candidate keys and unique multi-attribute keys, while I understand them in theory, I think still confound me in practice. If a proper, robust and moreover numeric PK is in place, that should suffice instead of dragging multiple text fields along to search through. I don't think this feature is merely a hangover from traditional database design, hence my sense that I am still missing something crucial.

Just to give you all a bit of context, I analyze biomedical research grants worldwide to conduct landscape analyses of where monies are being spent within a particular therapeutic area. Last year, when I stumbled across this niche, the only tool I had at my disposal (without learning SQL from scratch) was Excel (I was on a Mac!), which did the trick after much bashing, but I knew there must be a better way... I need to get my main grants database in order so that I can take on yet another utterly-new-to-me tool, WEKA. It's been slow going, I know, but I am patient and know that once I get my head around it all, the sky is the limit!

Thank you again for all of your kind and patient assistance and will certainly let you know how I get on...!

Cheers!
christopher
 
gemma, I didn't read dportas's comments the way you did. If I'm wrong in how I read them, then I'm wrong. 'tain't the first time. 'twon't be the last time, I'm sure. But it brings up the old discussion about synthetic vs. natural keys.

If you have, say, XID as an autonumber and you also have customer number and order number in the same record, and if XID is a candidate key, and the pair of <customerID and orderID> is also a candidate key, you lead yourself to the interesting question, why does XID even exist in the record if you weren't planning to use it as the single-field PK?

My rule of thumb for database design is to include everything I am going to need and exclude anything I won't need. In line with this simple rule, I would eliminate XID if I were planning to use the compound key as a PK. Particularly if XID is an autonumber and I'm not going to use it as the PK, and if form follows function, and if XID no longer has any function, why would it exist? (Answer: Re-read the threads on synthetic vs natural keys and get your blood pumping on one side or the other of the running argument that started up on that topic a couple of years ago.)

Which is part of where I was going, I guess, in the discussion about choice of candidate keys and demotion of the unselected candidates to either "ordinary" fields or at most as indexed but non-PK fields. But then again, I've been known to jump to confusions before.

I agree with your interpretation of what dportas was trying to say about possibly enforcing RI using either key. Where I had a question was whether, once you had chosen the PK, you could enforce RI using the field that you DIDN'T choose as the PK, even if it was still unique. In other words, yes, if they are both candidate keys, you could use either for RI - but once you chose which candidate was the basis for RI enforcement, you HAD to have also chosen that field as the PK. Those two functions must go together, I think.

And my further point above is that once you made the choice of PK, if you didn't choose the synthetic key field then it has no reason to exist because you CAN'T build an RI-enforcing relationship with it. Not clear on whether you can build a non-RI-enforcing relation either, but in earlier versions of Access, you could not. Of course, if you are using an Access Front End and something else for the Back End besides the Jet engine, perhaps that restriction doesn't apply.

tripptopher, some of this is a purist-vs-pragmatist viewpoint that may be peripheral to your immediate problem, but trust me that it is important in the long run.

In any case, don't look at normalization as a goal. Look at it as a process. One that takes many passes to weed through the design. Otherwise you'll lose your sanity due to unreasonable expectations.

Another point that I'll bring up for you is, what do you do once you have designed your tables? Answer: Look to writing queries to re-merge the tables for the times when you will use the data therein. "What?" you say... "Why remerge what I just split?"

Answer: Because the queries only seem to re-merge, and you can merge the same data many different ways with many different queries. Just remember this sometimes misunderstood rule: Forms and reports are built on recordsets, not specifically on tables or queries. Tables are only one source of recordsets. Queries are the other biggie.

Queries will be your work-horses. It is in queries that you can do sorting, grouping, selection (and exclusion), and computation. It is with queries that you can rejoin multiple tables in specific ways. Queries are CHEAP AS DIRT to store in your database because they only take up a static and relatively small QueryDef entry until you actually try to open one. You can have queries for reports, different queries for forms, still more queries for special VBA code that you might use for batch-like operations...

You might have read elsewhere in this forum that you never (what, never?), well... hardly ever store computed data in a table. If you can compute it with a query from data available elsewhere in your DB, then why store it? And taking that attitude, you save space. As to the time it takes to recompute stuff, look at this little bit of math...

A disk spinning at 7200 RPM is 120 RPS, or about 8.33 milliseconds per revolution. The average expectation value for a data delay due to a spinning disk having to seek a new record is half that, or about 4.16 msec. On a 1 GHz system, 4.16 msec = 4.16 Million clock cycles, or about 2 million instructions. You can do an AWESOME amount of work in between disk seeks. So don't be afraid to let the queries recompute at least some stuff for you.
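That back-of-the-envelope arithmetic can be checked in a few lines (assuming, as the post does, a 1 GHz clock and roughly 2 cycles per instruction):

```python
# Rough check of the rotational-latency arithmetic in the post.
rpm = 7200
rps = rpm / 60                        # 120 revolutions per second
ms_per_rev = 1000 / rps               # ~8.33 ms per revolution
avg_latency_ms = ms_per_rev / 2       # ~4.17 ms expected rotational delay
cycles = avg_latency_ms * 1e-3 * 1e9  # cycles burned at 1 GHz: ~4.17 million
instructions = cycles / 2             # ~2 million instructions (assumed 2 cycles each)
print(round(ms_per_rev, 2), round(avg_latency_ms, 2), int(cycles), int(instructions))
```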
 
I didn't intend to make any point about surrogate or natural keys. Gemma has it right. I was simply pointing out that it is candidate keys that are significant rather than primary keys. I think that point doesn't get made often enough.

As regards referential integrity. I tested it with Access 2007 and found that you can create a foreign key referencing a key with a unique index, whether the primary key or not.
 
DocMan, dportas, gemma-

Thank you all for your guidance and insightful advice...

I think I am still a bit vague on candidate keys: not so much on what they are in theoretical terms (I get that it's the lowest number of attributes that uniquely identifies any row in a table), but in getting them set up and functioning, though that should just be more reading and tinkering. As I think I might have asked before, is it better to have city/state/country as a candidate key than to invent an autonumbered PK? From what Doc was saying earlier, but also now with the computational time, it's the latter, as it's easier to chunder through a smaller range of numbers than a larger volume of text. I think this was my quandary earlier: when dealing with sensical values (a compound key of invoice/order/part/customer numbers), making an extra, nonsensical autonumbered PK is probably working backwards.

Attached are most if not all of the relationships that I think I will need, with a slew of synthetic PKs generated (RCID came from the data source; the rest I made). Some bits are either null from source or have been retained but are of little value (created by, updated by + dates...). Grant activity code is unique and taken from the NIH, but Access wouldn't let me make this a PK, so I did it the other way through RCID... Currency isn't quite fixed either, but these at the moment are more niggling details.

I've been tinkering now with various JOINs so that I can make a query using the subtables that I know to be unique to fix or fill in the GRANTS table (tblALZ_XRDS_DATA). Or am I fundamentally missing something here too? If the subtables are right and the relationships are right, can I simply remove the attributes that are 'wrong'/empty in what is left of the GRANTS table (tblALZ_XRDS_DATA)? For instance, I know that investigator names and IDs contain many mistakes and synonyms, so I plowed through by hand and made a unique table, choosing the longest name I could find and then taking any multiple 'PI unique IDs' and placing them in Investigator_NIH_ID1-6 (hence requiring my own nonsensical PK in this instance).

Perhaps I'm just being paranoid (as if there was something the matter with being wrong when learning!), so let me bang around for a couple of days and I'll get back to you next week if I find myself still floundering ... I know that even if I crack this next bit there's bound to be something else around the next bend in the road...

Thank you all once again for your kind patience...

Take care,
christopher
 

Attachments

  • PKs_Relationships.jpg
That middle was a harder read than I thought it would be, but let me toss in my own viewpoint on teasing apart externally generated tables based on spreadsheets.

I have often added extra columns so that I could back-fill them with what I thought were the correct numeric keys. For instance, suppose we have a names column (I would never do it this way in the real world, but let's just suppose...).

We've got John Quincy Smith, J Quincy Smith, John Q Smith, J Q Smith, and Phil Spitalne in the table. So we take the longest one and say it is #1. We suspect that the other Smiths in this table should also be #1, and Phil gets #2. Now what we do is when we go back to extract selectively for the names table, we would pick out the ones we wanted as the "master" names and then every place we think that name might apply, we feed back the index from the table where we are storing the "decided-upon" names. The numbers in the names table are real. The numbers in the table being picked apart are temporary and tentative. So if you are worried about having what you believe to be spurious or extraneous key fields, in your situation it happens. You will eventually get everything filled in and can then do the final population of the tables and then get rid of the "pick-apart" table. It's a process. It happens that way. It's as tedious as answering a three-year-old's questions for an hour or so. (My signature SAYS I'm a grandpa. Trust me, it's true.) But it is part of what you are doing and it is necessary for proper progress.
 
Okie, I've now had a few days to digest all of the input and suggestions from last week and have been trolling the net for analogous examples of trying to JOIN two tables using OR statements so that I can "backfill/correct" entries from a master table imported from Excel; attached are the relationships.

I had been attempting to JOIN tblPI_UNIQUE to tblALZ_XRDS_DATA where tblALZ_XRDS_DATA.lngzInvestigator_NIH_ID = tblPI_UNIQUE.lngzInvestigator_NIH_ID1 OR ..._ID2 OR ..._ID6:

SELECT D.chrPI_Surname, D.chrInvestigator_Name, D.chrPI_FirstName, D.chrPI_MiddleNameInitial,
       U.chrPI_Surname, U.chrInvestigator_Name, U.chrPI_FirstName, U.chrPI_MiddleNameInitial
FROM tblALZ_XRDS_DATA AS D INNER JOIN tblPI_UNIQUE AS U
ON D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID1
OR D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID2
OR D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID3
OR D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID4
OR D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID5
OR D.lngzInvestigator_NIH_ID = U.lngzInvestigator_NIH_ID6;


I have already manually teased out and merged all of the various synonyms for investigator names and placed any redundant "NIH unique IDs" into their own fields. If an "NIH unique ID" equals any of the "NIH unique ID" fields, then I want those rows lined up by surname so that the missing and/or incorrect fields can be reconciled.
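For what it's worth, the intended multi-OR join can be sanity-checked outside Access; here is a sketch in SQLite (via Python) with simplified stand-in tables and made-up data. `IN (...)` is just shorthand for the chain of ORs:

```python
import sqlite3

# Stand-ins for tblALZ_XRDS_DATA and tblPI_UNIQUE, reduced to three
# synonym-ID columns; all names and IDs below are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grants_data (rcid INTEGER PRIMARY KEY,
                              nih_id TEXT, surname TEXT);
    CREATE TABLE pi_unique (pi_id INTEGER PRIMARY KEY, surname TEXT,
                            nih_id1 TEXT, nih_id2 TEXT, nih_id3 TEXT);

    INSERT INTO grants_data VALUES (1, 'A10', 'Smith, J Q');
    INSERT INTO grants_data VALUES (2, 'B22', 'Smith, John');
    INSERT INTO grants_data VALUES (3, 'C33', 'Jones, P');

    -- One standardized investigator holding two synonym IDs.
    INSERT INTO pi_unique VALUES (1, 'Smith, John Quincy', 'A10', 'B22', NULL);
""")

# Line up source rows against the standardized name via any synonym ID.
rows = conn.execute("""
    SELECT d.rcid, d.surname, u.surname
    FROM grants_data AS d
    INNER JOIN pi_unique AS u
       ON d.nih_id IN (u.nih_id1, u.nih_id2, u.nih_id3)
    ORDER BY d.rcid
""").fetchall()
for r in rows:
    print(r)
```

Both Smith rows pair up with the one standardized name; the unmatched Jones row drops out, which is exactly the set you would then backfill.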


I am certain that I am missing something fundamental regarding my SQL syntax, as the above keeps getting spat out by Access. I am not even sure which flavour of SQL Access is closest to; I can't seem to find this either...



I will keep plugging away on this end as I know that once I get this next bit I'll be able to cross-reference and patch up my db and be off to the races again...


Thanks again,
tripptofer
 

Attachments

  • Relationships.jpg
Just a note on the extended "OR" sequence. Eventually you do better to take this approach: where the two different ID fields have been decided to equal the same person, go back, pick the name applied to that person, and replace non-matching names in the source (i.e. pick-apart) table. At some point you lose the OR sequence entirely, because finally the names are unique AND aligned with each other. Like I said before, it is a process, not a single event.
 
