Normalisation, referential integrity

tripptofer · Aug 12, 2009

Hiya-

Apologies in advance, but I seem to have gotten stuck and could use a swift nudge in the right direction.

I had been bludgeoning excel to act as a quasi-database (b/c I was on a mac and it was the only thing I had) and have subsequently acquired a pc and access and so I am trying to slowly transform what I did before into a proper db.

I first mapped everything out on paper so that fields could be normalized properly from the get go, but then realized it was less typing to simply import the spreadsheet as a populated table.

What I would like to do is start with a massive amount of grants data from the NIH and then tease out their unique codes for recipient institutions, investigators and granting agencies. Then, using referential integrity I was hoping to backfill in as many of the holes as possible and correct any typos (deduping).

So the main data table would have fields including GrantID, RecipientID, RecipientOrgName, RecipientOrgCity, state, country; InvestigatorID, IvestigatorName, etc... Initially I can use all of the NIH unique IDs as primary keys, however eventually all will require their own internal PK as foreign data will be added and I can assume that not everyone in the world has been funded by the NIH.

I guess at this stage, I am mostly concerned with structuring everything as best as possible from the get go and keeping my brownian motion efforts to a minimum...

I had imported the main data table and built the accessory lookup tables which will be populated by the main data table however access is balking at me making anything other than non-referential relationships.

Do the supplemental tables need to be first populated? I thought it would simply be a left (?) join and having banged around for a while now on my own I am quietly convinced that a little knowledge is dangerous and I've missed something fundamental.

Any guidance would be appreciated...

Kind regards,
tripptofer

The_Doc_Man · Aug 12, 2009

OK, you are not on a trival path.

First things first, carefully read up on normalization. Access Help covers it lightly. There is a good Database Normalization article at wikipedia.org, and then there is a general search of the web. On the latter search limit yourself to colleges and universities you know that respond from the .EDU domain.

Second, search the forum in the "Design" section for articles on "Entity Identification" so you can decide what your business model needs to track.

Third, use your temporary import table as a source for the tables you eventually identify from steps 1 & 2. Add fields to the import table for each of the things you discover.

Let's say for S&G (the second letter is "giggles") that you identify a column that has some duplicates and it represents a simple attribute. Say, ... Color. So what you do is first design the color lookup table. Then do an append that populates the possible colors with a SELECT DISTINCT into the table, and have an autonumber prime key for that lookup table. Cleanse this a little so that if you have alternate spellings of the same color, you remove them from the import table and erase the lookup table contents and try again. Eventually you will have unique colors and a unique number to go with each. It almost doesn't matter what that number might be. Finally, go back and use an UPDATE query to store the color number in a new column where you join on the color names to be sure everything links up.

What does that get you? A column in the import table that points to a record in the color table, that's what.

Now do this for everything that can be looked up. Eventually you will have rows in the spreadsheet where instead of storing the text, you store the associated codes. Just remember that you cannot store anything this way until all spelling irregularities are resolved. You can then use append queries to create the main table, and you've already built the lookup tables with each SELECT DISTINCT that you had to use.

Hope that was clear enough.

tripptofer · Aug 13, 2009

Doc-

This was brilliant and has set me down the right road... I spent quite a bit of time re-doing my table charts trying to push the normalization to the fullest from the outset.

I then started reconciling name typos and this is when I realised that many investigators have many different names, (which I thought I could ignore) and then realised that they also then often have multiple "unique ID numbers", which I now know I cannot ignore. I now need to make some identity thievery/witness protection tables to retain and moreover benchmark the schizophrenic nature of the data I've taken on.

I'm beginning to now understand why no one has done this before...

Cheers again for your help!
-tripp

tripptofer · Aug 15, 2009

Doc-

Just noticed you were in NO- I was just there! (Literally- like on mon). Sad irony I guess- figured you were in the uk (hence the co.uk), where I spent most of the last 20 yrs...

Anywho, I did as you said and it worked-ish. I manhandled the data back out of access into excel because I couldn't figure out how in access to fish out fist, last names, etc... It turns out that the NIH has up to 6 (!) unique identifiers for their investigators and have now normalized all grants to a single PI name.

So I'm thinking I can set up a temp primary key by name now so that their NIH 'unique identifier' will get folded in and then can go back and set an internal PK? However, considering I seem to be only able to get access to do what I want through: Tools -> Analyse => tables, methinks that I'm trying to sprint before I can stand.

I've gone over my sheets/tables and "think" that I have them to 5NF (still no sure about keys on all levels), but I guess I really don't understand how to drive access well enough to get from thought to reality.

I guess this is the problem of where tires meet the tarmac!

Could use a swift nudge in a constructive direction and sorry I can't stand a round a Harry's Corner...

Take care and thanks again,
tripp

Galaxiom · Aug 15, 2009

Sounds like you have made massive progress in a difficult project. Understanding normalisation is crucial.

I would suggest your next step is to post an image of your current table relationships. Use print screen.

tripptofer · Aug 15, 2009

GalaxiomAtHome said:
Sounds like you have made massive progress in a difficult project. Understanding normalisation is crucial.

I would suggest your next step is to post an image of your current table relationships. Use print screen.

I'm a bit new to pc's- wound up getting one only b/c access doesn't seem to exist on macs and it seemed an easier path than learning sql from the ground up. In fact an academic chap I ran across said he always prototypes his dbs in access before porting them so i did it this way.

Anywho, there seems to be a bug with the new pavilions (6830s) that the screen capture (print screen) doesn't work (found this on another forum) for some reason so if there's a way around this (patch, sys config tweak or 3rd party screen grab software), I'd be obliged...

I've now normalized all of the names so that I can retain the multiple 'unique identifiers' that govt so charitably handed out like smarties and now need to backfill in all of the institution holes using the institution unique id's, city, state, country...

Are there any open source lookup tables for city-state? Even if just for the US...?

So many questions... just trying to take things slow and get it right...

Thanks for your kind help,
tripp

Galaxiom · Aug 16, 2009

ScreenGrab is a simple efficient freeware printscreen utility. It and other great free tools by the same developer on this page. TextRep and SetFileDate are worth downloading too.
http://no-nonsense-software.com/freeware/

SnagIt is without a doubt the best screen capture program of all time. It has a free trial but be warned. You will quickly get hooked on what it can do.
http://www.techsmith.com/screen-capture.asp

tripptofer · Aug 16, 2009

Cheers for those links- not sure why a brand spanking new laptop with a dedicated button for screen dumps doesn't work, but there you go...

Attached is a .doc of my tables. At present I've gone through the entire Alz dataset from the NIH and reconciled all of the typos in the principle investigator (PI) names. I then tried to 'reverse engineer' their 'unique PI codes', but then rapidly realised that unique was less that such.

I have now gone back through and standardized all synonymous names by hand (all +23K rows, er, tuples!) in the hopes of being able to use the now unique PI name as an initial key (FK?) so that multiple "unique IDs" from the NIH are retained. However, when I gently ask access to do this for me (as diagrammed in the .doc file) access adds a PK and keeps PIs with multiple IDs from the NIH in multiple rows- how do I ask it to kindly place this extra bit of information in a new column?

I also now have an open source table of US cities/states/zips that I would like to bolt in so that IF country=US and State=X, then a list of choices for city is provided with the eventual idea of then adding university/ institute/etc, but not wanting to reinvent the wheel, this could be populated from the original data table and typos or mistakes could be corrected through these lookups.

I know I'm getting way ahead of myself, but I do enjoy the verticle part of the learning curve...

Kind regards,
tripp

tripptofer · Aug 18, 2009

Okay, I've made significant progress thanks to a kind academic.

It turns out that you cannot simply make a table object the way I thought you could- it requires running a make-table-query, which in hindsight makes sense (running an sql script). One of those niggling details that is so small and yet utterly fundamental.

So now I've taken my master table (import from an excel sheet) and torn out all of the ancillary tables regarding PI, grantor and recipient org, city/state/country, etc... However, even having given each its own PK when I go into, say, the recipients' table populated only with DISTINCT data where referential integrity has been imposed, if I try to backfill in the many gaps these changes DO NOT seem to cascade back into the master table and I fear that as with the using query to make a table (instead of just making the object) I am missing yet another fundamental bit.

Any swift nudges in the appropriate direction would be kindly accepted.

Cheers,
tripp

tripptofer · Aug 18, 2009

Oops! Grabbed the wrong screen by mistook.

Sorry 'bout that...

Normalisation, referential integrity

tripptofer

Registered User.

The_Doc_Man

Immoderate Moderator

tripptofer

Registered User.

tripptofer

Registered User.

Galaxiom

Super Moderator

tripptofer

Registered User.

Galaxiom

Super Moderator

tripptofer

Registered User.

Attachments

tripptofer

Registered User.

Attachments

tripptofer

Registered User.

Attachments

Similar threads

Users who are viewing this thread