Making Normalization seem normal

visacrum · Oct 16, 2008

Help making Normalization seem normal?

I'm really not trying to ask noob questions, I promise I'm trying, but there are just some things I can't get my head around.

I've finally got an opportunity to make a database the way I want. The boss who demanded messy database structure is gone, the new bosses are sort of looking to me to clean house, and I actually have some time to learn now.

Our company hosts 4 different kinds of meetings. Large Forums, Regional, Beta, and In-House. We host surgeons, sales reps, and in-house personnel at various locations.

I've currently got a compiled Master table. It is rife with duplication. If doctor A went to meeting 1 and meeting 3 and meeting 4, then doctor A is in that database 3 times, often with a different address, or different rep information (we list which rep services a given doctor).

So I'm trying to normalize all of this. I figure the tables I'll need (and the fields, not including the ID field) are:

People (name, degree, address, phone, email)
Events (city, state, start date, end date)
Properties (i.e. hotels. name, city, state, address, maybe phone)
Reps (name, rank, address, phone, email)
Territory (for the Reps - Territory number, territory name, organization name)
Hospitals (name, address, city, state, phone - this is for affiliations, not for actual practice)

I think I got all that nailed down.

People to Events would be many to many, so I'd need a junction table there.

People to Hospitals would be many to many (a doc may be affiliated with several hospitals, and of course a hospital can have billions of docs)

Properties to Events would be many to many (in case we needed to use more than one property for an event, which we've never done, but I'd rather have it and not need it) so another junction table.

People to Reps would be one to many (a doctor may be covered by more than one rep)

Reps to territory would be one to one.

I'm pretty sure I got all that down. But I start running into questions.

For People, some of the doctors work at (not just 'are affiliated with') a hospital, some just out of an address. If I understand Normal forms, I can't be 1NF if I have a blank practice name, but I can't see that slapping 'Private Practice' is going to do me any sort of good, especially when it comes to merging documents. Also, it's impossible to be sure the information we're getting is not just a home address (despite our asking for a practice address). I start thinking I should change the last table from 'Hospitals' to just 'Addresses', but I'm still left with the null field possibility.

And I have a possibility of duplication. A rep may cover a given doctor, AND may be attending a meeting. So I could conceivably have a rep listed under People for having attended a meeting, and under Reps for simply being a rep (and in that case, I'd have nothing to link to under reps, unless I link it to themselves. Would that look hokey?).

I'm still weeding out thousands of duplicates to create my 'Peoples' table, but trying to smooth the way for the rest of the database development. I'm learning lots, but I can't find anything to explain the above. Am I on the right track? Do I need to structure this differently?

The_Doc_Man · Oct 16, 2008

OK, let's try to see how you approach this. Since you are gainfully using the word "normalization" I'm going to assume you know at least something about it.

* People (name, degree, address, phone, email)
* Events (city, state, start date, end date)
* Properties (i.e. hotels. name, city, state, address, maybe phone)
* Reps (name, rank, address, phone, email)
* Territory (for the Reps - Territory number, territory name, organization name)
* Hospitals (name, address, city, state, phone - this is for affiliations, not for actual practice)

You have made a start at table design, but it leads me to specific questions designed to draw out answers OR illustrate murky thinking.

You've got persons and reps. I take it that you believe reps to be sub-human? (I've probably met some of them who fell into that category...)

Address is repeated in several tables. Is there a particular reason you didn't split out an address list and just point to it from whatever tables need it?

You are building a representational model to track a real-world event. So the idea is to find everything that is unique in function, bring that together in its own table, and build relationships between that table and other tables. You are doing object (or entity) identification at the moment. Here is what you need to know.

1. If two entities serve exactly the same function, they belong together. Obviously, people who attend events have exactly the same function. Do reps ever attend events in the same purpose? (EVER?) Because if they do, you might wish to consider that reps are people, too.

2. Splitting out common attributes actually makes things easy. Hospitals have addresses. Doctors who have offices in the hospitals might have the same addresses. One table fits all, I think. It doesn't matter that several tables each have pointers to a given address entry. You might wish to separate out whether there is a different suite or office number, though. Ditto phone number.

3. The real goal of normalization is to make your tables have this property: Every field in the table depends exactly and only on the prime key (PK) of that table. Stated another way, no element of a table is allowed to depend on anything else than its PK. Therefore, a date in a person table must represent something unique about that person, like "birthday" - otherwise it belongs in another table.

Here's the logic for that statement. If a person has two or more related dates, then you get back multiple dates when you query that person's ID as a PK. But that means the ID should not have originally been a candidate for PK because of uniqueness issues. You need the PK to really be prime, because otherwise using it will only lead to headaches and confusion and funky Cartesian JOINs. I.e. too many records come back because Access does a combinatorial SET THEORY join in that case. Ugly. Big. Often intractable.

So when you run across, say, an attendable event, IT might have single date or start/end dates in it. It would have a unique ID as well.

Then the junction table that vexes you might look like:

EventAttendees:
PersID (FK to person) - an attendee
EventID (FK to event) - an event attended by the attendee.

Want the dates? That's in the event table.

So you have an address for the event? FK to the address table.

You have a property? FK to the address table. Junction table lists IDs of properties used for a given event.

What about the person? FK to the address table for residence, FK to the address table (AGAIN) for business address. Want to know about events? Junction table can contain person ID, event ID, and hotel/motel property ID. IF the person arrived late or left early (or you allow and care to track that fact), add arrive/leave dates to the JUNCTION table since it is specfic to one person's attendance to a given event and stay at a given property. (Also allows you to track if the person stayed at property 1 on the first day and property 2 on the second day, using two entries in this junction table.)

This is how you should be analyzing for entities to build your model. This kind of thinking is required. (I could be wrong about the details of what you are doing or tracking, but the guided tour will perhaps give you ideas anyway.)

wazz · Oct 16, 2008

a few more junction tables and many-side tables might solve several concerns.

- it is very common to separate phones, addresses and email addresses from contacts into their own tables (contact may have many phones, addresses and email addresses).
- keep the hospital table and be sure to include data that is specific to the hospital. you will probably need it. the hospital will have an address. you can look for matches when you recieve address from people, if necessary.
- people to reps could end up being N:N
- reps could be in the people/contact table and joined 1:1 with a repdetails table to isolate specialized rep-fields (rank) and eliminate null fields. if a person does not have any rep details, they are not a rep.

occasional null fields are practically inevitable.

visacrum · Oct 16, 2008

The_Doc_Man said:
OK, let's try to see how you approach this. Since you are gainfully using the word "normalization" I'm going to assume you know at least something about it.

You have made a start at table design, but it leads me to specific questions designed to draw out answers OR illustrate murky thinking.

You've got persons and reps. I take it that you believe reps to be sub-human? (I've probably met some of them who fell into that category...)

Address is repeated in several tables. Is there a particular reason you didn't split out an address list and just point to it from whatever tables need it?

You are building a representational model to track a real-world event. So the idea is to find everything that is unique in function, bring that together in its own table, and build relationships between that table and other tables. You are doing object (or entity) identification at the moment. Here is what you need to know.

1. If two entities serve exactly the same function, they belong together. Obviously, people who attend events have exactly the same function. Do reps ever attend events in the same purpose? (EVER?) Because if they do, you might wish to consider that reps are people, too.

2. Splitting out common attributes actually makes things easy. Hospitals have addresses. Doctors who have offices in the hospitals might have the same addresses. One table fits all, I think. It doesn't matter that several tables each have pointers to a given address entry. You might wish to separate out whether there is a different suite or office number, though. Ditto phone number.

3. The real goal of normalization is to make your tables have this property: Every field in the table depends exactly and only on the prime key (PK) of that table. Stated another way, no element of a table is allowed to depend on anything else than its PK. Therefore, a date in a person table must represent something unique about that person, like "birthday" - otherwise it belongs in another table.

Here's the logic for that statement. If a person has two or more related dates, then you get back multiple dates when you query that person's ID as a PK. But that means the ID should not have originally been a candidate for PK because of uniqueness issues. You need the PK to really be prime, because otherwise using it will only lead to headaches and confusion and funky Cartesian JOINs. I.e. too many records come back because Access does a combinatorial SET THEORY join in that case. Ugly. Big. Often intractable.

So when you run across, say, an attendable event, IT might have single date or start/end dates in it. It would have a unique ID as well.

Then the junction table that vexes you might look like:

EventAttendees:
PersID (FK to person) - an attendee
EventID (FK to event) - an event attended by the attendee.

Want the dates? That's in the event table.

So you have an address for the event? FK to the address table.

You have a property? FK to the address table. Junction table lists IDs of properties used for a given event.

What about the person? FK to the address table for residence, FK to the address table (AGAIN) for business address. Want to know about events? Junction table can contain person ID, event ID, and hotel/motel property ID. IF the person arrived late or left early (or you allow and care to track that fact), add arrive/leave dates to the JUNCTION table since it is specfic to one person's attendance to a given event and stay at a given property. (Also allows you to track if the person stayed at property 1 on the first day and property 2 on the second day, using two entries in this junction table.)

This is how you should be analyzing for entities to build your model. This kind of thinking is required. (I could be wrong about the details of what you are doing or tracking, but the guided tour will perhaps give you ideas anyway.)

Nono, reps aren't subhuman, it's just that they're filling possibly 2 roles, one of being a warm body at the meeting, and one of being the rep who covers the doc. A rep may cover a doc and not go, or cover a rep and go. I just figured if John Smith was a rep for Dr. Jones, then John Smith being in both the attendees table and in the sales rep table would be a duplication anomaly. So reps are people too, but with respect to the doctor, they are like a company or an affiliation. But I don't think they should be in 2 places at once. I could go with reps as people and just have the rep identifier blank (since they ARE the rep, who happens to be going to the event), but I don't want to get the beatdown for null fields (which wazz said are occasionally unavoidable, but then I don't get the special blue ribbon for being in 6NF.). So I'm trying to figure out what to do and stay legal there.

I'm kind of getting lost on #3. I think what it means is that I'd need to have tables:

Hospital (Name, FK to addresses)
Venue (Name, FK to addresses)
Even the Reps and Attendees (Formerly known as 'People', changing it for semantics) would FK to addresses. Basically Addresses is getting beaten like a rented mule by a bunch of tables.
Event (Name of the event, start date, end date)

Junction table EventAttendees (AttID, EventID, VenueID, and more below). So EventAttendees would daisychain, for example, Venue and the Venue's address by looking at the FK under Venue for the Address? I think I might have a handle on that. The Junction table would be the table that has, for example, the hotel room type a given attendee wants, the billing for the room. It ties one way to the attendee, the other way to the event, and therefore you'd see Dr. Jones went to a Large Course on March 22-23, 2008, but had a King Non-Smoking room with a check-in date of March 21, check-out of March 24, and the company paid for the room, and the only fields that aren't FK's are room type, smoking/ non, check-in, check-out, room billing.

Crap I just remembered food allergies. That's like a whole other junction. I just added it to the sketch pad.

I think I might have it, I'm still a little fuzzy on the reps as Reps and the reps as Attendees issue. I saw what wazz said, but then the repdetails table would be nameless. That's where I get a little lost.

The_Doc_Man · Oct 17, 2008

Getting lost in # 3 is not uncommon. It is a sticking point in assuring correct normalization. It is related to the "no repeating groups" rule... but really isn't only that rule.

In a spreadsheet (let's say,...) you would list Date, Doctor ID, Event, Property, and a bunch of other stuff. When you searched the SPREADSHEET for a Doctor ID, you would probably get many rows for one doctor. BECAUSE in that application, the DoctorID doesn't provide uniqueness. So in a table listing DoctorID and EventDateAttended, DoctorID by itself cannot be the PK. Because the PropertyID could be different for different events, thus causing a member of the row to not depend solely on DoctorID. The combination of Date/Doctor might be a valid compound PK, but not the DoctorID itself. Now, since this is a thorough spreadsheet, let's include the Doctor's telephone contact information. But now THAT information doesn't depend on the date of an event (that much). It is a doctor thing, not an event thing. If the PK was a compound key of DoctorID and EventID, the contact info doesn't fit.

This is what # 3 is all about. It allows you to recognize when a given table contains things that don't belong together.

"Reps sub-human..." - just a joke brought about by the fact that according to your original table list, Reps weren't People. Sometimes my mind malfunctions that way, finding the oddball humor implicit in a strange description.

"Address table getting beaten like a pack mule." Yeah, that happens in a commonly used table for an attribute that is popular. But you are correct to note that this table is used very heavily. AND there is an issue discussed in another thread. I was avoiding this but there are reasons why you MIGHT not wish to unify the address list. The problem won't be in the address lookup for a person, rep, hospital, etc... it will be in the REVERSE lookup of "who belongs to this address" - because of there being multiple possible tables feeding this one. So you might need a table code or form a junction table with a table code to show who/what is at a given address.

A word of warning about database design: Sometimes to be excruciatingly precise you find that you have generated a can of swirly spaghetti with all the implied and explicit linkages you need to establish. But then,... the real world is that way too.

gemma-the-husky · Oct 17, 2008

I hope this is germane

on the other hand i tend to take a different view to a lot of people, concerning addresses

if i have a number of different entities all with addresses eg , doctors, patients, reps (or customers/suppliers) i do not normally have a general purpose address table, i generally find there is little overlap, and even if there is a little duplication, it doesnt really matter. I believe the address is an attribute of the doctor, or patient or customer etc, and belongs there.

I think perhaps if you have a very substantial database. with hundreds of thousands of records, then perhaps this would justify using a postcode/zipcode picker to find the address, but for modest databases, I really dont think its so important.

Now is an approach that stores the same address in two different tables, because one table is an employee, and another table is a customer still normalised - yes i think it is.

But there's the rub - there are circumstances in which multiple approaches to table design/analysis can be taken, without making your data unnormalised - so what you need is to get a normalised structure according your enterprise rules.

i think the people/rep discussion is a case in point. Generally, i would have thought reps were (a subset of) people (ie reps are people, but people are not necessarily reps) - but if you have a situation where a rep is generally NOT also a people, then it may be better to have two different tables, as you may find it easier to manage that way. so thats up to you as system designer, and I dont think taking a considered decision to have two separate tables could be criticised.

The_Doc_Man · Oct 17, 2008

gemma is absolutely right, of course.

Function directs design, not the other way around.

visacrum · Jun 17, 2009

I realize after I had had the nervous breakdown and almost got fired that I never replied to this thread. Thank you both for your help, in the end you did help me understand all of this, and my database has a sane, workably normalized (I don't have separate tables for cities and states) structure. Instead of people, I have Surgeons, Reps, and RVP's (those are the rep's bosses). Turned out it didn't matter if the Reps attend or not, we just want to know what surgeons attended, and who their reps are, so I simply jxt'd them the same way I did so for their facilities.

Anyway, long story short, it's all set up and I've almost completed all the elements (forms nearly done, reports getting started). But I didn't want to seem ungrateful, if only tardy in my gratitude

The_Doc_Man · Jun 17, 2009

The fact that you replied at all with your success story makes it worth while.

Good luck with the maintenance thereof.

RossWindows · Jul 1, 2009

Just happened to browse through this thread and...
1) Replying to an older thread to tell everyone the overall outcome of the thread is something I really apreciate too.
2) This thread IS very rich in information and has given me some insight that I've never considered. I've been developing databases for almost two years and I learn something new EVERY DAY. I don't consider my work a "job", It's a practice.
Good luck with everything!

tonypony · Jul 30, 2009

RossWindows said:
Just happened to browse through this thread and...
1) Replying to an older thread to tell everyone the overall outcome of the thread is something I really apreciate too.
2) This thread IS very rich in information and has given me some insight that I've never considered. I've been developing databases for almost two years and I learn something new EVERY DAY. I don't consider my work a "job", It's a practice.
Good luck with everything!

likewise.

I am very new to this Forum and have already learned quite a bit by reading over other peoples' problems. I find that along the way you always pick up a similar problem to someone else.

Making Normalization seem normal

visacrum

Registered User.

The_Doc_Man

Immoderate Moderator

wazz

Super Moderator

visacrum

Registered User.

The_Doc_Man

Immoderate Moderator

gemma-the-husky

Super Moderator

The_Doc_Man

Immoderate Moderator

visacrum

Registered User.

The_Doc_Man

Immoderate Moderator

RossWindows

Que?

tonypony

New member

Similar threads

Users who are viewing this thread