Need expert advise on normalization

Montemontgo

Registered User.
Local time
Today, 07:02
Joined
Oct 25, 2004
Messages
26
The question is whether or not to normalize the fields "Place of Birth/Place of Death" in my 100'000 records famous people data base. So far, 15'000 different birth places are deployed. Only 3'000 of them are used more then once, that is to say from 2 to a maximum of 1000 times (New York City). 12'000 birth places are therefor used only once up to now. So, a table of 15'000 places would need a 3-digit (all characters used) or even a 5-digit (numbers only used) identification. In the latter case the ident would often be longer than the returned value (Wien, Oslo, Rom, it's in german, you know). Furthermore, instead of just entering a birth place like "Novodny Chomarowsky" I had to search the table of places whether or not the entry has been used yet. Please supply strong pro-normalization-arguments.
 
I would suggest that Place of Birth and Place of Death are attributes of the person because in both cases there can be only 1 value per person,

Therefore these attributes become part of the Person entity

Len
 
normalizing the db is a given. if you don't, you will have trouble with your statistics.

the question of an ID is up to you. dealing with the length is somewhat of a moot point but i wouldn't worry about it. it has been discussed here and you can search for discussion.

what you want to avoid having are errors such as New York and New Yrk - data that mean the same thing but look different.
 
Agree with wazz. Not only is normalisation a given it is a fundamental requirement.

Failure to do so will not only give trouble with statistics but your application could be accused of lacking data integrity.

Any database without data integrity is a failure.

L
 
Thanks lads. Very well, so I normalize. One last idea: While it is quite easy to enter say Dubrovnik in the field birth place, it is very boring looking up a giant 15'000-record-table to find that the ident for Dubrovnik is say 7086. How about having the ident for Dubrovnik Dubrovnik? That is to say the ident equals the returnd value. So for New York I type New York and so forth. Data integrity would be maintained and I could change the output value from New York to New York City at any time. Yes? No? Yerghhkkk...?
 
Question is are you having a separate table for birth place. If so what is to be the primary key. Dubrovnik will appear once and only once so why have another attribute in the table ?.

L
 
I'm not a DB guru, quite the opposite, and I get terribly confused about the use of normalisation. Nothing in this thread makes me see a reason for splitting off the place of birth. Why is it more likely that New Yrk will be entered instead of New York than 7068 instead of 7086, and its a damn sight easier to spot and put right, Also to change New York to New York City just requires a simple update query, doesn't it?

Brian
 
Absolutely, I agree....however from an operational point of view having a separate table would allow you to set up a combo box to select place of birth/death and totally control available entries with the down side that it must exist in the control table first,

Alternative is to have a combo box as selection whose record source is the data from existing entries with the limit to list set to No thus allowing new entries. Downside is that users may not be diligent in their checking if what they want is there already and selecting from the list and instead just typing it in maybe spelling it wrong in the process.

I would agree that Normalisation wise a separate table is not necessary, only maybe if you wanted to add some grouping by continent say where then the location would be associated with a continent.

Generally I have taken a view as to how hard I want to enforce a "standard". If I really want to nail it down then I use a control table otehrwise a combo as described above.

I would agree with all that say Normalisation is a fundamental requirement, no arguments there at all but sometimes control is paramount.

len B
 
Place of Birth and Place of Death in the same table. A new table for Locations which you would use as the foreign key to the two aforementioned fields.
 
Using lookup box with a choice of 20,000 options could be fun :)

Maybe a country field to act as a filter might help?

Peter
 
Bat17 said:
Using lookup box with a choice of 20,000 options could be fun :)

Maybe a country field to act as a filter might help?

Peter


Combo box would attempt to match as you typed in the beginning of the name so that is not really a problem. The objective is data integrity so that you do not get USA, U S A, U.S.A., US o A and all the other possible variations that people dream up

len
 
I'll toss in a thought.

For a table that size, having a LONG as the key to a lookup table means your main table is shorter even though you have extra data (namely, the PK of the lookup table, the FK in the main table). A LONG for an FK or PK takes 4 bytes. Even with the extra info in the lookup table, the first time you re-use a location you saved the ENTIRE LENGTH OF THE CITY NAME. In other words, your return on investment breaks even at TWO references. Can't beat that with a gnarly branch, man!

Last time I counted, "New York, New York, USA" took more than 4 bytes. Like, 8 bytes city, 8 bytes state, 3 bytes country. And then let's not forget even longer names like Nachitoches, Louisiana. Or Truth Or Consequences, New Mexico. Or Carmel-by-the-Sea, California.

Load the names, your person table gets huge. Use a lookup, your main table record doesn't change size and the size of the lookup field doesn't really matter. Further, a combo box created with the wizard will automagically store the right thing and show the right thing for you if you do it right.

The longer the record, the fewer you can fit in the buffer that is used for searching and other manipulations. Besides, as has been pointed out, a combo box with auto-lookup makes the size of the lookup table less of an issue.

Don't forget when doing statistical things that you can have darn near anything you want in the lookup table for locations. If you make a query that joins your people table to the location table, you can do your stats on the query as a basis. Anything you can do with a table you can USUALLY do with a query. Like do summations and counts and averages and such.
 
Hi guys
can I delete my 1st post on this thread, #7 , as you have all managed to deconfuse me. I am convinced of the need to split off the locations and how to do it. I hope Montemontgo is as happy :D

Brian
 
Just click on Edit, select delete post, and hit the Delete button. ;)
 
No one so far has mentioned the hierarchial nature of location data. I would create three tables and use cascading combos to minimize the dropdown lists - Country - Province/State - City/Town. It is only necessary to store the value from the final combo since it will be unique and allow you to join back to its parent table to obtain the full location.
 
Very well, thanks to all of you. So, after having read all your comments and all the links on the subject I come to the conclusion that in my 100'000-records famous people data base I shall normalize every single data field except the automatic counter. Let's see:

If you urge me to normalize the birth/death location where in the end about 25'000 values will be needed then I certainly should normalize:

christian name -> 11'000 values needed (1150 times "John" so far)
surname -> 51'000 values needed (160 persons named "Smith" so far)

... and so on, and so on, but most of all:

day of birth/death -> a set of 31 values will suffice
month of birth/death -> a small table of 12 values will do
year of birth/death -> a set of only about 1520 values needed
(pharao cheops born 2800 BC to Jack Kilby died 2005 [yesterday?])

... am I on the right path?
 
Don't get carried away. There is normalized and there is impossible to use. This type of normalization is frequently done with city/state/zip/country because of the analysis requirements and the ability to create reasonably small combos to select from. It is usually pretty easy to obtain a complete set of data at the outset so you can use the limit to list property and not allow users to enter anything not in the list. The stuff you're talking about is not usually done except in data warehouses. Keep in mind all the index maintainence that has to go on when foreign keys are added/changed.

If you start doing this for first and last names for example, you'll need to add code to the combo's not in list event to pull up a form to enter the "new" value since you won't be able to obtain a comprehensive set of data to begin with.
 

Users who are viewing this thread

Back
Top Bottom