Normalisation Help (1 Viewer)

ACM · Feb 26, 2013

Right, I've been through countless tutorials on how to normalise a database. It's something I have successfully done in the past, but quite a long time ago so I've forgotten a lot of it. Here is my UNF (normalised form) of a soccer match database:

UNF

Season
Tier
Division
Fixture
Result
Date
Status
Time
StadiumName
Attendance
RefereeName
First Half Goals
Second Half Goals

I'm struggling with 1NF. I'm having difficulty understanding the meaning of 'repeating data'. The tutorials I've been through say you need to separate the repeating and non-repeating data. Okay, but I can look at all of the above and say it's repeating, because... all of the data under each of those headings repeats somewhere. I'm confused about it.

I know that things from the above would have to be broken down during the normalisation process, such as 'Fixture', 'Result', 'RefereeName' and 'First Half Goals' & 'Second Half Goals', but I would appreciate any help for getting me started on this.

Okay, so for my attempt at 1NF, I've addressed the issue of having multiple values in each field by breaking down some of the headings. Problem is, the table split up into two (meant to be repeating and non-repeating data). However, I'm sure I've done this wrong. In the examples I've seen, they conveniently split up the table because they have appropriate data to fit into two tables. Looking at my set, I don't know how to split it into two without splitting up what is necessary in each table.

1NF

Season
Division
Status
Date
Time
StadiumName
Attendance

HomeTeam
AwayTeam
HomeGoals
AwayGoals
RefereeFirstName
RefereeLastName
HomeFirstHalfGoals
AwayFirstHalfGoals
HomeSecondHalfGoals
AwaySecondHalfGoals

Thanks.

GanzPopp · Feb 26, 2013

You are talking about data redundancy here. It is always recommendable to keep redundancy as little as possible, i.e. eliminate repeating data by referencing a separate look-up table.

For example, in your main table the Season will have repeating data and to store the full season properties in the Season field for each row would generate a lot of data. Instead a clever designer would define all the different seasons in a separate table called tblSeasons, and refer these seasons in the main table using an ID which is unique for each season. Therefore in the season table an ID autonumber field is required.

There are several advantages using this approach:
-Data storage is limited to a minimum
-Changing a season property will change it for every record in your main table referring to that season
-Your database will be much faster
-Your database can use relationships

You can do the same for Tier, Status and several other fields.

ACM · Feb 26, 2013

GanzPopp said:
You are talking about data redundancy here. It is always recommendable to keep redundancy as little as possible, i.e. eliminate repeating data by referencing a separate look-up table.

For example, in your main table the Season will have repeating data and to store the full season properties in the Season field for each row would generate a lot of data. Instead a clever designer would define all the different seasons in a separate table called tblSeasons, and refer these seasons in the main table using an ID which is unique for each season. Therefore in the season table an ID autonumber field is required.

There are several advantages using this approach:
-Data storage is limited to a minimum
-Changing a season property will change it for every record in your main table referring to that season
-Your database will be much faster
-Your database can use relationships

You can do the same for Tier, Status and several other fields.

Okay, I get the advantages of doing it, but I'm still struggling with what is repeating and what is non-repeating data. Unless I'm missing something blatant, isn't it all repeating data because everything under each of those columns (I've listed in my last post) will repeat.

I need to know what my 1NF is going to look like to start with — because despite going through various tutorials, I'm still struggling to apply the logic to my own set of data.

Please critique the below attempt:

UNF

Season
Tier
Division
Fixture
Result
Date
Status
Time
StadiumName
Attendance
RefereeName
First Half Goals
Second Half Goals

1NF

Season
Division
Status
Date
Time
StadiumName
Attendance

HomeTeam
AwayTeam
HomeGoals
AwayGoals
RefereeFirstName
RefereeLastName
HomeFirstHalfGoals
AwayFirstHalfGoals
HomeSecondHalfGoals
AwaySecondHalfGoals

GanzPopp · Feb 26, 2013

Check http://en.wikipedia.org/wiki/First_normal_form for a quick example (under 'Examples').

I would do something like this in 1NF setup:
SeasonID
DivisionID
StatusID
Date
Time
StadiumID
Attendance
HomeTeamID
AwayTeamID
HomeGoals
AwayGoals
RefereeID

All fields where I added 'ID' to the field name can be made non-repeatable by defining the unique names in a separate table and then referring them by a unique ID. Obviously, the referee of the match is going to ref other games, so that field is repeatable and should be separately stored.

Im not sure about these fields, can you elaborate on what they mean?
HomeFirstHalfGoals
AwayFirstHalfGoals
HomeSecondHalfGoals
AwaySecondHalfGoals

ACM · Feb 26, 2013

GanzPopp said:
Check http://en.wikipedia.org/wiki/First_normal_form for a quick example (under 'Examples').

I would do something like this in 1NF setup:
SeasonID
DivisionID
StatusID
Date
Time
StadiumID
Attendance
HomeTeamID
AwayTeamID
HomeGoals
AwayGoals
RefereeID

All fields where I added 'ID' to the field name can be made non-repeatable by defining the unique names in a separate table and then referring them by a unique ID. Obviously, the referee of the match is going to ref other games, so that field is repeatable and should be separately stored.

Im not sure about these fields, can you elaborate on what they mean?
HomeFirstHalfGoals
AwayFirstHalfGoals
HomeSecondHalfGoals
AwaySecondHalfGoals

Thanks for the 1NF structure. So are all the ones where you've added ID on the end non-repeatable? And all the ones which don't have an ID are repeatable?

Don't quite get what you mean about the referee. Surely it still is non-repeatable as it's coming from a referee table containing each ref?

With regards to those fields, a soccer match is 45 minutes each half. So these fields state the number of goals each team has scored in the first half and second half of the match (separately).

GanzPopp · Feb 26, 2013

You are getting it now!

It says 'made non-repeatable', but I guess I should have been more clear. At least the way you say it in your last post you are right, so stick to those ideas! Indeed the referee will be referenced in another table and therefore is repeatable. The date, time and final score are unique for each game and there for non-repeatable.

Regarding the half time scores: if you store these values separately for each half, then you don't need to store the scores for a whole match, since these can be derived from the two half time scores by adding them up. As a general rule of thumb, if a value in a field is derivable from on or more other fields then you don't want to store it.

ACM · Feb 26, 2013

GanzPopp said:
You are getting it now!

It says 'made non-repeatable', but I guess I should have been more clear. At least the way you say it in your last post you are right, so stick to those ideas! Indeed the referee will be referenced in another table and therefore is repeatable. The date, time and final score are unique for each game and there for non-repeatable.

Regarding the half time scores: if you store these values separately for each half, then you don't need to store the scores for a whole match, since these can be derived from the two half time scores by adding them up. As a general rule of thumb, if a value in a field is derivable from on or more other fields then you don't want to store it.

Sorry, I'm getting confused. In my last post I said the ones with an ID are NON-REPEATABLE. Although you've just said that: "The date, time and final score are unique for each game and there for non-repeatable."

So the ones WITHOUT an ID (date, time and score) are actually NON-REPEATABLE and the ones WITH an ID are REPEATABLE?

Please clear that up...

GanzPopp · Feb 26, 2013

Yes, you are exactly right! I will elaborate a bit, so you can decide for yourself in the future:

If specific data in a specific field is very likely to be present in multiple rows, then it is repeatable. Then you can consider to create a look-up table with unique IDs for use as a reference in the specific field in the main table. This is the case for all field where I added ID in the field name. I do this because then I know the values in this field are references.

Is the value in a specific field of the main table always unique then it is non-repeatable and you should store that value in the main table. Therefore these fields don't have ID added in the name, since the actual value is stored.

I hope this is clear now, if not feel free to ask again.

ButtonMoon · Feb 26, 2013

Hi ACM,

Normalization is about accurate representation of dependencies (functional and join dependencies) and solving certain problems that result from having undesirable "non-key" dependencies in tables. In a nutshell, normalization is a formal set of principles for analysing and eliminating undesirable dependencies from database designs.

When you talk about eliminating "repeating" data it's possible you may be conflating at least three very different issues, which is something that GanzPopp hasn't exactly addressed for you, so I'll try to do so.

Firstly there is the concept of repeating groups. Eliminating repeating groups was one of the motivations of 1st Normal Form (1NF) as originally described by E.F.Codd (inventor of the relational model). You can safely ignore the problem of repeating groups as it was something very specific to the legacy CODASYL type of database and is irrelevant in most modern DBMSs. A repeating group is where a table contains variable length array of values, which may or may not repeat. Most modern DBMSs (those based on the SQL model anyway) don't have any such feature - groups of values are never "repeated" in a table in the sense that 1NF is concerned with. In a SQL DBMS, columns can usually contain only single values. The concept of repeating groups is slightly muddied however because Codd later modified his definition of 1NF to say that tables should not contain nested relations (a relation being a real or virtual "table", i.e. the data structure that the relational model is based upon). This revised version of 1NF is controversial however and not everyone agrees on whether it is necessary or desirable always to avoid nested relations. Fortunately the dilemma of nested relations can usually be discounted. Most DBMS vendors don't even support nested relations even though they are part of standard SQL. Most database developers will avoid using nested tables even when they are available in their software.

The second type of "repeating" data you may have in mind is the idea of multiple columns in a table serving the same or similar purpose. Such designs are often regarded as an anti-pattern but sometimes quite wrongly labelled as a "repeating group" even though they are no such thing. A classic example of this kind of anti-pattern is where you have a collection of columns enumerated for the same kind of data item. E.g. a contacts table with columns for multiple email addresses called EmailAddress1, EmailAddress2, EmailAddress3, etc. This is usually a bad idea on the principle of DRY (Don't Repeat Yourself). If you store the same kind of attribute multiple times then any logic that depends on that piece of data may have to be duplicated or made more complex than it needs to be. Although such designs are often undesirable, formally speaking this type of repetition only violates Normal Form if nulls are allowed in any of the columns where no value is specified. Tables without nulls are a requirement of all of the "classic" normal forms (1st-6th NF, EKNF, BCNF, PJNF), although the "no nulls" requirement is frequently forgotten or ignored by database developers who use nulls.

So far I've been describing 1st Normal Form (1NF) and possible violations of it. 1NF is a complicated and often controversial topic but there's a decent introductory write-up here by Anith Sen:
www.simple-talk.com/sql/learn-sql-server/facts-and-fallacies-about-first-normal-form/

A third type of "repeating" data has been discussed already in this thread and here I do have to disagree with some of what GanzPopp said, or at least clarify the meaning of it a little. Let's take the example of StadiumName. The stadium that a match is played at is certainly an attribute of a match and surely does belong in any match table. Of course if there is more than one match at any stadium then the values of the stadium attribute will be repeated on multiple rows. Note very carefully however that normalization (the principles of analysing dependencies and applying normal form) is entirely independent of and indifferent to the type of data used to identify a stadium. Normalization says nothing about what type of attribute you should use to identify a stadium and certainly does not require you to think differently about the stadium attribute just because its values repeat on multiple rows or because it may be a text value and not a number. Substituting a numeric StadiumId in place of StadiumName (i.e. replacing a string with a number) therefore has nothing to do with normalization. In any case, such a substitution obviously does not eliminate "repeated" data from the match table: StadiumId would be repeated just as often as the stadium name. This is an extremely important point if you want to understand what normalization is (after all, this the database theory forum!). In fact normalization never involves inventing new attributes or substituting one attribute for another. Normalization certainly does not require you to identify or eliminate data simply on the basis that it repeats on multiple rows of a table - something which would anyway be impossible and/or totally unnecessary and undesirable.

The question you really need to consider in this case is what is the most prudent, concise and accurate way to identify a stadium. There are three useful criteria commonly applied to choosing identifiers - identifiers which typically become keys in the database. Those criteria are: Familiarity, Simplicity, Stability. A stadium name is probably familiar to the users of the match database but it isn't necessarily very simple or stable - there could be multiple spellings or formattings of a name and the name might well be subject to unpredictable changes, e.g. when the team's sponsor changes.

While there are plenty of reasons why you may or may not want to have a stadium name in the match table, you should also appreciate that those reasons have nothing to do with normalization per se and really have very little to do with eliminating repeating data.

My advice to you is that you put aside the misleading notion of "repeating" data - it is not really a helpful or important concept in database design. Study a good book on database design to understand some more of the things you should be thinking about. One very good, concise but insightful book that I often recommend is Fabian Pascal's "Practical Issues in Database Management". For a more in-depth treatment of how to design a database: "Information Modeling and Relational Databases" by Terry Halpin. Whatever you do, don't rely on what you find on Wikipedia or many other online sources (including this post of mine!). Study some reliable and well recommended sources of information or take a reputable course. Avoid books or courses based around specific software so that you can master general principles first before you get into the nuances of software products.

Hope this helps.

ACM · Feb 27, 2013

GanzPopp said:
Yes, you are exactly right! I will elaborate a bit, so you can decide for yourself in the future:

If specific data in a specific field is very likely to be present in multiple rows, then it is repeatable. Then you can consider to create a look-up table with unique IDs for use as a reference in the specific field in the main table. This is the case for all field where I added ID in the field name. I do this because then I know the values in this field are references.

Is the value in a specific field of the main table always unique then it is non-repeatable and you should store that value in the main table. Therefore these fields don't have ID added in the name, since the actual value is stored.

I hope this is clear now, if not feel free to ask again.

What about adding primary keys to these tables? The examples I've seen split the table in half and assign a primary key to the top table and assign 2 primary keys to the bottom table. However, looking at my list, the whole of the top table are primary keys. How do I advance beyond this and into 2NF?

SeasonID
DivisionID
HomeTeamID
AwayTeamID
StadiumID
RefereeID
StatusID

HomeGoals
AwayGoals
Attendance
Date
Time

GanzPopp · Feb 27, 2013

Personally I wouldn't split the table since that is not necessary. You will need a reference in each table to be able to connect the corresponding rows from both tables. The above-proposed setup with the ID references provides excellent ways of managing your data and has little redundancy. Most important is that you can always reproduce the information from the data stored and provide it to yourself or the end-user in a human-readable and editable way.

More examples are given here: 2NF and 3NF. If you can understand those pages you don't need much more to start a well-designed database. I wouldn't worry about all these design principles too much right now, as it seems you have little experience in building a database.

About the primary keys: it is useful to have at least one key defined per table. For example, the look-up tables would have the primary key defined for the ID field. This way the database engine can quickly look up referred data. For the main table you can maybe define a MatchID field which will just give a unique identifier for each match defined in your database.

Normalisation Help (1 Viewer)

ACM

Registered User.

GanzPopp

Registered User.

ACM

Registered User.

GanzPopp

Registered User.

ACM

Registered User.

GanzPopp

Registered User.

ACM

Registered User.

GanzPopp

Registered User.

ButtonMoon

Registered User.

ACM

Registered User.

GanzPopp

Registered User.

Similar threads

Users who are viewing this thread