To Null, or Not to Null

WaiveIt

That is the question.

Letting Null happen is so easy and fast when designing import functions and getting data from forms.

I have read a lot of "Avoid Null", but not much on why.

So, why should I spend the effort building an application that avoids Nulls?

and

When is Null acceptable or preferred? And when should Null be absolutely avoided?
 
I see no reason to avoid Nulls. I suspect you have come across people repeating what they have heard without any understanding of why. There are myths in Access lore that can be traced back to hearsay, all built upon one unsubstantiated claim.

Anything entered in a record where no real information was actually recorded is misleading. Null does not mean the same as zero.

For example, if you record the number of something as zero to avoid a Null when you don't actually know the real value, then you have stored the wrong information.

I have imports where I go to the trouble of forcing Nulls where the data would otherwise arrive as zero.
 

It's usually good practice to avoid or minimise the use of nulls. Nulls cause some queries to return results that are "incorrect" (i.e. the results won't correspond with the intended meaning of the database). Unfortunately SQL and SQL-style databases can make nulls difficult, though not necessarily impossible, to avoid. It's a very real problem and even experts often have trouble spotting flaws in query logic caused by nulls.
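A classic example of such a flaw, using hypothetical Customers and BlockList tables: if BlockList.Phone allows nulls, the first query below silently returns no rows at all once a single null is present, because "x NOT IN (..., Null)" evaluates to Unknown rather than True.

SELECT CustomerID
FROM Customers
WHERE Phone NOT IN (SELECT Phone FROM BlockList);

-- Excluding the nulls restores the intended result:
SELECT CustomerID
FROM Customers
WHERE Phone NOT IN (SELECT Phone FROM BlockList
                    WHERE Phone IS NOT NULL);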

Since there is nothing like nulls in the real world, using them means making some compromises in the way your database represents reality. In fact there is no single consistent "meaning" of nulls and little general agreement on what they are for. In practice, nulls get used to represent all sorts of different situations. When you do use them it's a good idea to document exactly what a null means for any given attribute.

Here's a very good lecture about the "null problem" by Chris Date (founding influence and leading authority on relational database theory):
http://www.youtube.com/watch?v=kU-MXf2TsPE
 
I seem to recall a discussion on this subject some years ago, perhaps with my old friend Banana. Some of the domain aggregate functions work OK with nulls in the relevant data stream but others don't. Therefore, if you are going to have nulls (which technically is OK), what you REALLY need to do is decide how you will handle them when you get them.
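To make that concrete (generic SQL; the Employees table and Bonus column are made up for illustration):

SELECT COUNT(*)              AS AllRows,         -- counts every row
       COUNT(Bonus)          AS RowsWithBonus,   -- silently skips nulls
       AVG(Bonus)            AS AvgOfNonNulls,   -- nulls excluded here too
       SUM(Bonus) / COUNT(*) AS AvgIfNullIsZero  -- a deliberate choice
FROM Employees;

The domain aggregates behave the same way: DCount("*", "Employees") counts all records, while DCount("Bonus", "Employees") skips the nulls.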

Another thread here mentions one-to-one relationships and brings into play the concept of zero cardinality, meaning "one of the participating tables has a record for key value XXX but the other table does not." In that case, an outer join of the two tables would show nulls in the fields derived from the table that is missing the record.

By extension, if you had other relationships that included 0, 1, or many associated records (say, records representing dependents in a personnel benefits table), the joins would have to allow for the zero dependents case without dropping the employee's primary record from processing. So you WOULD get some nulls for that case. I'm sure many folks here have run across such situations many times. I won't bother to contrive other examples.

The answer, to me, is always simple. When a null is a possibility, plan for it. Decide how and when you test for it and decide what you need to do when the test comes back positive. The ONLY wrong answer is to know you might have nulls and do nothing about it.
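As a minimal sketch of what "plan for it" can look like in Access-style SQL (the table and column names are hypothetical):

SELECT EmployeeName,
       IIf(IsNull(MiddleName), "(none recorded)", MiddleName) AS MiddleNameShown,
       Nz(Bonus, 0) AS BonusTreatedAsZero
FROM Employees;

Each substitution is an explicit, documented decision instead of an accident of whatever the data happens to contain.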
 
The answer, to me, is always simple. When a null is a possibility, plan for it. Decide how and when you test for it and decide what you need to do when the test comes back positive. The ONLY wrong answer is to know you might have nulls and do nothing about it.

My feelings exactly.
 
It appears that I may have misread / misinterpreted the question here. The question is posted in the database design forum and I read it as a question about database design, i.e. whether a database designer should permit nulls in tables and why it might not be a good idea to do so. That was the question I tried to answer.

On rereading the original question and the answer by Doc Man, it seems the topic is actually applications that use data and not necessarily the design of the database itself. Databases can obviously be designed either with or without nulls. Equally obviously, if an application is required to deal with nulls as one of its expected inputs then it is very sensible to consider and design for that eventuality!
 
I am not sure that you can build a database that will have no nulls in it, even if you happen not to call them nulls. You can build a QUERY that might never have any nulls - by judicious use of the NZ function or its equivalent in your database. But if I remember my ANSI SQL rules (and admittedly, it has been a while, so be nice, folks...), if you do a JOIN of tables in a one-many relationship and the many table can have nothing to match, you can choose an INNER JOIN or an OUTER JOIN. If you take the INNER, you don't see the unmatched "one" record. If you take the OUTER JOIN, you will ALWAYS face the chance of seeing nulls in the raw datastream. Set theory would demand such a response, as would formal interrogative logic.
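A sketch of that choice, reusing the hypothetical employees-and-dependents case from the earlier post:

-- INNER JOIN: employees with no dependents vanish from the result.
SELECT E.EmployeeNum, D.DependentName
FROM Employees AS E
INNER JOIN Dependents AS D ON E.EmployeeNum = D.EmployeeNum;

-- LEFT OUTER JOIN: every employee appears, but unmatched ones carry
-- a null in DependentName that is generated by the join itself.
SELECT E.EmployeeNum, D.DependentName
FROM Employees AS E
LEFT JOIN Dependents AS D ON E.EmployeeNum = D.EmployeeNum;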

I don't know whether you want to call the missing data NULL or represent it with a code not allowed as a data element. Doesn't matter whether it is NULL or DID NOT RESPOND or NOTHING THERE or VOID. Whatever the code in whatever the implementation, you can never avoid the situation. If the database itself disallows the existence of nulls then all that means is that someone made a default rule equivalent to the NZ function because the situation CAN exist in the real world. If the database is a way of modeling the real world, it NEEDS the null case to represent something that is isolated or unmatched.
 
I am not sure that you can build a database that will have no nulls in it, even if you happen to not call them nulls

It is certainly possible and there are many successful databases designed without nulls. That does not mean using some value as a default or "surrogate null"; it simply means taking the fact-based modelling approach (e.g. see Halpin/Nijssen). Each table models a set of propositions which are understood to be true propositions (facts). If some proposition doesn't include a value for X then that proposition rightly belongs in a table which doesn't have attribute X. That is the model implied by correct application of Normal Form in database design.

Sometimes for various reasons database designers choose to deviate from Fifth Normal Form by adding nullable attributes to a database design. In those cases nulls are a purely technical feature added to the original and accurate representation of the business domain - a representation which did not have nulls. Nulls aren't part of the reality being modelled, they are part of its technical implementation.

In this thread I've talked about nulls in the SQL sense of the word because that's what null means in SQL DBMSs and other systems (like Access/Jet) that are derived from the SQL model. Science and maths modelled the world accurately without nulls for thousands of years before the invention of nulls in 1970s databases. Science and maths continue to do so today without using anything like nulls. SQL nulls certainly are not something implied by logic or set theory. As you may know, the relational model is founded on set theory and first order logic and does not include nulls (Codd 1969, 1970). Codd much later proposed null marks as part of an "extended" version of the relational model. They aren't required and are not even possible in the relational model proper.

I notice in this post you say that null is "based on" the meaning of null in set theory. It's true that "null set" is an alternative name for the empty set in set theory. That has nothing to do with SQL-style or Codd-style nulls however. Codd makes it crystal clear that nulls are marks, not values or sets of values and that null marks are completely distinct from all other values including the empty set. Empty sets are properly supported in the relational model and are quite distinct from nulls even in Codd’s proposed extension of that model. The empty set has the important property of being equal to itself and different from other sets, i.e. X=X is true and X=X’ is false if X is the empty set. That’s not the case if X is null because null is not a set or a value and is nothing like the empty set.

Null and the empty set are concepts distinct from each other in SQL just as they are distinct concepts in relational theory, but because the SQL model of data is not truly relational it doesn't properly support empty sets in all the (non-null) ways that the relational model does. It is extremely unfortunate that the word "null" in set theory is used to describe the empty set but the same word is used in SQL to mean something completely different and unrelated. Keeping the concept of null apart from non-nulls and sets of values (empty or otherwise) is one of the things that students of database theory are supposed to learn. If, when you wrote that historical post, you thought that SQL’s null or Codd’s null was based on or similar to the empty set then I respectfully suggest that you hadn’t properly understood the relational model. I realise it was written more than six years ago.

In the same post you also say that Codd clearly thought multiple null marks were similar to multiple infinities. I seriously doubt it. The relational model and its implementations are finite systems. The only apparent parallel between null marks and infinity is the superficial one that there is more than one kind of each. Codd was proven wrong about there being only two possible versions of nulls. Hundreds have been proposed since then but they suffered much the same fate as Codd’s proposal: never widely accepted and seldom used outside of an extremely limited academic setting. The only kind of null that matters in data management practice is the SQL kind and that is what myself and many others (including DBMS vendors like Microsoft in their official documentation) recommend should be avoided or at least kept to a minimum.

I mention your old post only because I recently found it linked on another site where some confusions about nulls were being made. I wondered if your views might have changed since then.
 
Avoiding Nulls is all very well in theory, but let's take a practical example.

A table with fields for FirstName, MiddleName, Surname. Sorry, you can't have that. Some people don't have a middle name, and since Nulls are banned... Indeed some people only have one name, and it isn't even possible to define that name in the sense of FirstName or Surname.

Therefore "proper" design demands that all names be held in a related table like the structure I described in Post 13 of this thread.

Any of you doing things properly and storing names like this? :rolleyes:
 
It is certainly possible and there are many successful databases designed without nulls. ...


are you sure?

empirically a name might be

John Smith,
or John Frederick Smith
or even John Frederick Steven Smith

it's one thing having the human intelligence to "know" that each of these represent a valid "name"

it's completely different defining this in a logical system. One solution is not to separate the forenames and surname, and treat the name as an atomic entity. However, we generally need to be able to separate out the surname.

To take your solution, you seem to be implying we need 2 (or more) tables - one for "names" WITH forename, surname AND middle name, and one for "names" WITHOUT the middle name - and then we have to move data between these tables as we edit the offending missing values. And we would STILL need to union the sets for both cases ....

I know in the US they often use a "NMN" for "no middle name" - but it is hard to avoid the underlying reality of using nulls - or an indicator of some sort that the value is not there, or at any rate not identified.

I suspect that the simple answer is that computers cannot handle a null set (as you are discussing above), which is why we have to treat them specially.
 
To take your solution, you seem to be implying we need 2 (or more) tables - one for "names" WITH forename, surname AND middle name, and one for "names" WITHOUT the middle name - and then we have to move data between these tables as we edit the offending missing values. And we would STILL need to union the sets for both cases ....

That certainly would not be a proper solution.

Have a look at the link I posted. It is the proper normalization of name data, though in practice I have never seen it used.

Multiple names form a repeating group and hence should be held in the same field across multiple records, as I showed.
 
Much ado about nothing?
I managed a federal DB that had no nulls. Of course, in their wisdom of moving from a legacy system, zero was common. It was, to say the least, very @#*%^!@
The argument for defining data types is very valid from the design aspect.
However, for quick delivery solutions that will pay the bills, the null would seem to be perfectly acceptable.
There are good arguments on both sides. Just don't create a thesis paper when a yellow sticky note will do.

On the names, I assisted on a large IBM DB that managed the largest 401K (it's a tax retirement thing) DB at the time. They had normalized a very large system to the point that it wouldn't function. They had a lookup index for every last name so "smith" would be efficient. They also had indexes for First, MI ... and for huge amounts of street address (123 Elm).
It was well designed, documented and planned by some of the most expensive people in the industry at that time. But it became unusable. It had to be de-normalized to be functional.
To me, the take-away is that theory is very important. However, the objectives are also important aspects of any project. So, the correct answer (for any contractor) is "it depends".
 
To take your solution, you seem to be implying we need 2 (or more) tables - one for "names" WITH forename, surname AND middle name, and one for "names" WITHOUT the middle name - and then we have to move data between these tables as we edit the offending missing values. And we would STILL need to union the sets for both cases ....

I know in the US they often use a "NMN" for "no middle name" - but it is hard to avoid the underlying reality of using nulls - or an indicator of some sort that the value is not there, or at any rate not identified.

You propose using nulls in name columns even when the full name is applicable, known and complete? That sounds like a really bad idea to me. It illustrates what I said before: nulls get used for so many different things that even if the intended meaning of a null is documented in each case there's little chance of making logical sense of the results. Isn't a middle name of "Joe" different from the middle name of someone who doesn't have one? What would you expect to achieve with nulls that wouldn't be achieved by making an empty middle name = "" (empty string)?

I generally wouldn't choose to create a middle name attribute. A single "Given Name" or even a "Given Names" column is a preferable and perfectly valid option for many situations. Or if you really do care about individual names then you should decompose them into a separate names table as Galaxiom also suggested. Try to avoid terms like "first", "middle" and "last" because they mean different things to different cultures.

Names are a very uninteresting example for a discussion about nulls however and it's a shame the thread has taken this turn.

I suspect that the simple answer is that computers cannot handle a null set (as you are discussing above), which is why we have to treat them specially.

Computers and databases can. They do. It's nulls that are the billion dollar problem, not empty sets (vide Tony Hoare).
 
Names are a very uninteresting example for a discussion about nulls however and it's a shame the thread has taken this turn.

Sure, but it is an easily understood example of a practical reality.

Could you give a solid practical example of something you consider more appropriate? Instead of just Codd-pieces. ;)
 
Could you give a solid practical example of something you consider more appropriate? Instead of just Codd-pieces. ;)

Example:
[Schema diagram (EmployeePensionExample.jpg): an Employee table and a related EmployeePension table]


In this schema, employees can optionally belong to a pension scheme. The attributes for pension roll number, type and contributions apply if and only if an employee belongs to a pension scheme. This is the design required to satisfy BCNF / 5NF if {EmployeeNum} -> {Salary, Department}, {RollNum} -> {EmployeeNum, SchemeType, EmployerContribution, EmployeeContribution}, and given that an employee can enroll in the pension scheme no more than once.
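A rough DDL rendering of that design (generic SQL; the names follow the diagram, but the data types are assumptions):

CREATE TABLE Employee (
    EmployeeNum INTEGER PRIMARY KEY,
    Salary      DECIMAL(10,2) NOT NULL,
    Department  VARCHAR(50)   NOT NULL
);

CREATE TABLE EmployeePension (
    RollNum              INTEGER PRIMARY KEY,
    EmployeeNum          INTEGER NOT NULL UNIQUE
                         REFERENCES Employee (EmployeeNum),
    SchemeType           VARCHAR(20)   NOT NULL,
    EmployerContribution DECIMAL(10,2) NOT NULL,
    EmployeeContribution DECIMAL(10,2) NOT NULL
);

The UNIQUE constraint on EmployeeNum enforces "enrolls no more than once". Every column is NOT NULL, yet an employee with no pension is fully representable: they simply have no EmployeePension row.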

A possible (IMO bad) alternative would be to merge all the pension data into the employee table and make the pension-specific columns nullable where the employee isn't enrolled in a pension. This "nullable" version of the database would have at least the following disadvantages:

- Loss of the key on RollNum (RollNum is specific to pension-enrolled employees so it is relegated to being a nullable attribute)
- Consequently the key dependencies on RollNum would not be enforced as required to satisfy 3NF (i.e. {RollNum} is no longer a key determinant).
- Allows required pension attributes to be left null when they should be populated. In my original design the attributes for employees in the pension scheme were all mandatory (non-null); now they are optional.
- Allows pension attributes to be wrongly assigned to employees who don't even have a roll number because they aren't enrolled in the pension scheme.
- No longer possible for other pension-specific tables to have foreign keys referencing the EmployeeNum in the EmployeePension table.

You could apply procedural code, some fairly complex database constraints and other strategies to address some, but probably not all, of these potential data integrity problems. The result would be a database that is more complex to develop and maintain and more difficult to comprehend. Unless you read the code behind the constraints you wouldn't know what business rules were being implemented, whereas in my model they are pretty obvious just by looking at the keys and foreign keys.

A nullable version of the schema would of course force users to deal with the complexities of nulls in columns where they wouldn't otherwise have to worry about them. A nullable schema suffers the same disadvantage implied by any kind of denormalization: it introduces unnecessary "bias" into the schema by tailoring it to one assumed pattern of usage while making it less suitable for other purposes. It is a compromise of flexibility and a potential source of technical debt in any database solution.

Best practice is to design a schema without nulls and then add nulls to the picture only where technical expediency dictates it. One "good" reason why you might want to denormalize part of a database schema for example is to enforce some business rule that couldn't be implemented without doing so, e.g. where the 5NF alternative wouldn't preserve some important dependency, or where the DBMS wouldn't otherwise permit an integrity constraint that spans multiple tables. Given the disadvantages of adding nulls to a database design they should be added only sparingly and after careful consideration.
 

I sense that we are quibbling about terms here. Let's cut to the chase.

Null (assuming you weren't talking about the ASCII NULL character) means the computer got nothing back from whatever was attempted. There are a lot of these floating around. For example, in IEEE formats, there is a bit configuration called "NaN" (not a number) that you can use to determine that a given variable has not been initialized. The point is that if you don't define it, the computer can't tell you what it is - and good, bad, or indifferent, we use the word "NULL" (in the context of "null result") to represent something that the computer cannot tell us.

Assertoric logic deals with nulls, but interrogative logic deals with them more often, since that branch of logic (sometimes) asks the question of whether the query / question / action even condones an answer, much less whether it currently has one. There, we can make a distinction between a NULL answer and a FALSE question - the latter doesn't condone an answer whereas the former case merely didn't find anything to answer.

A database will ALWAYS have the opportunity for returning the null set (as the result of a query with zero records). However, we can ask whether any records were returned before we try to actually process them. We simply have intercepted the null rather than tried to process it. If you look at non-finite automata theory, you recognize that intercepting nulls is simply the attempt to convert a non-finite automaton to a finite one.

Similarly, any variant of OUTER JOIN will sometimes return null components for cases where the parent record has no matching child record. If you ALWAYS apply the NZ function or other analogous method to mitigate those nulls, then your database will not SEE any nulls - but it had them until it hid them by doing something else with them.
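For instance (Access-style SQL; Employees and Dependents are hypothetical tables):

SELECT E.EmployeeNum,
       Nz(D.DependentName, "(no dependents)") AS DependentShown
FROM Employees AS E
LEFT JOIN Dependents AS D ON E.EmployeeNum = D.EmployeeNum;

Nz substitutes its second argument wherever the outer join generated a null, so the consumer of this query never sees a null - but the null was there until it was hidden.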

I don't deny that you CAN take such mitigating action - but let's not hide the fact that the REAL world often has the "no matching data" case, which is the most common database null. To claim that a database will never have nulls simply means that you have spent considerable time in your design so that all nulls are properly mitigated - to which I say KUDOS - because that level of mitigation ain't easy.
 
I sense that we are quibbling about terms here.
It's not a disagreement about terms at all. It's about two fundamentally different concepts: one concept is NULL (in the SQL database sense or in the sense of Codd's "null marks"); the other concept is the empty set, which is not a NULL although it is sometimes referred to by mathematicians as the "null set". "Null set" is a term rarely used in the data management field, presumably to avoid any risk of confusion with NULLs. Accordingly I am going to avoid using the term null set and stick with the more widely used term empty set. I'll try and pick out the cases where you are mixing these two things.

Null means the computer got nothing back from whatever was attempted.
NO. That's what the empty set means (at least in relational database terms). In relational algebra, the expression:
[Image (emptyselect.jpg): a relational algebra expression restricting R by a false condition]


returns the empty set. It does not ever return NULL. No query in the classical relational algebra returns a NULL.

Similarly in SQL, the near-equivalent query:

SELECT * FROM R WHERE 1=0;

returns an empty set. It does not return NULL. Even Codd's "Rule 3" (which by the way post-dates the relational model by two decades) clearly stipulates that NULL must be distinct from all other values in the database - including of course, empty sets, which are distinct values in Codd's relational algebra.

The result of the above expression in the RA is a value (the empty set) which is equal to itself and is not equal to other values. The = and != operators return TRUE or FALSE as expected. That would not be the case if the result was NULL.
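That difference is directly observable in SQL itself (R as above; a numeric column X is assumed):

SELECT * FROM R WHERE 1=0;         -- zero rows: an empty set, not NULL

SELECT COUNT(*) FROM R WHERE 1=0;  -- one row containing 0

SELECT MAX(X) FROM R WHERE 1=0;    -- one row containing NULL: a place where
                                   -- SQL injects a NULL even though the
                                   -- underlying answer is simply "empty"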

we use the world "NULL" (in the context of "null result") to represent something that the computer cannot tell us.
The empty set again. It would be unwise to use NULL to represent such a case.

A database will ALWAYS have the opportunity for returning the null set (as the result of a query with zero records).
Correct. The empty set, that is, which is something different to NULL. A query returning no rows self-evidently is something different from a query that returns NULL. Always has been, always will be.

Similarly, any variant of OUTER JOIN will sometimes return null components for cases where the parent record has no matching child record.
That's actually just an annoying "feature" of SQL. OUTER JOIN is a derived operator: a join followed by a union. Arguably a "better" alternative would be to return two result sets or simply to provide a more meaningful "outer tuple" to join into the query.

To claim that a database will never have nulls simply means that you have spent considerable time in your design so that all nulls are properly mitigated - to which I say KUDOS - because that level of mitigation ain't easy.
It isn't easy in SQL-style databases. It is extremely easy in relational ones.
 

But to take Galaxiom's example that some names do have a middle name and others do not: you need to establish a way to manage and use a set of "persons" that covers both cases.

you therefore need a way of representing something that does not exist, and often the default state of the appropriate variable type is not quite adequate, I think.

it seems to me that arguing the differences between a null and an empty set is somewhat similar to arguing that one infinity is different to another. It may be true in absolute mathematical terms, but is really of little practical importance in empirical terms.
 
it seems to me that arguing the differences between a null and an empty set is somewhat similar to arguing that one infinity is different to another. It may be true in absolute mathematical terms, but is really of little practical importance in empirical terms.

It is of enormous practical importance! It's the difference between getting results that are right rather than wrong and it is just as fundamental and important as 0<1 or 2+2=4. If your DBMS ever returned a null when it should return an empty set then your DBMS would be broken. I'm actually surprised if there are people who have been using database systems for more than a few months who don't yet know the difference between a null result and a result that isn't null.
 
button, I think you are creating a problem where none exists - or maybe more correctly one where we are aware of the problem, and are used to handling it.

take this, from one of your posts

It's not a disagreement about terms at all. It's about two fundamentally different concepts: one concept is NULL (in the SQL database sense or in the sense of Codd's "null marks"); the other concept is the empty set ...
If mathematicians refer to sets wrongly in strict terms, but probably correctly both colloquially and empirically, then we can do the same.

So if we want to find orphan records, and do so by establishing that a join of one sort or another produces no records, it is of little significance whether that is (in truth) a null, a null set or an empty set.
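For instance, the usual orphan-record test (Parent and Child are stand-in names) works the same whatever we call the blank:

SELECT C.*
FROM Child AS C
LEFT JOIN Parent AS P ON C.ParentID = P.ParentID
WHERE P.ParentID IS NULL;

-- the null tested here is produced by the outer join, not stored in any table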

And (again without me having the pure math skills to correctly annotate the point in logic terms) I know the difference between a tuple/entity/object that is blank because

a) the value is known to not exist
b) the value is currently not known
c) the database object I examined produced no records.

and I can loosely term all of these as blanks, while knowing they represent different categories of blank.
 