Calculated fields, Dynamically calculate vs Pre-calculate

emorris1000 · Mar 15, 2013

I'm working on a new database which will need to regularly display a bunch of values that are derived from raw data in the database. Like, I have x and y and the form needs to show x/y. Usually the calculations are simple, occasionally they are more complex (like one requires a very nasty application of solver, but I'm not discussing that one here because it involves a lot of other factors.)

I've generally always maintained the attitude that you should dynamically calculate these and only generate them downstream to keep things simple/clean/pure/etc., as opposed to pre-calculating them and keeping them in the relevant table, as that violates rules of normality.

But I'm starting to work on a more advanced application of these values that would put these calculations into an already complex search query (complex enough that the query design wizard will crash if you look at them in any form other than SQL). So now I'm entertaining the idea of pre-calculating them, because in this case at least any efficiency is lost when trying to dynamically calculate them.

I'm curious if other people have internal logic they use for determining when/if they should pre-calculate values, or do they have hard rules to never do this?

ButtonMoon · Mar 16, 2013

There are various options, partly depending on what software features you can take advantage of. Certainly use a client-server DBMS so that any calculations can be performed and cached server-side. Many DBMSs have features like materialized views / snapshots, function-based indexes and computed columns to maintain the results of complex calculations in the database.

There's nothing necessarily "wrong" with persisting the results of calculations directly in tables if it's efficient and useful to do so. Such calculations don't usually need to violate Normal Form either. Storing data that's determined only from non-key values in the same row would be a violation of BCNF but it's seldom useful to persist those kinds of calculations anyway. Typically the type of calculations that it's desirable to persist require aggregations or joins or iterative calculations that don't involve the kind of functional or join dependencies that classic normalization theory is concerned about.

Another good reason to persist calculations is if you want to break any dependency on the underlying data. E.g. to preserve historical values at a point in time or to save summary data after detailed data is archived.

Hope this helps.

Pat Hartman · Mar 18, 2013

Storing calculated values violates normal forms. That doesn't mean you wouldn't ever do it. It does mean that you need to understand what you are doing and the risks it entails. You also need to be sure it is necessary. Violatiing normalization rules shouldn't be your first option, it should be your last.

ButtonMoon · Mar 18, 2013

Pat Hartman said:
Storing calculated values violates normal forms.

Hi Pat,

As I already mentioned there are some cases where storing a calculated value would create a non-key dependency in violation of normal form but those cases are ones where it's probably not desirable to store a calculated value anyway. The more typical and more interesting situations where it is useful to store calculated values don't require you to violate any normal form.

E.g. storing totals in a periodic summary table or deriving some values from other rows in the same table. Unless you intentionally create a non-key join dependency within any single table you can (and probably should) still ensure your tables satisfy 5th Normal Form.

Storing the results of calculations is very common in lots of well-designed databases, especially in business intelligence and management information systems for example. What I would suggest to emorris is that if he finds a need to store calculated values then he does so in a way that doesn't violate Normal Form. Maybe that's what you meant to say too.

gemma-the-husky · Mar 18, 2013

ButtonMoon said:
Hi Pat,

As I already mentioned there are some cases where storing a calculated value would create a non-key dependency in violation of normal form but those cases are ones where it's probably not desirable to store a calculated value anyway. The more typical and more interesting situations where it is useful to store calculated values don't require you to violate any normal form.

E.g. storing totals in a periodic summary table or deriving some values from other rows in the same table. Unless you intentionally create a non-key join dependency within any single table you can (and probably should) still ensure your tables satisfy 5th Normal Form.

Storing the results of calculations is very common in lots of well-designed databases, especially in business intelligence and management information systems for example. What I would suggest to emorris is that if he finds a need to store calculated values then he does so in a way that doesn't violate Normal Form. Maybe that's what you meant to say too.

surely the thing is managing the integrity of data.

let's say a data warehouse stores a collapsed sum of a domain of items, as well as the itesm themselves, for speed of access.

The problem that I can see is that if something happens to the underlying data, then the collapsed total needs to be updated to reflect that.

This is typical in say, a sales ledger (US receivables) where the system is likely to sotre a current account balance, as well as the underlying transactions. So somewhere in there, there is often a "data integrity", or "reconciliation" test, that doble checks that the transactions do indeed sum to the account total.

In general, a calculated total is worth storing for data that is oftten referred to, or needs a bit of work to derive, AND is unlikely to change - such a VAT/sales tax rate, and value on an invoice.

Pat Hartman · Mar 18, 2013

Button - I'm going to guess that the average poster here has no idea what a data warehouse is or why it is fine or even best if their schemas voilate the first three normal forms. On the rare occassions where we get a question regarding a data warehouse application, the poster tells us that is what we are dealing with. As long as we are working with a transactional application, storing calculated values is fraught with danger and simply not recommended. Even experts who decide to violate the rules and do it can leave gaps that result in eroneous results. As developers we have a fiduciary responsibility to keep our client's data safe and accurate to the best of our ability. That's why as Dave suggested, if you store the calculation, you need to have a way of finding errors caused by updates that didn't get rolled up correctly.

ButtonMoon · Mar 18, 2013

Pat,

As so often in data management the only answer that is invariably true is: It depends. There are (I hoped you would agree) plenty of situations when not storing a calculated value would likely be a poor choice that I for one wouldn't usually recommend. E.g. I would recommend putting periodic balance totals in a retail banking transaction database. I would generally put a calculated price or line item total on an order record. Those are indeed data integrity issues: ones where not storing a calculated value very likely would cause a loss of information and perhaps integrity (or at least loss of utility and correctness).

Of course data integrity is of enormous importance in data management but when someone asks for guidance I prefer to assume that people are competent to evaluate the issues under discussion and don't need to be frightened with, frankly, scaremongering phrases like "fraught with danger"! In a "Theory and practice of database design" forum I would hope it's in order to give more detailed, practical and sound advice than that.

Proper handling of calculations and calculated data is an essential aspect of database design that should be understood, mastered and used by everyone working in the field. That's why I spent my first paragraph in this thread mentioning some of the features that DBMSs use to ensure integrity of calculated values.

I also already said a couple of times said that calculated values should generally be stored only when doing so would not violate Normal Forms (ignoring DKNF, which is unimportant here) so I'm not sure why you needed to bring that up again. What other "rules" you might be referring to I'm not sure. In a scientific discipline like data management "rules" are not statutes that forbid you from doing things or that can or should be "broken" according to choice. The only rules are theorems that explain phenomena and are tools to solve problems.

Anyway I hope some of the things I've said have helped with the kind of informative discussion that emorris wanted about these issues.

Pat Hartman · Mar 20, 2013

I also already said a couple of times said that calculated values should generally be stored only when doing so would not violate Normal Forms (ignoring DKNF, which is unimportant here) so I'm not sure why you needed to bring that up again

That's just it. Storing a calculated value ALWAYS violates normal forms no matter what logic you use to justify it. I never said you couldn't do it. I said it shouldn't be your first option and you should only do it if you understand the risks and are prepared to protect the validity of your data.

ButtonMoon · Mar 20, 2013

Pat Hartman said:
Storing a calculated value ALWAYS violates normal forms no matter what logic you use to justify it.

That's certainly not true. If I create a summary table which aggregates data in another table then there is typically no additional join dependency created by that. Eg:

R {x,y,z} (key = {x,y})
Rtotal {x, T} (key = {x})

where T is z summed and grouped by x from R. R and Rtotal both satisfy 6NF (and therefore all normal forms below 6NF).

If I store a line item price on an order item

OrderItem {OrderNum, SKU, Qty, Price} (key = {OrderNum, SKU})

I'm not violating any Normal Form because there's no non-key join dependency implied by Price in this table or any table. The set of FDs being satisfied are {OrderNum, SKU}->{Qty, Price}. SKU alone doesn't determine price because the Price changes over time and calculated from other factors like discounts.

Another of the most common reasons for storing calculated values is precisely to avoid or eliminate any dependency on the underlying data. E.g. an annual or monthly balance in a sales ledger or stock control system creates a value which does not depend on the base data and is not supposed to change if that data changes (e.g. if it is archived and deleted). When you create a snapshot of some table or query then that object is intended to be independent of the underlying data because the DBMS only updates it periodically. You can't violate normal form by removing dependencies, only by creating them.

Hope that helps explain what I mean.

gemma-the-husky · Mar 20, 2013

Button. I don't have the theoretical set logic to argue with you - but I can see where Pat is coming from, as I mentioned earlier in #5

if you collapse and store a total in the summary data table, and then change the data in the base table - I don't mean delete the data, I mean edit the data, because something has changed), then the summary is now no longer correct.

While it may be logically in normal form, there is still redundancy, and a possibility of losing data integrity.

ButtonMoon · Mar 20, 2013

gemma-the-husky said:
if you collapse and store a total in the summary data table, and then change the data in the base table - I don't mean delete the data, I mean edit the data, because something has changed), then the summary is now no longer correct.

Dave, yes you are right that there are data integrity issues to consider, just as there should be for for any data of any kind. By persisting calculated data you are possibly adding redundancy in some form, but redundancy isn't always a bad thing and is sometimes very useful. It's pretty certain that almost all non-trivial databases contain redundancy by some measure. In information theory redundancy is actually defined as the useful information in a system.

Redundancy is useful even in the very simple examples I already described. If you increase a price in your product pricing table, would you really want to update the price on every order already placed? Probably not if you want to keep your customers happy!

Pat Hartman · Mar 21, 2013

Redundancy is useful even in the very simple examples I already described. If you increase a price in your product pricing table, would you really want to update the price on every order already placed? Probably not if you want to keep your customers happy!

Storing the unit price in the order details table is not redundant since the price depends on the point in time the order was placed in addition to the product ordered. It is also necessary if your business rules allow overrides at the time the order is placed or even a 0 price under certain conditions. If you go with the alternate method of keeping a history of price changes, you can never override a price. You can only use the price from the price table which is much less flexible.

ButtonMoon · Mar 21, 2013

Pat Hartman said:
Storing the unit price in the order details table is not redundant since the price depends on the point in time the order was placed in addition to the product ordered. It is also necessary if your business rules allow overrides at the time the order is placed or even a 0 price under certain conditions. If you go with the alternate method of keeping a history of price changes, you can never override a price. You can only use the price from the price table which is much less flexible.

My point was that there are other suitable designs that would reduce the degree of redundancy while still meeting the same requirement. The design I proposed satisfies 5th Normal Form but it could be further decomposed to satisfy 6NF if desired. Complete elimination of redundancy is probably unachievable for most non-trivial database designs. Even fully defining what we mean by redundancy is pretty hard (classical database normalization theory of course characterises only certain very specific types of redundancy).

In any case, I think we agree that this is an example where a calculated value probably should be stored and that it can and should be done without violating 5th Normal Form.

Pat Hartman · Mar 22, 2013

I don't think the OP cares about 6th normal form or even 5th. He seems to have no interest in this discussion at all.

emorris1000 · Mar 25, 2013

That's not true, I'm following it. I'm just coming from a very different attitude of what I mean by "calculated". I normally don't have to deal with simple summation, the closest inolved a pretty complex integration.

For this one set I had to integrate a time-dependant data series. If the time segments were of equal length this would have been a simple summation, but because they were not it ended up requiring a fairly complex series of sub-queries (I have the query saved somewhere...).

Anyways, in this case there was no good reason the underlying data would ever change. It was historical data from an experiment. In fact, you would consider it a major violation of the integrity of the database to change historical data in this case. Anyways, it wasn't realistic to integrate all these datasets ahead of time if you ever wanted to use that integration as a parameter in a search. You HAD to pre-calculate it.

gemma-the-husky · Mar 25, 2013

well, definitely worth saving the result of a complex calculation, then.

Calculated fields, Dynamically calculate vs Pre-calculate

emorris1000

Registered User.

ButtonMoon

Registered User.

Pat Hartman

Super Moderator

ButtonMoon

Registered User.

gemma-the-husky

Super Moderator

Pat Hartman

Super Moderator

ButtonMoon

Registered User.

Pat Hartman

Super Moderator

ButtonMoon

Registered User.

gemma-the-husky

Super Moderator

ButtonMoon

Registered User.

Pat Hartman

Super Moderator

ButtonMoon

Registered User.

Pat Hartman

Super Moderator

emorris1000

Registered User.

gemma-the-husky

Super Moderator

Similar threads

Users who are viewing this thread