Best way to manage time-dependent calculated data?

Banana · Jun 22, 2007

Hi,

I've searched the forums, but haven't found something that would address the efficiency to normalization (?) ratio of storing something that is ultimately calculated.

My users want to be able to categorize clients' income according to guidelines provided by government. Now those guidelines usually are updated on an annual basis, but my users want to keep record of whatever category their clients fell in for any given year.

I thought of two possible solutions:

1) I can create a time stamp for every time they update a client's income data (mind you, the income data will be a many side table so we can maintain a history), keep a table listing guidelines' category with effective data and expiry date, then run a query In Between to output the appropriate category assigned for this particular record. No need to store any calculated data, but extra overhead

2) Create a field in income data table to save the category assigned, since there can be one and only one category assigned in a given period of time for any given income data record. Just need to run a query lookup only for current guidelines and assign category only once when the record is created, much less overhead.

Thought I'd go and ask you guys on what you would suggest, or maybe have a better solution?

The_Doc_Man · Jun 23, 2007

address the efficiency to normalization (?) ratio

I understand what you are asking as I am a pragmatist. A purist would laugh before they blow you off. But in this case I will politely suggest that the ratio must be all the way towards normalization because of something else you said.

Now those guidelines usually are updated on an annual basis, but my users want to keep record of whatever category their clients fell in for any given year.

The Old Programmer's Rule says: Access won't tell you anything you didn't tell it first. So... If you're gonna wanna make time-based queries, you have to track time-based data. I.e. date and category have to both be present. To save space, though, you need two tables.

Person - with a prime key to identify each person uniquely

Person_income_history - where the person table's prime key is a foreign key and where the date and income are present. There are arguments about whether you actually would need a prime key on a child table - but if you did, it would be the compound of person ID and date.

This would be fully and properly normalized as long as the income category and date NEVER appears in the Person table, only in the history table.

So your "ratio" question is, in effect, trumped by your data requirement.

Banana · Jun 26, 2007

The_Doc_Man, thanks for taking time to reply.

After thinking about why i asked this in first place; I thought I should ask few more tangential questions to help me understand how I should balance the need to keep database running efficiently as possible.

When I said efficiency to normalization ratio, I think I was thinking more about overhead, not normalization. As implied, calculated fields adds to overhead, but IIRC, it's preferable to waste a instruction cycle than a byte of memory, and as you pointed out, my data requirement does have a need for a normalized format.

However, I'm still wondering what situations where calculated fields should be stored to save on overhead?

Furthermore, here is a practical example why I'm so concerned with overhead and efficiency, though this has nothing to do with calculated data.

In my database, we store records of service we provide. We may provide case management and/or workshops. IINM, you're a federal employee, so you probably know something about how government are about keeping records of everything. The database exists to help us satisfy our state government's contract.

Now, with case management, we need to store a client's address(es), and that is done with a many side table. Simple as pie! Nothing to see, keeping on moving, everybody.

But for workshops, we are required to distribute sign up sheets which asks attendees to disclose the county they live in. And that's where everything get tricky.

To keep everything normalized, the data about one's residence, including the county is kept in address. Therefore, an attendee has an essentially blank address record except for county field.

But on forms, I need to implement several workarounds to make the county field act as if it's part of same table without confusing my users, which is just more work (and more opportunities for mistakes) than if I had just had put the county data in the table about people themselves rather than the address table, violating the normalization rules.

In fact, one of my form has a subform which only has one control (County) with properties set to make it appear "invisble" to the user, but still has the quirk of having to tab twice before the cursor appears in the control.

Then there's the problem about should attendees decide to become our client. I need to implement a check to catch blank record and use it instead of adding another record which would needlessly duplicate the county data.

So right now, when I'm building income tables, I'm kind of wondering whether I will run in similar problems if I decide to keep everything strictly normalized. This is where I'm just not 100% sure.

Maybe you can enlighten me?

The_Doc_Man · Jun 26, 2007

Banana, I'll give it a rip.

Pound for pound, every time you normalize a table, you save space. How? Because if you needed to store several items in the master record or one pointer to an item, AND that item has significant potential of multiple references, then you save the difference in space between the pointer and the set of all item descriptive data for the second and subsequent references to the same item.

But that raises the question: How often does an item get repeated in order to make the lookup worth the effort? See, that's where YOU come in. I don't know your data set. You do. If you split items into main field plus lookup field, but it turns out that the lookup field has the same cardinality as its parent table, then you might as well not have performed the split.

CARDINALITY: n - Expected average number of records returned by a query with a single non-wild-card value in its selection criteria. For unique keys, cardinality is 1. For yes/no cases, cardinality of an N-record table is N/2 (approximately).

OK, past that point, what else does normalization get you? Where the details ARE repeated, the table containing the pointer is smaller. Smaller tables are faster to manage if you aren't pulling in the related record at the same time. Has to do with number of records that fit in a disk buffer. Shorter records = more records per buffer.

Your question: When does it make sense to store a calculated field?

Case 1: Where you have something that is not normally captured, it might be a good idea to store a computation. For example, a store with a "Our prices are lower or else we will beat anyone else's advertised price by 10%" offer. They offer ordinary discounts - but on the day that their competitor decides to do them dirt, they have to compute a SPOT discount - and STORE it - because they don't have that particular discount in their discount tables. So they store discount code "Beat competitor's price by 10%" and then enter the price. The COMPUTER stores the computed discount which cannot be recovered from their own records because the discount rate wasn't their own discount.

Case 2: Inventory system where the quantity on hand is the sum of all positive and negative transactions on a given item... but the database is getting big and bloated so it is time to archive. You pick a date, compute the sum of all transactions up to that date, and create a record with a computed stock level as of that moment. Then you archive all previous records - and you now have used a computed field to show stock level as of the date of archiving.

I need to implement several workarounds to make the county field act as if it's part of same table without confusing my users

Forms don't usually know or care whether your recordsource is a query or a table. Use queries that include JOINs to do translations. Then it LOOKS like the translation is a true part of the dataset when in fact it is there only due to the presence of a query.

As to government storage of records, you want those records stored in the most economical terms, which usually still implies normalization. I can tell you that the US Dept. of Defense has a 30-year archives rule and we are doing our best to live with it. Sometimes you do take what appears to be a radical shortcut to meet government requirements. But it is still best to retain your data in normalized form for as long as you can.

Strict normalization can be a real pain in the patootie. I'll let you in on a little secret. For Access, if you can get to strict 3rd normal form, you are ahead of the game. You COULD get to 4th or 5th normal forms - they do exist - and there are specialized normalizations as well. But in general, stop at 3rd and be content that you've done a pretty good job.

Question regarding speed vs. space... You don't care about speed in most computers these days. They have FAR outstripped the disks. Let's say that you are running a 1 GHz computer. (Yeah, I know... yesterday's news...) It's an Intel box for which instructions take 1-3 clock cycles. So say 2 cycles average. (That's wrong, but use the number for argument's sake.) So 2 cycles/instruction on a 1 GHz machine is 0.5 Giga-instructions per second = 500 MILLION INSTRUCTIONS. A disk seek takes maybe 5-15 milliseconds because a mechanical part is involved. So... say 10 msec. Do the math and you find that one seek is worth 5 million instructions. Even in a high-level emulated language, that a LOT of lines of code between seeks. So if you have to compute something on the fly, do it. It will take you a LONG time to break even between storing something and recomputing it each time.

That byte you save by recomputing adds up slowly. If it is the ONLY difference between the computed record and just storing the value once, it might not matter. When the computed value is four or eight bytes long, that adds up more rapidly.

This is kind of rambling. I'll let you digest this and if you have any other "tangential" questions, ask them. Also, if I missed answering a question or if my answer seemed to be strangely unresponsive or unrelated, chalk it up to digestive languor. Ask again, perhaps being more direct in what you wanted to know.

Banana · Jun 26, 2007

You're not rambling at all, this is very great answer and definitely reassuring.

As for forms/query and joins comment- Believe me, I've already done that, but it doesn't work that way. Remember I mentioned that the address table is a many side, and joining the child table to parent table will render the query non-updateable and consequently totally useless for my forms. This is true regardless how I join it, inner, outer, whatever. Hence, the one control subform.

I do believe you're right about needing to have 3rd normal form to be ahead of the game. Yet, the biggest problem I've consistently run into is to generate a updateable query which I can base my forms to, and this is essentially true for where I may need to show two siblings table and a parent table on a form (a good example would be Address table and Income table related to a parent) which necessitates two separate subforms, even though if my user only needs to input data for all table only once and this messes with UI.

Or did that indicate something is awry?

The_Doc_Man · Jun 26, 2007

Or did that indicate something is awry?

Possibly. You tell me. It's your data set.

more work (and more opportunities for mistakes) than if I had just had put the county data in the table about people themselves rather than the address table, violating the normalization rules.

? It depends as to whether this violates rules based on whether you track the person over time or whether you don't care. This might be one of those cases where the country code COULD be stored as an attribute of the person if you are not trying to track movement.

Detailed addressing is in the same boat. It makes sense to put address in another table when (a) you expect to re-use the same address for multiple people or (b) you are tracking the person's movement over time. In the latter case, you would have a table with key field, address data, and a date field on which the person started using that address. If neither situation applies (re-use or time-tracking) then address is just a current attribute of the person and can reasonably (for normalization purposes) be in the Person table.

See, the toughest part of deciding how to normalize often isn't what the data IS but how you will use it. If address is a "ho-hum, that's nice" attribute, you won't raise even a purist's hackles by including it as a person attribute. After all, everybody's got to be somewhere... position's an attribute isn't it?

In the final analysis, if your query on this form is not updateable, you have run yourself into the ground. Unless it is a SUMMATION query of some type, you should never have an ambiguity as to what record is intended - which is one way that queries become non-updateable.

So is something wrong? I think maybe there is definitely a non-zero probability that something could possibly be wrong. {Said in my best Pinnochio voice from "Shrek III"}.

Banana · Jun 26, 2007

In my case, tracking over time would be nice, but not strictly necessary to the design.

However, the driving force behind the reasoning is government who insist that we have provisions for our clients to fill in both their physical address and mailing address. Since repeating fields violate normalization, it was reasonable to create a many side table containing the address.

Government also wants county included with the address. However, they also want county of our attendees for our workshops which is derived from sign up. There, we don't ask for their full addresses; just county.

Furthermore, we do need to be able to hold several records of how to reach a given individual, and for that reason, contact information is once again a many side table.

So for clients form, user has two sibling subforms; one for addresses, and one for contact information, of which they can add unlimited amounts for any one client.

Whereas on attendee form, for sake of keeping it simple, we really only need county and one record of contact information. However, I cannot create a updateable query with siblings as Pat told me some time ago, since Access in general cannot understand relationships between two groups of data if they're only related together through another table and is not arranged in hierarchial fashion (which would be sibling tables; they're both related to a common parent table, but are not directly related to each other).

neileg · Jun 26, 2007

I'm a pragmatist, too. Unless there are overwhelming reasons for using a purist approach, I always store time dependant data. So, as an example, if I was storing invoice data, I would store the price of the items on the invoice because the effort require to track and extract prices on a date basis, isn't worth it.
However, I would not want to store the customer address if I had lots of invoices for each customer. That means I'd have to store the customer ID. Since the customer address may change over time, I'd have to deal with that change. What I could do is keep track of the changes by date and look up the relevant address. But since this involves keeping a series of records for the customer anyway, I'd probably just store the PK of the address table that was relevant at the time of raising the invoice.

I have no problem with the concept of normalisation, I just don't let it make life more difficult than it needs to be.

The_Doc_Man · Jun 26, 2007

I'm beginning to smell the nature of the problem.

you have person Joe Schmidlap. You have a mailing address, a business address, a residence, a summer home, a post offic box, etc. etc. So you hae a one (person) to many (address) relationship. BUT... you can make this query updateable again IF you include codes that specify order of use IN THE ADDRESS TABLE for each address. Then pick the item with the lowest order number for that person and your query should then be updateable.

OR make the order variable a code variable that says, code 1 = residence, code 2 = business, etc...

If you are updating business addresses, only select the business codes and your are good to go. If you are doing residences, only use the residence code, etc.

One or the other of those SHOULD suffice to make the JOIN query result in a single record identification. And that would be updateable again.

Banana · Jun 26, 2007

Now, that would be totally cool!

I'm going to give it a whirl eventually as I progress with the project. I'm so glad we had this talk!

Which brings us back to the normalization; up to this date, this sort of thing has been a big pebble in my shoe so to speak- everyone tells me normalization will make my database much easier to maintain, and I do believe that, but when I can't generate an updateable query even if I've double (and triple!) checked if everything is properly normalized. In fact, my personal rule has been "When in doubt about attributes' relationship to a key, make a new table!", and I recall myself insisting on strict normalization of all of my tables when I re-developed the database. Now, I'm getting nonupdateable queries there and there which consequently requires me to implement a half-assed workaround, and that's when I start having doubt and wonder if there is such thing as over-normalization.

Edit: I just realized; when you posted a solution for getting a join query to yield a updateable recordset between a one and many table, how does that work with a one-side table joined to two many side tables?

The_Doc_Man · Jun 26, 2007

The issue I mentioned earlier about to split tables or NOT to split tables, aye, that is the question. Whether 'tis nobler to endure the slings and arrows of outrageous joins.... but I digress.

Back to reality: There is such a thing as overdoing it. When an item is not going to be re-used and has no time dependency, normalization rules of ANY sort will not require it to be in another table. So if you split out fields just because you weren't sure, you might wish to remerge them.

The issue of non-updatable tables is ALWAYS because Access cannot find its way back to a unique record. Obviously, queries involving SQL aggregates will have this problem. Less obviously, but it is still so, if you create a parent that could link to any of several child records, Access can display it - but will be more exacting about updating it through the form because of uniqueness issues. And I believe that because you don't have an address discriminator code (to show type of address), your records appear to have non-unique keys. Make the discriminator part of a compound prime key and you might find yourself in a better position.

Banana · Aug 19, 2007

The_Doc_Man said:
The issue I mentioned earlier about to split tables or NOT to split tables, aye, that is the question. Whether 'tis nobler to endure the slings and arrows of outrageous joins.... but I digress.

Back to reality: There is such a thing as overdoing it. When an item is not going to be re-used and has no time dependency, normalization rules of ANY sort will not require it to be in another table. So if you split out fields just because you weren't sure, you might wish to remerge them.

The issue of non-updatable tables is ALWAYS because Access cannot find its way back to a unique record. Obviously, queries involving SQL aggregates will have this problem. Less obviously, but it is still so, if you create a parent that could link to any of several child records, Access can display it - but will be more exacting about updating it through the form because of uniqueness issues. And I believe that because you don't have an address discriminator code (to show type of address), your records appear to have non-unique keys. Make the discriminator part of a compound prime key and you might find yourself in a better position.

So I got around to tinkering with the proposal. It doesn't seem to make the query updatable even though I have made a compound prime key for the tblAddress.

The data structure is:

tblPerson
* PersonID
Some personal data

tblAddress
* AddressTypeID (FK to a lookup table)
* PersonID
Some address data

tblVisit
* VisitID
PersonID
Some data about visit itself.

* = Primary Key

Relationship:

One Person to Many Address
One Person to Many Visit

The SQL I have so far:

SELECT tblAddress.PersonID, tblPerson.CategoryID, tblPerson.EthnicityID, tblPerson.GenderID, tblAddress.AddressTypeID, tblAddress.CountyID, tblAddress.ZIPCodeID, tblVisit.VisitID, tblVisit.RoleID, tblVisit.VisitTypeID
FROM (tblPerson INNER JOIN tblAddress ON tblPerson.PersonID = tblAddress.PersonID) INNER JOIN tblVisit ON tblPerson.PersonID = tblVisit.PersonID
WHERE (((tblAddress.AddressTypeID)=7) AND ((tblAddress.ZIPCodeID)=-1));

Mind you, if I remove the tblVisit from that SQL, the query is 100% updateable.

Any suggestions?

The_Doc_Man · Aug 19, 2007

Your question contains at least part of the answer. What is it about the tblVisit relation that appears to be ambiguous with respect to the table(s) you are trying to update?

See, you said if you remove tblVisit it updates. If you put it in it does not. This quite narrowly points to the relationship that brings tblVisit into the fold. Draw a diagram to see what you are trying to relate. I'll lay odds that a discriminator field related to tblVisit is somehow not properly selected such that Access would return more than one visit for this SELECT field. In which case, you have to select which record you meant to update. Of course, if there is no discriminator in the SELECT statement, Access cannot synthesize the impled UPDATE query needed to make the change through the SELECT.

Banana · Aug 19, 2007

Ahhh, I think I understand....

It's not enough just to discriminate the address table; I'd need to discriminate Visit as well, so that for any given personID, there will be *exactly* one Address record and *exactly* one Visit record.

Too bad; as I actually do need to have many Visit records for the form, which can possibly contain more than one instance of PersonID.

I guess the only solution then is an invisible subform. (Because my users expect to be looking at one instance of visit and they don't need to actually know that address table is involved in reporting the data about county.

The_Doc_Man · Aug 19, 2007

Yep, in order to be updateable, a query must be unambiguous as to which record is being updated. If tblVisit is many-to-one but you didn't specify which of the many you were touching, Access cannot guess for you.

Banana · Sep 18, 2007

The_Doc_Man said:
I'm beginning to smell the nature of the problem.

you have person Joe Schmidlap. You have a mailing address, a business address, a residence, a summer home, a post offic box, etc. etc. So you hae a one (person) to many (address) relationship. BUT... you can make this query updateable again IF you include codes that specify order of use IN THE ADDRESS TABLE for each address. Then pick the item with the lowest order number for that person and your query should then be updateable.

OR make the order variable a code variable that says, code 1 = residence, code 2 = business, etc...

If you are updating business addresses, only select the business codes and your are good to go. If you are doing residences, only use the residence code, etc.

One or the other of those SHOULD suffice to make the JOIN query result in a single record identification. And that would be updateable again.

*bump*

I tried the idea of making my tables traceable using the code and making it a compound key along with the PersonID, but that didn't make the query updateable even though I've specified what code to use and the only variable is the PersonID.

Am I missing something?

The_Doc_Man · Sep 19, 2007

Clearly, but I cannot tell what. This might be a time when you need the whiteboard approach. Draw all of the tables involved in the query on the board. Populate the board with sticky notes from an actual set of records based on one that you cannot update. (Do ad-hoc queries to get this info.)

Now look at the query and manually execute it to determine what records are involved in the JOINS through all layers thereof. Draw out the process of finding the contributors to the record in the top-layered query. For the query to not be updateable, one of these following must be true:

1. One or more tables is read-only. (Could happen if linked from an external source.)

2. One field in the query is computed. (Or one field in a sub-layer of the query is computed.)

3. One field in the query is ambiguous - i.e. can refer to either of two records in one of the underlying separate tables feeding the query AND you don't have a WHERE clause or JOIN clause to specify the required elementary component. Might also be caused by not using the PK of one of the tables to specify the contributing record in question.

4. You defined the query to be read-only. (You're advanced enough that I'm betting against this one.) OR you are using a layered query approach and one of the underlying layers is read-only. (That might have gotten past you.)

5. Access doesn't like you. Have you been saying nasty things about Bill Gates in public near any computers that were turned on at the time? Think HAL 9000 ...

Banana · Sep 19, 2007

Okay, I whipped up a quick sample database to demostrate that even if I did what you suggested (and assuming I understood your solution), it still isn't updateable when I want to use the query.

There's just four tables, and an query. The query is supposed to be a recordsource for a form, and obviously has to be updateable if it's going to be of any use as a form recordsource.

I remember that Pat said that once you have two "sibling" tables, Access no longer is able to determine which record belong to what daddy, and even though I've made a compound primary key and made a criteria to filter what record to return, it's still non-updateable when both Visit and Address table are included in query. Remove either one of them, and query can be updated.

As for badmouthing Bill, I'm quite sure I'm on blacklist. I use and prefer Macintosh. It's only because my workplace uses PC, I have to use it.

Best way to manage time-dependent calculated data?

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

neileg

AWF VIP

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

The_Doc_Man

Immoderate Moderator

Banana

split with a cherry atop.

Attachments

Similar threads

Users who are viewing this thread