How to link time-dependent data tables?

Alex E.

While much of the discussion here is out of my league, this may just be the place to finally get some help.

I have a hobby weather station and I am also into gardening. Over the years I have accumulated a fair amount of data/observations, which are currently in assorted Excel spreadsheets and are becoming more and more difficult to maintain.

So, I decided to move the various Excel tables into MS Access. While I was able to draft an entity-relationship diagram for my horticultural data, I am at a total loss when it comes to weather/climate data, which are primarily time-dependent. In other words, I cannot find any references that explain how to design pairs of primary and foreign keys for linking datasheets that contain time-varying data such as various weather and climate readings.

I have ordered dozens of books through the local library. Every book presents exactly the same example: customers and orders/sales. This example is fine when you have a myriad of interdependencies in your data (and it proved useful for my horticultural data). But my weather data are more or less independent of one another; the only thing they have in common is the time/date when I make the observations.

Do I have to create dozens of surrogate keys and add a ton of redundant data? I hope not!
 
If you are keeping the weather data in multiple tables because you kept it in multiple spreadsheets, now is the time to normalize. You should have one table that contains observations for as many years as you care to keep. At some point you may need to archive data and so you would then create a procedure to remove old data from the observation table and delete it or copy it to a history table.
 
I wish it were that straightforward. Currently, I have no redundancies other than the date; there is nothing really to normalize unless I wanted to throw out some raw data. The Access datasheet would have to be some 65 columns/fields wide.

The crux, again, is in the time dimension. While the date itself can easily be transformed into a Modified Julian Day Number, some of my records are snapshots (morning and evening readings), others run from ... to ..., and I also have records taken at regular hourly intervals (every hour). The people working with Oracle apparently did a considerable amount of work to come up with some 5-6 standard time dimensions.

During my wanderings I also noticed that there is a so-called Long Term Ecological Research Network. I would probably qualify as a category 2/3 station by their terms. Everything in their database structure seems to be called a "variable". I picked up a few bits and pieces from there, but overall it is far beyond my level of understanding. Typing LTER into Google brings up pages of references.
 
In this case, normalization has TWO meanings.

There is the typical database meaning. You cannot establish a relationship within a badly normalized (or de-normalized) database. Big-time headaches.

There is also the mathematical meaning of normalization. You talk about the time dimension and suggest that you have some specific granularity, such as morning, evening - or perhaps day-quarter.

It is perfectly possible to build relationships between two tables based on times even when the times aren't equal. But the only way to make it work is to normalize the numbers to a scale other than hh:mm:ss.

For instance, say you never took two readings in the same six-hour period. You could define a unit such as day-quarter that would divide into morning, afternoon, evening, and graveyard shift. Remember your original time of sampling but also generate a day-segment code and match up records based on being in the same day-segment. Is this making sense? Because this is what I mean when I talk about MATH normalization.
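A minimal sketch of that idea in Access SQL, assuming a hypothetical Observations table with a date/time field ObsTime (the names are illustrative, not from the thread):

SELECT ObsTime,
       DateValue(ObsTime) AS ObsDay,
       DatePart("h", ObsTime) \ 6 AS DaySegment
FROM Observations;

DaySegment comes out as 0 (graveyard), 1 (morning), 2 (afternoon) or 3 (evening); join two such queries on ObsDay and DaySegment to match records taken in the same six-hour block.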
 
I certainly can compress the date and time fields into a Modified Julian Date Number: “54,101.625” stands for “Monday, January 1, 2007, 3 p.m.” But then what?
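For what it's worth, in Access the conversion is just a fixed offset, because the Access/VBA day zero (30 Dec 1899) is MJD 15018. A tiny VBA helper (the function name is made up) would be:

Public Function ToMJD(d As Date) As Double
    ' Modified Julian Date = Access/VBA date serial + 15018
    ' (Access day 0 = 30 Dec 1899 00:00 = MJD 15018)
    ToMJD = CDbl(d) + 15018
End Function

' e.g. ToMJD(#1/1/2007 3:00:00 PM#) returns 54101.625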

Below are 4 sample tables:
Table A has two records per day/date (morning and evening observations).
Table B has 24 records per day/date; one record for every hour.
Table C has many records or none. It may snow, rain, etc. all on the same day, or it may be a fine day, i.e. nothing to record. It usually rains from ... to ..., so here we have bounded time periods; when it rains, it rains, and not for a fixed length of time.
Table D has one date and several point-time fields.

One cannot link the sunrise of table D to rain (table C) to wind speed (table B) to soil temperature (table A), or any other impossible combination one could invent. So, where do I get my pairs of primary and foreign keys from?

The datasheets are quite simple. However, the computations and queries I want to do are more complex and will require a well established relationship structure.

Example 1: Every farmer looks up "growing degree days" from the local weather station for pest management and other decisions. The problem: genetically modified seeds will grow almost on top of a glacier, but flowers and vegetables are not there yet. So I have to recalculate.
Example 2: I want to compare the "climate" in my cold frames and shade structures with a wide range of weather observations (we have a very short growing season here in Canada and timing is critical).
Example 3: Leaf wetness (misting) and bottom temperature are important when rooting cuttings. They are also very important when it comes to my water/electricity bills. Some experimental charting could save me a few cents.

Table A (date; twice-daily readings): air temp (max, min, current); soil temp at 2", 7", 30"; solar radiation; evapotranspiration
Table B (date; hourly readings): RH; wind direction; wind speed; cloud cover; dew point; ...
Table C (date; bounded time periods and amounts): snow; rain; hail; haze/smoke; ...
Table D (date; point times): sunrise; sunset; twilight start; twilight end
 
The tables are a mess; I will try again as an attachment (Word file: tables A-D).
 


In the formal sense, it will not be possible to JOIN tables that have no key values in common. (Not "key fields" ... key VALUES.) You can, however, do bounded linkages via queries that take data from two tables where you have a reasonable BETWEEN ... AND ... possibility, with the understanding that this is an unpredictable relationship that must be treated like a many-to-one or one-to-many, which can include NONE-to-one as a possibility.
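A sketch of such a bounded linkage in Access SQL, assuming hypothetical tables HourlyObs (ObsTime, WindSpeed) and RainEvents (StartTime, EndTime, Amount):

SELECT r.StartTime, r.EndTime, r.Amount, h.ObsTime, h.WindSpeed
FROM RainEvents AS r, HourlyObs AS h
WHERE h.ObsTime BETWEEN r.StartTime AND r.EndTime;

Each rain event picks up every hourly reading that falls inside its interval; an event with no readings inside it simply drops out of the result, which is the NONE-to-one case mentioned above.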

You can also do DATEPART joins to see if you have records from the same day, same week, same month, same year, same hour ...

The catch is that you need to re-think table design. Why? Because your tables do not currently support what you want to do with them.

Now, I can think of a bazillion ways to do this, but it is going to boil down to some type of data normalization (in the math sense). If you cannot put the data together mathematically, Access will NEVER give you what you want. You could build a query that takes the observation date and tacks on several computed fields based on different DATEPART functions so you could more easily characterize your dates. Then correlate against the query rather than the table. But other answers are possible.
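A rough sketch of that kind of query in Access SQL, again assuming a hypothetical Observations table with a date/time field ObsTime:

SELECT ObsTime,
       DatePart("yyyy", ObsTime) AS ObsYear,
       DatePart("m", ObsTime) AS ObsMonth,
       DatePart("ww", ObsTime) AS ObsWeek,
       DatePart("y", ObsTime) AS DayOfYear,
       DatePart("h", ObsTime) AS ObsHour
FROM Observations;

Save it as a query and join your other tables to it on whichever of the computed fields matches the granularity you need.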

Here is the rule you must remember: If you can't do it on paper you will never do it in Access. (Intended to be read as: Learn to do what you want by hand before you try to program it. Because if you don't understand the math on paper, Access won't understand any better.)
 
Could you give me a literature reference that specifically explains mathematical normalization? I tried a Google search but could not find anything relevant. All these millions of articles talk about is redundant data and some sort of data dependencies, just as every book ever written seems to use the very same example of customers/invoices and merchandise. As I stated in my initial post, database design is not my field of expertise.

I thought of moving my data into MS Access after I had more or less outgrown Excel, where I have to do my calculations at the same level as I keep my data. The many lookups and "paste links" are starting to interfere with each other. It never occurred to me that I could run into such a nasty problem with data as simple as mine. Lots of people make use of MS Access.
 
Here's a very good description of what DB normalization is and it includes examples of both normalized and denormalized data (along with plenty of math).

DB Normalization
 
An example of normalization is the Gaussian or "bell" curve, which says something about data distribution. If you use absolute numbers, the scale is crazy. But if you compute the standard deviation of your data set, you can talk about roughly 68% probability within one standard deviation (one "sigma"), about 99.7% within three sigma, and so on. This is normalization (to the standard deviation).

You normalize dates when you just name the year, or when you just name the month. You normalize times when you talk about AM and PM. The idea with normalization is to compute where an observation falls within a large grouping of units appropriate for that observation. For time, your groupings can be anything from individual seconds to minutes to hours to days to months to whole years. (To centuries, but that is unlikely to be helpful.)

The key is that if you want to correlate two tables across a field that exhibits different precision (for lack of a better term) in those tables, you sometimes have to reduce (normalize) the higher precision table to a lower precision to support the correlation. It's a type of grouping. And you correlate your two tables AFTER grouping the one with the finer granularity to make it have the same approximate basis as the one with coarser granularity.

I feel as though I'm dancing around this without complete clarity, but I don't know how else to explain it.
 
Doc_Man, I honestly appreciate your help. But this is getting far too complex. If I have to use calculus to store simple 2-digit data (it cannot get more basic than my set of observations), then perhaps we are pushing way beyond the limits of what the software can handle.

In retrospect, I did not heed three red flags:

(1) There seems to be a good reason that everyone is regurgitating the same example of customers/sales/merchandise (or whatever word games one wants to play, such as students/classes/grades). It is astonishing that nobody has risen above the crowd and defined A, B, C along with a long list of limitations, limitations of the relational database concept, that is.
(2) Normalization as it is portrayed by the mass media (millions of books and web articles) is a very specialized process, custom-tailored to the narrowly defined ABC scenario above. It makes sense in the context of a hastily thrown-together membership list or similar. But why would I record my instrument readings twice? And those readings are 100% time dependent; there is no need to look for partial dependencies and so forth that might interfere with database maintenance.
(3) The material published about the Long Term Ecological Research Network looks so different. I thought this was because it handles a very large number of reporting weather and climate stations. More likely, the people at LTER are a lot smarter than I am and knew right from the start that relational database techniques are no good for this (and probably for many other applications).

Anyhow, I learned a lesson. Now, I have to go back and make the best of Excel.
 
There are limitations to relational databases when you try to step outside of the set theory that was the basis for creating such databases. I am not fully convinced that you have reached a SET THEORY impasse. But you might have reached a design impasse because you cannot find what you consider to be an acceptable compromise on the timing issues. I don't think set theory has let you down. I think your imagination (and ours) may have done so.

Normalization of databases does indeed have a lot of cut-and-dried data layouts for which relationships are easily defined. But approximate relationships work, too. You are merely delving into the areas of "fuzzy" logic where you start talking, not about exact relationships, but approximate relationships. This, unfortunately, DOES delve into some types of calculus now and then.

However, using the BETWEEN ... AND ... operator, you can still form a "fuzzy" relationship between two tables. You just have to recognize a fact that seems almost impossible - yet it is true. It is possible to have too much precision in a measurement. The "Doc" in my moniker is not a smokescreen. I have a PhD in Analytical Chemistry and can (at least socially) use Doctor as a proper title. You don't see this phenomenon very often, but your data records might fall into the category of too much precision.

The problem with having too much precision is that you have a deep-seated reluctance to lose the data implied by a loss of precision. But if it leads you to make poor decisions in data correlations, it is time to pitch the precision. The real trick is to know how far to trim back what you have.

Which is why I originally suggested data normalization. If you have data with times precise to the second, but you are trying to correlate by the day of the year, you have about 86,400 times the precision required for the intended computation. This is what I mean by "too much precision." Reduce the time to a date and correlate by dates.

If you want AM/PM as your level of detail, reduce your data precision to 12 hour intervals, at least. I can't say this any other way than that your meticulous observations now need to be converted to a useful form. And here you thought they already WERE in such a form.
 
There is an excellent PowerPoint presentation on the time issue. Unfortunately, it "only" analyzes a wide range of issues (developer, be aware of this and that...) but does not provide hints as to what the solution might be.

[PPT] Designing, Building and Using Time Dependent Databases
"The Rules of Time: Data Quality Issues for Time Varying Databases", DAMA National Capital Region, Mar 2002, Dr. Jerry Rosenbaum, ConcentrX, LLC
www.dama-ncr.org/Library/2002-03-12 Rosenbaum- TimeVaryingDBs.ppt

I am aware that I have to be flexible as far as granularity is concerned. But my idea was that this comes into play at the query stage. I would be somewhat reluctant to input averaged data into my database tables. In any engineering or scientific application you want to make a clear distinction between your raw data and subsequent interpretations. How else would you make sure that you are comparing like with like, both in terms of absolute values and units of measurement?

The more I think of my problem in terms of engineering/scientific data versus a handful of business data of some sort, the more I tend to believe that there must be two radically different approaches to database design. Business data (in a broad sense of the word) are by nature messy, with many data overlaps, etc. In contrast, before you start an engineering test or scientific experiment, you spend a long time getting organized. The end result is normally a squeaky clean set of raw data.

My weather/climate data are not an end in themselves. They only become useful in comparison with my horticultural notes. Even with my horticultural data, I had a hard time concocting relationships. I was only able to do so by introducing a dozen surrogate keys and duplicating an awful lot of data, that is, creating a lot of redundant data that did not exist before. There, I was able to utilize the chain of events from seed collection/characteristics to germination results to, eventually, the mature plants/harvest data.

No such luck with the weather/climate data – they are truly independent sets of data. No chain of events to cling to in desperation.

In a previous life I was a geologist (M.Sc.) working on a broad-brush regional scale, looking for trends to outwit the competition. This involved mountains of data. Unfortunately, in the days of the mainframe, computer departments were a kingdom of their own. The results of data requests were served to you like a formal dinner. I would love to know how they managed those geological and geophysical databases.
 
The reason you needed a mainframe was because of the approximations you had to introduce, which take more time to compute when you seek things that are "close" - whatever that word means in your context. And that kind of thing, for large enough arrays/matrices of observations, would even strain the mainframe.

However, I tend to agree that the mind set between business data and engineering data can sometimes be topsy-turvy. In business, you match up numbers - invoices contain part numbers, part numbers lead to prices, invoices contain customer data, customer data leads to sales leads, etc. - and they all are exact-match situations.

Here, you have approximate match situations and you have to look woefully at your beautiful datasets that - at the moment - seem useless. Yet here again I wonder why you are so reluctant to print averages.

When I did my dissertation, I generated literally hundreds of observations and had to reduce them to something useful. I had enough data to show the flaws inherent in using computer-measured data from a natural process as though the results were linear. The more data you included, the less accurate your regression statistics became - because you were trying to compute the best straight line for a curved data set. The less data you included, the less accurate your statistics became - because they were from a smaller data set and therefore deviated from the Gaussian ideals (or if you prefer, the laws of large numbers.) So I really sympathize. The only thing that made me feel better is that I published with correlation coefficients above 99% regularly while my colleagues were floundering in the high 70s to middle 80s.

Reluctant to publish averages? When you have really large data sets, the human mind cannot comprehend the mass of numbers without some statistical treatments - like averages and medians and the like. Your colleagues might WISH they had months or years of continuous averages on anything.

But, it is your data set and your research. Good luck. (Consider the above a "pep talk.")
 
Statistical analysis certainly will require some form of averaging. In most cases, the goal is to extract a mathematical equation/ expression that then can be “universally” applied to solve problems or predict an outcome.

I am more interested in graphing/ plotting and pattern recognition such as seasonality. I am working with (among others): floating bar charts, seasonal sub-series plots or column combination charts. Sometimes, it seems useful to superimpose one or two line graphs in addition.

Floating bar charts or sub-series plots are so simple, yet they afford an amazing insight into temperature fluctuations on a daily basis and at the same time reveal longer-term trends. In addition, they answer a lot of other questions faster than any computer will. For instance, a quick glance tells you instantly the length and time frame of the frost-free period.

There are several formulas for computing growing degree days. A commonly used formula works with averages, but I prefer the Single Sine formula. The daily values are meaningless; you have to plot cumulative curves, and that is where the Single Sine formula proves its worth.
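For reference, a minimal VBA sketch of the standard single sine calculation with a lower threshold only (no upper cutoff); the function name, arguments and the arcsine workaround are illustrative, not from this thread:

Public Function SingleSineDD(Tmin As Double, Tmax As Double, TLower As Double) As Double
    ' Degree-days for one day from the daily min/max temperature,
    ' single sine method, lower threshold only (no upper cutoff).
    Const PI As Double = 3.14159265358979
    Dim M As Double, A As Double, theta As Double
    M = (Tmax + Tmin) / 2    ' mean of the fitted sine curve
    A = (Tmax - Tmin) / 2    ' amplitude of the fitted sine curve
    If Tmax <= TLower Then
        SingleSineDD = 0              ' curve never rises above the threshold
    ElseIf Tmin >= TLower Then
        SingleSineDD = M - TLower     ' curve is entirely above the threshold
    Else
        ' curve crosses the threshold: integrate only the part above it
        ' (Atn expression below is arcsin((TLower - M) / A), since VBA has no Asin)
        theta = Atn((TLower - M) / Sqr(A * A - (TLower - M) * (TLower - M)))
        SingleSineDD = ((M - TLower) * (PI / 2 - theta) + A * Cos(theta)) / PI
    End If
End Function

The daily values from a function like this would then be accumulated (a running sum in a query or report) to get the cumulative curve.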

I need the original data for these kinds of applications.
 
Functions/formulae for calculating single sine degree-days

Isn't this ironic... I'm looking to calculate degree-days within Access via the single sine method with a horizontal cutoff, where I can simply specify different thresholds as parameters.

But being an Access newbie I have NO idea whatsoever whether this is possible, and I can't seem to find any formulae or logic to do it with. Any suggestions?
 
Don't forget that you are NEVER throwing away your data. You might have to use a query to adjust it to a normalized time scale, but the underlying data would still be there.

You can do a JOIN between queries where the queries do the normalization for you.

Suppose, for example, that you have twice-daily humidity measurements - 9 AM, 3 PM - and four-times-per-day temperature measurements - 6 AM, 11 AM, 1 PM, 5 PM. (Just as an example...)

At first glance, they have little in common. Still, you can work with this if you wish. You can choose two or three ways to JOIN these data sets. This might give you an insight into where I'm going...

In all three of the cases, build two queries, one for the humidity and one for the temperature. Then join the two.

Case 1 - compute the time as falling into one of four day-quarters and assign this as your "match" field. A day-quarter can be computed by taking the date field (which is stored as a DOUBLE) and computing CDbl(Int(SampleDate * 4)) / 4, or you can use Fix(SampleDate * 4) as the match value and retain the actual date alongside it. Do this for both tables. What you will get is that each humidity entry would match twice.

Case 2. Same as above but take the temperature average for all entries having the same day-quarter. Then join the humidity to the aggregate query that took the averages. Again, join on day-quarter.

Case 3. Convert to simple dates, take the average of the humidity for that date and the average of the temperature for that date. Never mind that both are averages and start from different numbers of samples. Join them on their day-numbers.

In all three cases, you get a JOIN query that lists time, temperature, and humidity.
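As a rough sketch in Access SQL (the table and field names are made up: Humidity with ReadTime and RH, Temperature with ReadTime and Temp), Cases 1 and 2 might come out like this.

First saved query, the humidity tagged with its day-quarter (call it qryHumidityQ):

SELECT ReadTime, RH, Fix(CDbl(ReadTime) * 4) AS DayQuarter
FROM Humidity;

Second saved query for Case 2, the average temperature per day-quarter (call it qryTempQ):

SELECT Fix(CDbl(ReadTime) * 4) AS DayQuarter, Avg(Temp) AS AvgTemp
FROM Temperature
GROUP BY Fix(CDbl(ReadTime) * 4);

And the join of the two:

SELECT h.ReadTime, h.RH, t.AvgTemp
FROM qryHumidityQ AS h INNER JOIN qryTempQ AS t
ON h.DayQuarter = t.DayQuarter;

For Case 1, drop the Avg and GROUP BY from the second query (keep ReadTime, Temp and the DayQuarter expression) and you get every pairing of readings within the same quarter instead of one averaged row per quarter.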

OK, this is contrived, but it is an example of what I mean about the MATH sense of normalization. And you will note that you didn't throw away any data. You can take the same exact raw tables and add a third table, incident light (perhaps as measured by a photocell and voltmeter apparatus), and do the correlations the same way.

Suppose you like the idea but want to break it up into finer granularity. Just remember that dates are DOUBLE sized day counters with fractions to represent time of day. You can take the integer of a date and come up with something tractable even if there is a scaling factor.

Suppose, just for giggles, you wanted day-tenths. Take the date, multiply it by 10, and make it an integer. That would be your day-tenth field, and because it is an integer, it is ripe for joining. The only issue to watch is to use LONG integers for these fields. The current date serial is up around 39K by now (that's a day number); scale that up and you quickly overflow a 16-bit WORD integer.
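The same trick at day-tenth granularity, again with made-up table and field names, would be something like:

SELECT ReadTime, CLng(Fix(CDbl(ReadTime) * 10)) AS DayTenth
FROM Observations;

CLng keeps the result in a Long, which easily holds the 390,000-plus values that today's scaled date serials produce.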
 
http://beaumont.tamu.edu/eLibrary/Publications/Ted_Wilson/LTW31.pdf
formulas galore; this paper goes into great depth


http://biomet.ucdavis.edu/DegreeDays/DDhand.htm
a "hybrid" approach; it provides a lookup table of constants that you can use together with the basic formula. A pretty close approximation of the single sine formula.


http://www.ipm.ucdavis.edu/WEATHER/ddconcepts.html
compares different methods, some nice graphs

http://biomet.ucdavis.edu/DegreeDays/DDsine.htm
has a zip file you can download; seems very elementary
 
woohoo single sine

Hey Alex,

Thanks for the heads up; the PDF by Zalom et al. has exactly what I'm looking for: formulae for doing the calculations.

However, my next major stumbling block is what to use for the lower threshold values when calculating degree-days below two selected arbitrary values (which in this case are 0 and 18 degrees Celsius). Or do the lower threshold values even matter?
 
Hi, coho66. I think the thresholds really matter if you are using them to indicate significant conditions in your data sets. I can see using 0 as one of the cross-overs. The other "natural" cross-over is -40, which is the same in both degrees C and degrees F, but that's colder than my blood allows me to go. If there is some other "natural" breakpoint related to a physical process, use it. But use it consistently, or you run into the same problem with temperature that the discussion above had with times.

Just an opinion, you understand.
 
