Optimize query

wvanjaarsveld

I have a query in MS Access. (It is used to simulate inventories under (s,q) policies; not really important.)
It runs against a big table (1,000,000 entries) with data on ship dates and quantities of about 70,000 parts. To the four existing columns it adds one extra column: the usage in the leadtime prior to the ship date of the specific part number. It works, but it is too slow. I can only run it within 10 minutes if I leave a number of parts out of the table simulatie. Does anyone have an idea how to optimize it, so that I can include all parts?
sql:
SELECT simulatie.produkt_nr, simulatie.[date], simulatie.customer, simulatie.usage, Sum(simulatie_1.usage) AS usagetotal
FROM simulatie AS simulatie_1 INNER JOIN simulatie ON simulatie_1.produkt_nr = simulatie.produkt_nr
WHERE simulatie_1.[date] Between simulatie.[date]-simulatie.leadtime And simulatie.[date]
GROUP BY simulatie.produkt_nr, simulatie.[date], simulatie.customer, simulatie.usage;

Note that the query works only on a single table.

Any help would be appreciated!

P.S. Version: MS Access 2002
 
I can only suggest a few thoughts.

Whenever a JOIN is involved, even a self-join, you MUST have a good key on the joining entities. You probably do, but it is something I shouldn't omit as the starting point for query optimization. Also, when using order by or group by, you do better when there is an index on each field participating in the sort. You have four group-by cases. Four chances for poor performance.
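For example, if produkt_nr isn't already indexed, something along these lines should help the join (the index names are just my own suggestions, and the Access SQL window runs only one statement at a time, so run them separately):

CREATE INDEX idxProduktNr ON simulatie (produkt_nr);
CREATE INDEX idxProduktDate ON simulatie (produkt_nr, [date]);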

I'm not going to swear this is causing the speed trouble, but you use DATE as a field name. That is generally not a good idea, because Date is a reserved word (it is the name of a built-in Access function). It is only a thought in passing; your code is probably not affected much by it, but it is an issue that should not be left alone forever.
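If renaming the field isn't an option right now, you can at least bracket it everywhere so Access cannot confuse it with the function. A minimal illustration (ship_date is just an alias I made up):

SELECT simulatie.[date] AS ship_date FROM simulatie;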

I also suspect the WHERE clause is not as tightly constrained as you think; it feels like it isn't restricting the join properly. Never mind that the computer can add better than you can: look at your result with the old naked eyeball, even for just the limited set you use for testing. Does the sum make sense when you look at it, or is it a LOT larger than you might have guessed? Sanity-check your results.

I'm also not going to swear to it, but this might be a case where a simple, non-self-joined query using the DSum function to generate your sums would work better. When you do that join, Access passes through that table a LOT more than twice. I am having trouble wrapping my head around just how many iterations through the table it would take to do what you are doing correctly. I keep getting an overflow number.
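As a rough, untested sketch of what I mean (this assumes produkt_nr is numeric and leadtime is a whole number of days; if produkt_nr is text the criteria string needs quotes around the value, and the Format calls are there to force an unambiguous date literal):

SELECT s.produkt_nr, s.[date], s.customer, s.usage,
DSum("[usage]", "simulatie",
     "produkt_nr=" & s.produkt_nr
     & " AND [date] Between #" & Format(s.[date]-s.leadtime, "mm\/dd\/yyyy")
     & "# AND #" & Format(s.[date], "mm\/dd\/yyyy") & "#") AS usagetotal
FROM simulatie AS s;

Be aware that DSum still runs once per row, so it is not free either; it just avoids building the whole self-join.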

After a deeper look: You are also doing way too many GROUP BY elements. You have four grouping fields, which forces a four-level sort, one on each field. If even ONE of those elements doesn't have an index, you are doing a relation scan on literally thousands of records for each record in the table.

Besides, considering normalization criteria, you cannot have a table with this structure and still have the GROUP BY on the usage entry mean anything. Not only that, but I question what your primary key must be. Here's my thinking:

If this is a normal inventory-like table, you have entries: product, date, customer, usage. This already says you have only one record per customer per part per day. I.e. if the customer buys something this morning and comes back this afternoon with a supplemental purchase, do you edit the usage entry? Or do you include time with the date? If you don't include time, you have a denormalized table because in that case, the value in the usage field depends on something not retained in the table. This is a MAJOR BIG-TIME normalization no-no. Or did you leave out something in your description of your table?

I can't BEGIN to compute the number of passes this requires. As I look at it more and more, I'm not surprised it doesn't work without removing a lot of parts. The darned thing is at least an N-squared algorithm if not higher. For a table with a million elements, that's a TRILLION operations. I'm more surprised that your machine doesn't explode when trying to run that many sort ops in a single query. Just out of curiosity, do you have to compact & repair your DB after each attempted run of the full-blown query on a full data set? Has it EVER run to completion?

This might be a case where you would want to consider a layered approach. Generate your sums separately without all of that group-by stuff (other than the product number). Then in a separate query, JOIN the ordered data set to the separate summation set. I think you are trying to do too much in a single query when a divide-and-conquer approach will probably work better.
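A sketch of the kind of split I have in mind (the query names are mine; it assumes leadtime is stored on every row, which your WHERE clause suggests):

Query 1, saved as qryPartDates - the distinct reference points:
SELECT DISTINCT produkt_nr, [date], leadtime FROM simulatie;

Query 2, saved as qryLeadtimeTotals - the windowed sums, grouped on only two fields:
SELECT d.produkt_nr, d.[date], Sum(s1.usage) AS usagetotal
FROM qryPartDates AS d INNER JOIN simulatie AS s1 ON s1.produkt_nr = d.produkt_nr
WHERE s1.[date] Between d.[date]-d.leadtime And d.[date]
GROUP BY d.produkt_nr, d.[date];

Query 3 - stitch the totals back onto the detail rows:
SELECT s.produkt_nr, s.[date], s.customer, s.usage, t.usagetotal
FROM simulatie AS s INNER JOIN qryLeadtimeTotals AS t
ON (s.produkt_nr = t.produkt_nr) AND (s.[date] = t.[date]);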
 
Thank you for all the ideas on how to perform the query. I should have added to the previous question that I am not an experienced Access (or SQL) user; I only learned Access last week.
The GROUP BY is just what I get when using the design view of Access to make the query. I don't know how to get rid of it without hurting what I want to do.
Normalization doesn't ring a bell with me, but in any case all records are unique, if that is what you mean.

I found a way to accomplish what I was trying to do, but it was too slow. Instead of describing how I accomplished it, I should have told you what the goal was.
I have a table with four columns:
(item nr. ; date ; customer ; quantity ; leadtime)
A record represents that on (date) a (quantity) of (item nr.) was requested by (customer). The leadtime column states the leadtime of the item from the warehouse. I need a table with the same columns, but with one column added: the sum of the quantity of the specific item ordered between [date]-[leadtime] and [date]. So each entry in the result table depends on a lot of entries in the original table. I have tried to do this with the DSum function, but I couldn't get it working.
 
OK, questions.

1. You mention leadtime, but it isn't in the four-column table. Where does it originate? How does it relate to specific parts, or to the place you get the parts from?

2. The business purpose of this query eludes me. Are you trying to generate a performance statistic for various parts? An average lead-time so you can eventually implement just-in-time ordering? An average delay-time so you can rewrite delivery contracts with late-delivery penalty clauses?

3. Normalization is the key to making something that massive work with any reliability and accuracy. Google-search "Normalization" and look at the top several results. Limit your reading to a few articles from .EDU sites for universities or colleges you know. Or read what vendors of the major database products will tell you. And Access Help has a small section on how to normalize data. You acknowledged that you didn't understand about normalization. For something like this, it is not merely useful. It is almost mandatory because otherwise you will NEVER get your machine to deal with a data set of that size.

4. I understand why you get the GROUP BY now. When you convert a simple query to a summation query, all other items must be somehow qualified as GROUP BY or something similar. But if I understand this concept correctly, you don't give a rat's patootie about customer number for a parts lead-time. You care about part numbers.

This is why you should not have started this beast without normalization considerations. Your business model and your database probably do not match very well. (Heck, maybe they do - but it just seems odd.)

You want to see how many items were ordered within X (= lead time) of the date of a record. But this is a sliding window that is going to make your life difficult. EACH RECORD in your million-entry database represents a potential reference point for a sliding window reaching back X, since each record is unique. So you will be doing sums over an unknown number of records (unknown because you cannot predict a priori the clustering of orders of the same parts, which is strongly relevant to your lead-time issue).

I think part of the problem is that your lead time is not well enough defined in your question for me to see what you are really asking. There MUST be more data here than meets the eye, and without it, it is impossible to answer the question as asked.
 
The leadtime is a fifth column; I did not point that out clearly in the first question. It depends only on the product id (produkt_nr).
 
The question still remains: if the lead time is actually part of the data set, your computation doesn't help unless you have a SIXTH column that allows you to compute the delta-time between the order and the delivery.

I.e. OK, Joe Blow ordered a left-handed dumiflanger on the 10th. It arrived on the 12th. That is a two-day LAG time (since the delivery occurred AFTER the order was placed). Now, on the other hand, Jack Frack ordered a left-handed dumiflanger on the 13th, and if you had enough in stock left over from Joe's order, Jack actually had a one-day lead time.

I know you are doing a simulation (because your table name says so) but the catch is, what are you simulating? Based on the query, you are doing something AFTER THE FACT. There still aren't enough dates there to make sense. OK, I'm dumb about some things and freely admit this could be one of them. But... unless you are externally simulating timed deliveries of items on a periodic basis, you still have nothing in particular to compute.

So the point remains, what am I missing here?
 
What I am simulating is the order fulfillment under an (S-1,S) ordering policy.
In this ordering policy, the total of outstanding orders + inventory on hand is always S. Say S is 10.
When someone orders 4, then 4 is ordered from the external supplier and 4 is delivered to the customer. The order at the supplier arrives after leadtime days. So, to see what the inventory on hand is at the time a certain order is placed, you only need to look at how much was ordered in the period (leadtime) before the order is placed. Those are the orders that haven't arrived yet.
For instance, if the leadtime is a month and someone orders an item at the end of January, then by looking at how much was ordered in January you can say whether the order can be satisfied from the on-hand inventory or has to be backordered.
I hope that clarifies it. Problem is, normally I would use C# to simulate the ordering process, but C# isn't available at my current employer. Therefore, I need to do it all with Access.
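To make the end goal concrete, once the query produces usagetotal, something like this is what I am after (the basestock table and its S column are made up here; qryUsageTotals stands for the saved version of the query from my first post, and since the Between in that query includes the order itself, its own usage is subtracted back out):

SELECT q.produkt_nr, q.[date], q.customer, q.usage,
       b.S - (q.usagetotal - q.usage) AS onhand_at_order,
       IIf(b.S - (q.usagetotal - q.usage) >= q.usage, "filled", "backordered") AS order_status
FROM qryUsageTotals AS q INNER JOIN basestock AS b ON q.produkt_nr = b.produkt_nr;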
 
You have posted a related question elsewhere, which I have answered. I still think this is looking at too much data by doing this self-join.

A self-join, even when implemented correctly, involves generating a lot of row combinations that eventually get rejected. And it is an N-squared algorithm just to reject the inappropriate combinations.

The answer I posted for the other question might make it clearer. You need to linearize the operation and get rid of the N-squared issues.
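This is not necessarily what I posted in the other thread, but one way to shrink the quadratic part before you even start is to collapse the million detail rows to one row per part per day first, and only then do the windowed sum on that much smaller set (query names are mine; including leadtime in the GROUP BY is harmless because it depends only on the part):

Query 1, saved as qryDailyUsage:
SELECT produkt_nr, [date], leadtime, Sum(usage) AS day_usage
FROM simulatie
GROUP BY produkt_nr, [date], leadtime;

Query 2 - the sliding-window totals, now over far fewer rows:
SELECT d.produkt_nr, d.[date], Sum(d1.day_usage) AS usagetotal
FROM qryDailyUsage AS d INNER JOIN qryDailyUsage AS d1 ON d1.produkt_nr = d.produkt_nr
WHERE d1.[date] Between d.[date]-d.leadtime And d.[date]
GROUP BY d.produkt_nr, d.[date];

Then join the result back to simulatie on produkt_nr and [date], exactly as in the three-query sketch earlier, to get the customer-level rows back.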
 
