Need advice on DB design.

sumdumgai

Hello again. Looking for the great advice I always get on this forum. I'm working on a new application that tracks sales.


Let's say the application has these characteristics:

  • There are 800,000 salespersons each with unique ID (I).
  • The number of product groups being sold could be 6 to 12 (P).
  • Each product group could have up to 20 subgroups (S).
  • Each subgroup will have 12 attributes which is where sales quantities are given (A).
  • This sales data will be imported once a month.
  • A date for the monthly sales data will either be included or it can be generated.
Typical queries might be something like:

  • Show me most recent sales for Salesperson (I1) for Attribute (A1) for Subgroup (S1) for Product (P1).
  • Show me total sales for Attribute (A1) for Subgroup (S1) for Product (P1) for last 12 months.
  • Show me best selling Attribute (A1) for Subgroup (S1) for Product (P1) for last 12 months.
  • Show me most recent sales of all Attributes for all Subgroups for all Products.
  • Show me total sales of all Attributes for all Subgroups for all Products for last 12 months.
My questions are:

  • Since the number of potential fields for just this data could be 12x20x12, or 2,880, I can't put all products in one table (255 field Access limit). I suspect that Access' 2GB limit will also be a problem. Therefore, linking the tables from different databases (D1 thru D12) is an option. What would queries generally look like if I'm summing multiple product groups?
  • Is this too much to ask of Access?
  • If not, besides the obvious indexing that would be required, is there any other performance consideration?
Thanks as always for any comments or suggestions.
 
Given the number of sales persons, you probably are looking at SQL Server as a back end. Access could certainly still handle a front end.

However,

Since the number of potential fields for just this data could be 12x20x12, or 2,880, I can't put all products in one table (255 field Access limit).

No. Absolutely not. That is flat-file (EXCEL) thinking. You have a natural hierarchy of groups and subgroups. Each level of the hierarchy would be a table for that level only, with a child table for the next level down. Keep the levels separate. Then the number of fields does not enter into play.
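As a rough sketch (all table and field names here are placeholders, nothing prescribed), the top two levels of that hierarchy might be created like this in Access DDL, each run as its own statement, with the lower level carrying a foreign key back to its parent:

CREATE TABLE tblProductGroup
    (GroupID LONG CONSTRAINT PK_ProductGroup PRIMARY KEY,
     GroupName TEXT(50));

CREATE TABLE tblSubgroup
    (SubgroupID LONG CONSTRAINT PK_Subgroup PRIMARY KEY,
     GroupID LONG CONSTRAINT FK_Subgroup_Group REFERENCES tblProductGroup (GroupID),
     SubgroupName TEXT(50));

An attribute table would hang off tblSubgroup the same way, and the sales quantities go in a row-per-sale table at the bottom, so the 255-field ceiling never comes into it.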

Your typical queries are all easy. Look up two topics: TOP predicate and SQL Aggregate. These topics lead you to either the "most recent sales" (TOP when ordered by descending date), total sales for a given A, S, P = (SUM of a field within a query that has GROUP BY statements for A, S, P)
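For instance - assuming a normalized sales table with hypothetical names like tblSales, RepID, ProductID, SaleDate, and Quantity - the first two of your typical queries might be sketched as:

SELECT TOP 1 SaleDate, Quantity
FROM tblSales
WHERE RepID = 12345 AND ProductID = 678
ORDER BY SaleDate DESC;

SELECT ProductID, Sum(Quantity) AS TotalSold
FROM tblSales
WHERE SaleDate >= DateAdd("m", -12, Date())
GROUP BY ProductID;

"Best selling" is just the second query wrapped in TOP 1 with ORDER BY the summed field descending.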

Your issue is going to be a design problem first. You have a lot of data. But if you store too much with every individual sales record, you will blow out darned near anything you try to use for the data storage. Think "NORMALIZATION" right now while you are in the design phase or you will put a .44 magnum slug in your foot.

I'm not quite sure how the word "attribute" fits in. What is the product or service being sold? You show "product group" and mention a subgroup and an attribute - but what is actually being sold and how do these things fit in?

Regarding your question on indexes and performance... with this many sales persons and the detailed volume implied by 800,000 sales reps, if you DON'T index stuff, you won't HAVE any performance whether we are talking native Access, SQL Server, or ORACLE ENTERPRISE SERVER on a really high-performance back-end machine.
 
Thanks loads, Doc.

Think of the app this way.

Product groups are vehicle brand names (e.g., Ford, GM, Kia), subgroups are types of vehicle (e.g., sedan, SUV, truck), and attributes are colors (up to 12 of them).

Budget plays a big role so we're trying to fit this into Access. A fast query response time will not be as important as getting reports, even if they run a while.
 
Normally I wouldn't post just to confirm good advice, but I want to underscore how important normalization is here.

With the amount of data you are talking about, properly structuring your tables makes or breaks this thing in Access. Set it up like the spreadsheet method you implied in your initial post and Access just won't be able to accommodate your data. However, if you properly normalize your data, Access should be able to handle it.
 
I understand normalization to mean setting up tables so that duplication of data is reduced as much as possible. Also, that searched fields are indexed. I know it's hard to describe in words here, but given my examples of Product Groups, Subgroups and Attributes, can someone please explain a little more how this data would be normalized?
 
Suppose you have sales data by month for 3 sales people. Using a spreadsheet you use 13 columns (SalesPerson + 12 months) and 3 rows (one per sales person). In Access you use 3 columns (SalesPerson, SalesMonth, SalesAmount) and 36 rows (3 sales people * 12 months).

That's the difference we are talking about. The Access method (aka normalization) is more efficient for a database.
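And if a report ever needs the spreadsheet shape back, a crosstab query can pivot the 36 normalized rows into the 13-column view. A sketch, assuming the three-column table is named tblMonthlySales:

TRANSFORM Sum(SalesAmount)
SELECT SalesPerson
FROM tblMonthlySales
GROUP BY SalesPerson
PIVOT SalesMonth;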
 
More info:

Data will arrive as a delimited text file, 800,000 records with 450 fields per record. The first few fields are sales person identifiers; the remaining fields come in sets of 12, with each field in the set representing a different attribute and containing the quantity sold. The header will identify the group and subgroup. So, the fields may look like this:

sales id, 1A-1, 1A-2, 1A-3, ... ,1A-12, 1B-1, 1B-2, 1B-3, ... , 1B-12,... , 12A-1, 12A-2, 12A-3, ... , 12S-1, 12S-12

where the leading number is the group number (6-12), the alpha character is the subgroup (A-Z), and the characters after the hyphen represent the attribute number (1-12).

I was thinking that I would first split this .txt file horizontally by group number, with sales id included in first field of each split file. Then, I would import these files into separate databases. The tables would then be linked.


Note that every sales person will have sales. Also, less than 5% of the possible sales buckets, on average, are non-zero. In the most recent monthly data, the group with the most subgroups has 19 of them, with 12 attributes each. With 780,000 sales persons, that's about 180 million sales buckets. So, with 5% being sold, about 8.9 million buckets will be non-zero.


Comments please?

Thanks.
 
No, I'd split the 450 fields in the .txt file horizontally by group number into separate .txt files. To analyze the data, I split the .txt file vertically into multiple files of 10,000 records each, and then I imported one of those files into Excel. I found that one group has 2 subgroups, and another has 19 which is the most. So, in the case of the group with the 19 subgroups, with each subgroup having 12 attributes, there would be 228 fields (plus a sales person ID field) that I would import into Access.
 
I tried to import a .txt file with 229 fields and 780,000 records. Access said it ran out of resources, which I did not expect. There may be a problem with that .txt file. I'll take a look at your solution.
 
OK, "ran out of resources" is a different question. Not the first time we have seen this.

"Resources" is a reference to a Windows data structure that is used to handle stuff of different types including graphic images, icons, programs, and files. You often run into the problem of resource depletion when something you are doing forces you to handle too much at once or too many different things at once. Let me leave this idea, though, because I think normalization comes first.

You have a sales person. I infer that your spreadsheet is rows of monthly sales broken down such that each cell represents a different product group, subgroup, and attribute. That is "flat-file" layout and is what spreadsheets do.

The Access equivalent is that you would have a table of sales reps. Each rep has a RepID.

You would have a table of products where group, subgroup, and attribute are fields. You would have a ProductID that uniquely identifies a single combination of group, subgroup, and attribute. From your discussion, you might have maybe 1200-1500 combinations of these.

The master sales table would have as fields the RepID, ProductID, and whatever you use as a date, plus a quantity of sales - BUT only if not zero. You will STORE NOTHING at all if nobody bought the Baby-puke green Cadillac Escalade.
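Sketched out as DDL (the names are placeholders only, and a salesperson table such as tblReps is assumed to already exist):

CREATE TABLE tblProducts
    (ProductID COUNTER CONSTRAINT PK_Products PRIMARY KEY,
     GroupName TEXT(30),
     SubgroupName TEXT(30),
     AttributeName TEXT(30));

CREATE TABLE tblSales
    (SaleID COUNTER CONSTRAINT PK_Sales PRIMARY KEY,
     RepID LONG CONSTRAINT FK_Sales_Reps REFERENCES tblReps (RepID),
     ProductID LONG CONSTRAINT FK_Sales_Products REFERENCES tblProducts (ProductID),
     SaleDate DATETIME,
     Quantity LONG);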

To import the dataset, you need to learn how to open an Excel Application Object (look up that topic for reading material). You would open your spreadsheet and write a VBA loop to scan that big, ugly puppy one row at a time, traversing each cell to find a non-zero sale. You have the Rep ID from the first couple of cells in the row, and you know which cell you are reading by its position within the row. So you can figure out which Product ID to use when entering a non-zero sale. Inserting a sales record would probably be done via a recordset that stores RepID, ProductID, date, and quantity - four fields per non-zero entry.
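However you add the rows - the recordset approach just described, or an append query fired from the same loop - every non-zero cell boils down to one short row, conceptually something like this (values made up for illustration):

INSERT INTO tblSales (RepID, ProductID, SaleDate, Quantity)
VALUES (480012, 137, #2019-07-01#, 3);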

OK, 780,000 rows later, you have finished this big ugly puppy. Close the object. You now have a list of ONLY those products that were sold.

To build your reports, you would now write a JOIN query to tie together each sales person ID to the sales person details. AND that same query can JOIN a second time to the Product table to tie together each product with its group, subgroup, and attribute.

This query can then be the basis for all of the things you asked about in your question about queries, with judicious use of the SQL Aggregates and TOP predicates.
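Using the placeholder names from above (and assuming a RepName field in tblReps), that backbone query might look something like this; saved once, it can feed every report, with TOP and aggregates layered on top:

SELECT r.RepName, p.GroupName, p.SubgroupName, p.AttributeName, s.SaleDate, s.Quantity
FROM (tblSales AS s
INNER JOIN tblReps AS r ON s.RepID = r.RepID)
INNER JOIN tblProducts AS p ON s.ProductID = p.ProductID;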

BE WARNED: You said you didn't mind if it would be slow. This WILL be slow. But with that mass of data, you don't have a choice. You are looking at storing roughly 225 million fields. 225 million of ANYTHING will take a while to handle.

There is also the issue of "blowing out" resources. Look up in this site the topic "maxlocksperfile" which has some data on resource problems.
 
Let's say the application has these characteristics:

  • There are 800,000 salespersons each with unique ID (I).
  • The number of product groups being sold could be 6 to 12 (P).
  • Each product group could have up to 20 subgroups (S).
  • Each subgroup will have 12 attributes which is where sales quantities are given (A).
  • This sales data will be imported once a month.
  • A date for the monthly sales data will either be included or it can be generated.

This implies the following
Each record in the "Sales" table would have
SalesID - Autonumber
Salesperson - Reference to the Salesperson record, number
ProductCode - Reference to ProductTable record.
DtTransaction - DateTime that holds WHEN the transaction took place.

Your product table would hold your group/subgroup/attribute information. Once more, looks like it will be an autonumber primary, descriptive "Name", then your group/subgroup/colour/what have you information.

Does that help with starting to identify what should be in your table?
 
Thanks for ideas and help with setting up the tables.


I've done some analysis of the incoming source .txt file. The 780,000 records hold about 350 million sales quantity buckets (Excel cells). Of these, about 3% are non-zero. That means, by importing only non-zero sales quantities, I'd create about 10.5 million Access rows, each with at least these four fields:
SalesID - Autonumber
Salesperson - Reference to the Salesperson record, number
ProductCode - Reference to ProductTable record.
DtTransaction


Also, this would happen every month, adding to the Sales table. My queries need to show total sales by salesperson(s) and product group(s), for the current month or for a range of months.


Would I be better off doing a separate database for each product, separate database for each month, or something else?


Thanks again to all who have helped.
 
I don't think this has been brought up: after reading a number of comments, I notice the 2 GB limit is quoted, but there is also the table-level limit of 1 GB. I don't think that has changed with Office 365.
Over the years I have dealt with importing .txt files with up to 5 million records, but in that case I added the data to the following tables:


tblRecordingArtists
tblArtistsAndRecords (this table has two long-integer fields)
tblRecordings


There were a number of additional sub-tables, but they are not important here.
The recording artists ended up with 1.7 million entries, the recordings with 800k, and the artists/records table with 5.5 million, which is 650 MB.


These tables have been spread across 8 back-ends, with the tables in each maintaining the normalization.


I know you said SQL Server couldn't be used, but by the sounds of it I don't think you will have a choice, as the data is just too massive. Even importing and running a function as I described above could take hours. I created a VB program to do my importing and exporting to my web MySQL.


Good luck with your project
 
Would I be better off doing a separate database for each product, separate database for each month, or something else?

Definitely not a separate database. Having extra files makes it possible to keep more data in Access back-end files, but every new BE file adds maintenance headaches. Further, you cannot spread a single table over multiple BE files.

each with at least these four fields:

Does it ever happen that someone sells TWO units in the same month? Or does that actually get reflected in the big ugly spreadsheet? If so, you have a fifth field, Quantity Sold, a number which will never be negative. If you take the stance of not storing the slots with zero quantity, then you simply make the dataset sparse.

The 780,000 records hold about 350 million sales quantity buckets (Excel cells). Of these, about 3% are non-zero.

Pardon the pun, but this is where Access excels. This is a perfect example of what we call "sparse" - i.e. lots of possibilities but few actualities. SQL Aggregates don't care that something is absent because a COUNT won't hiccup and neither will a SUM.

Your proposed 10.5 million records would be relatively short. Including my "count", you have:

SaleID: Autonumber PK (4 bytes)
RepID: Long FK (4 bytes)
ProdID: Long FK (4 bytes)
TheDate: Date (8 bytes)
TheCount: Long (4 bytes)

That's 24 bytes per record, about 250 Mbytes for the table. I am not sure about exact sizes of indexes, but since you want to query those things, each one of those except the count field would require an index. The PK needs one for structure, and date, rep, and product each need one too. You could hold up to 3 months of data in that table before archiving would be required, and that would let you do something like quarterly summaries.

However, I'm not so sure that SQL Server or MySQL has the same limits on table size if you were to use one of them as a back-end. And if so, then your sparse data squishes down enough that you might get even more months - eventually enough to do full-year trends. Given that you have 250 Mb per month, that's 3 Gb per year, so an Access BE isn't going to do what you want.

Splitting those yearly quantities across monthly BE files MIGHT be possible but you would have to learn how to open and close BE files selectively, open up the tables with fully-qualified prefixes for each separately opened DB, and do your queries in a way to load your summary data to a staging table before producing a long-term report. And you DO run into a limit on the number of simultaneously open DBs, which would tend to limit how much you could summarize at once. Which is why I mentioned a more dynamic approach of opening and closing stuff only when you need it.
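One Access-specific trick that would matter if you went this route: Jet/ACE SQL can read a table straight out of another .accdb with the IN clause, so a monthly BE could be rolled up into a local staging table (here called tblStaging) without a permanent link. A sketch with a made-up path and the placeholder names from earlier:

INSERT INTO tblStaging (RepID, ProductID, TotalQty)
SELECT RepID, ProductID, Sum(Quantity) AS TotalQty
FROM tblSales IN 'C:\SalesData\Sales_2019_07.accdb'
GROUP BY RepID, ProductID;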

My best advice is that the multiple BE method would be tedious and of very high difficulty to maintain, whereas some sort of active BE like SQL Server, MySQL, or even some other ODBC-capable product with high capacity might be better given your volume of data. Which of those IS better? There, I must defer to my colleagues. The few cases of SQL Server that we had at the office were a different project so what I know about that is what I picked up in comparing notes with the gang at my last job before I retired. But it worked well for what they were doing.

I looked up SQL server sizing specifications. Here is the reference:

https://docs.microsoft.com/en-us/sq...fications-for-sql-server?view=sql-server-2017

I can only tell you from personal experience that ORACLE Enterprise Server might have the capacity you want because ORACLE allows you to split things across multiple files that it transparently manages for you (once you set them up). You can even manage that across multiple disks if you have some sort of storage array that acts as network-attached storage. Due to protocols involved, Access "native" BE files don't work quite as well with network storage arrays because that still depends on a server supporting SMB protocol. Array managers tend to not like SMB from non-server systems reaching into their arrays. They want only their own servers doing that reaching. They get funny like that.
 
Does it ever happen that someone sells TWO units in the same month?
If I understand your question, the answer is yes. The number of items sold (i.e., > 0) per sales rep per month ranges from 1 to more than 100.

When you say back-end, I was thinking more along the lines of linked tables (or do linked tables always imply a back-end?).

What if I created a separate database for each product (group), each with one product table and loading into that table only the 5 fields that have been suggested for that product. I could append monthly non-zero sales data for that product there. Then I could link that table to a 'master' database that contained sales rep and product information?

Thanks.
 
do linked tables always imply a back-end

In a word, yes. Tables are either linked or local. Local tables are in your primary app file which is usually your front-end or FE file. Linked tables are in another file that is usually your back-end or BE file.

Here is my take based on personal historical perspective. Adding separate files (distinct BE files) adds to the maintenance overhead of the project because to do maintenance, you typically have to take down all of the FE files and all of the BE files AT THE SAME TIME. You need some guaranteed common point (yet another BE file) so that you can store data regarding your maintenance time. And trust me, for files this big, you WILL need maintenance. They will eat your socks if you aren't careful.

The more "things" you have to touch, the worst this gets. (A) More opportunities to make mistakes. (B) More need for a careful backup system. (C) More downtime needed to perform the maintenance tasks. (D) More opportunities to get confused. (E) More resources required - and you already have an issue with that WITHOUT the multiple BE files being in place.

I need to explain (B) a bit better. When you have a database, it has an implied or explicit "instantiation" date or number or SOMETHING that says "All files involved with this overall database were synchronized as of" a given date or sequence number. ORACLE used instantiation numbers. I wouldn't rule out other systems that might use dates. The idea, though, is that if you have to RECOVER a blown database, you must be able to recover files such that you can return EVERY COMPONENT FILE to the same instantiation number or time. If you cannot do that, you cannot recover the database reliably and you risk data loss.

Using SQL Server and a Terabyte disk, you can hold a lot of big data in a single place. Big file - but easy to maintain, easy to back it up.

Using ORACLE Enterprise Server and one or more large disks, you can hold a lot of big data in a couple of big places but ORACLE manages the files for you if you set aside the required backup time.

Using Access "native" BE files, you are doing your own data management and your own backup. You are asking for a major accident to happen. This is based on nearly 50 years of program engineering experience that includes systems analysis, product management, security management, and network usage management. And in those 50 years, I have NEVER seen a system that did not occasionally require a data restoration. It WILL happen.
 
Don't mean to be argumentative and I really do appreciate all of your comments, but these databases will be read-only and updated once a month with backups. There will not be any 'transaction' queries where instantaneous results are needed. Once the data retrieval queries are developed, there shouldn't be any more database updates. Does that change anyone's opinion of using Access?
 
Looking at the structure again, I changed my mind about something. You are making a lot of "roll-up" reports but I don't see it likely that you will want to go back and reference anything by the sales ID that would be a PK/Autonumber field. I also don't see any child table depending on the five-field sales record. If there are no child tables and everything you are doing is based on roll-ups, you don't need the SalesID PK field - or any PK. But if you DID need one, you have a natural compound key in the form of the RepID, ProductID, and date.

The thing about PK fields is that everybody thinks you need them. But that ain't so. Sometimes you need them if something is going to depend on something else. When you have dependency, the independent item needs an ID but the dependent item might not. It's all in how you use that dependent item. In this layout, Rep ID depends on a representative table. Prod ID depends on a product table. But what field in what table depends on an individual sales record? At the moment, I don't see one.

So your table changes from 24 bytes to 20 bytes and shrinks from about 250 Mb to a little over 200 Mb. So storing a year of data means 2.4 Gb total before you add in indexes. Which means Access STILL can't fit one year of that data into a single table.

But if your sales were differently recorded, you might get some space back here. I'm going through the exercise to show you that it won't be enough.

You said you had limited group size (<100) and limited subgroup size (<100) and limited attribute size (<100). This question is INCREDIBLY important: How likely is it that the number of groups, subgroups, or attributes will change? BECAUSE you could store the info like this:

RepID: Long, FK to representative detail table - 4 bytes
TheDate: Date, 8 bytes
Group: BYTE, FK to group-name table, 1 byte
Subgroup: BYTE, FK to subgroup-name table, 1 byte
Attribute: BYTE, FK to attribute-table, 1 byte
TheCount: WORD, 2 bytes (unless you have some go-getter sales persons who can sell more than 30,000 units in a month single-handedly)

This would make your group, sub-group, or attribute roll-up queries trivial since there is no JOIN requirement. That would make everything a lot faster, and the only time you need to do the lookup is to print some names on your report.
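For instance, a twelve-month roll-up at group/subgroup level would need nothing but the sales table itself (field names as sketched above, with the code columns called GroupCode and SubgroupCode here since Group is a reserved word):

SELECT GroupCode, SubgroupCode, Sum(TheCount) AS TotalUnits
FROM tblSales
WHERE TheDate >= DateAdd("m", -12, Date())
GROUP BY GroupCode, SubgroupCode;

The join back to the name tables happens only in the final report query, against a handful of grouped rows rather than millions of detail rows.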

That means one sales record is 17 bytes. Your 10.5 million records now take about 180 Mb. That means you can now fit 5 months into a single table.

The reason I went through this is to show you how really tight for space you would be. You STILL will be unable to fit six months worth of data into a single table with a native Access BE file. You are STILL looking at a data overload situation.

You appear to be desperately seeking an Access solution but I have to say this does not look like Access can do this without a LOT of really hard, complex, and perilously messy work. This is probably a job for an active SQL engine like SQL Server, ORACLE, MySQL, or one of the other ODBC compliant engines.

Perhaps we can understand your persistence by asking this: Did the boss tell you it HAD to be done in Access? Because I think we are going to do you the greatest favor by suggesting that you would do better with an Access FE and something else for a BE.

AND I have made it a point to include my detailed thinking on the layout in case one of my colleagues sees something I have overlooked. In fact, I would WELCOME anyone else telling me why I'm totally wrong-headed about this one. But at the moment, I don't see it.
 