Reporting on 5M Rows

winshent

I've recently joined a new company and this is my first time working with what I would call Big Data. We are running Access 2003.

I have been tasked with trying to speed up the reporting on a monthly extract that comes from our data warehouse team. Currently, we append the flat file data to our MS Access 'master' database. It's loaded into a single table which currently holds around 5M rows covering 12 months of data, and the file is sitting at around 1 GB.

My predecessor built an 'acceptable' db from a performance perspective by splitting the data down into individual country tables. Reports are then dynamically pointed at the specific country table that a user specifies when running reports available through the front end. However, this obviously means the data cannot be analysed across a number of countries.

Users then copy the master db onto their C: drives and analyse it from there, because performance when running over the shared drive is poor.

The structure of the flat table is as follows:

Code:
  YearMonth   : LONG (4)
  MT          : TEXT (255)
  FT          : TEXT (255)
  Country     : TEXT (255)
  Company     : TEXT (255)
  BD          : LONG (4)
  BN          : LONG (4)
  Programme   : TEXT (255)
  Product     : TEXT (255)
  Descriptor  : TEXT (255)
  TC          : TEXT (255)
  Count       : LONG (4)
  Amount      : DOUBLE (8)
  CB          : DOUBLE (8)
  Fee         : DOUBLE (8)

It's obvious that this needs to be moved to a database server, and I will be driving this forward, but that's not going to happen any time soon.

I want to create a shared environment with some multi-user apps in both Excel and Access for the users, both of which I am experienced in.

So, for those of you who work with big data: what would your approach be to optimising this?

If I can do anything with this, then the next db I'll be looking at contains 7.5M records and is hitting 2 GB.
 
I would set up a properly normalized database and extract the data from my flat file into that structure. This proper structure will include the correct number of tables, the correct field types for each field and indexes on the appropriate fields. Doing that will improve performance and reduce the file size.

With just the cryptic field names and their current data types from the source file, we can't help you specifically. If you can post some sample records I can give specific pointers, but for now, read up on normalization and indexes.
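
To make that concrete, here is a minimal sketch of the kind of structure being suggested: small lookup tables for the repeated text values, and a slimmer main table holding only numeric keys and the measures. All table, field and constraint names below are illustrative rather than a prescription, and the types are only a starting point.

Code:
  Sub CreateNormalisedTables()
      ' Minimal sketch of a normalised layout: lookups for repeated text,
      ' a fact table of keys and measures. Names and types are examples only.
      ' In Jet DDL, BYTE = 1 byte, SMALLINT = 2-byte Integer, LONG = 4 bytes.
      With CurrentDb
          .Execute "CREATE TABLE tblCountry (CountryID BYTE CONSTRAINT pkCountry PRIMARY KEY, CountryName TEXT(50));", dbFailOnError
          .Execute "CREATE TABLE tblCompany (CompanyID SMALLINT CONSTRAINT pkCompany PRIMARY KEY, CompanyName TEXT(100));", dbFailOnError
          .Execute "CREATE TABLE tblFact (YearMonth LONG, CountryID BYTE, CompanyID SMALLINT, " & _
                   "[Count] LONG, Amount DOUBLE, CB DOUBLE, Fee DOUBLE);", dbFailOnError
      End With
  End Sub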
 
However, this obviously means the data cannot be analysed across a number of countries..

A simple UNION query will combine two (or more) of the country tables for analysis.
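
For what it's worth, a saved UNION query gives the reports a single object to point at; UNION ALL avoids the duplicate-removal pass a plain UNION does. The table and query names below are just examples standing in for the per-country tables.

Code:
  Sub CreateAllCountriesQuery()
      ' Saves a UNION ALL query stacking the per-country tables so they
      ' can be analysed together. Table/query names are examples only.
      Dim strSQL As String
      strSQL = "SELECT * FROM tblUK " & _
               "UNION ALL SELECT * FROM tblFrance " & _
               "UNION ALL SELECT * FROM tblGermany;"
      On Error Resume Next
      CurrentDb.QueryDefs.Delete "qryAllCountries"   ' drop any old copy first
      On Error GoTo 0
      CurrentDb.CreateQueryDef "qryAllCountries", strSQL
  End Sub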

I would set up a properly normalized database and extract the data from my flat file into that structure. This proper structure will include the correct number of tables, the correct field types for each field and indexes on the appropriate fields. Doing that will improve performance and reduce the file size.

Do this for as long as you can... caveat below:

If I can do anything with this then the next db I'll be looking at contains 7.5M records and is hitting 2Gb..

Access will break at 2 GB: that is the hard file-size limit for an .mdb. Given your current description you may want to split countries into entirely different databases (rather than tables). These databases can easily be daisy-chained (linked tables to external files) into a "Main" database for your data.
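
If you do go the separate-databases route, the linking can be scripted; a rough sketch, assuming made-up share paths, file names and table names:

Code:
  Sub LinkCountryData()
      ' Links the data table from each per-country back-end mdb into the
      ' "Main" database so it can be queried (or UNIONed) from one place.
      ' Paths, file names and table names are placeholders.
      DoCmd.TransferDatabase acLink, "Microsoft Access", _
          "\\server\share\Data_UK.mdb", acTable, "tblExtract", "tblExtract_UK"
      DoCmd.TransferDatabase acLink, "Microsoft Access", _
          "\\server\share\Data_FR.mdb", acTable, "tblExtract", "tblExtract_FR"
  End Sub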
 
Nobody has mentioned indexing yet. If your table(s) are not indexed then performance will be slow, and indexing the fields used for linking will make a massive difference.

The downside is that indexes will increase the size of the db and it will take longer to load the data initially, so you may still need to take other action.
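
For example, the indexes could be created in code as part of the monthly load, after the bulk append rather than before it, so the import itself is not slowed down by index maintenance. The table name below is invented; the field names come from the posted layout.

Code:
  Sub AddReportingIndexes()
      ' Adds non-unique indexes on the fields most likely to be filtered
      ' or joined on. Names are based on the posted flat-table layout.
      With CurrentDb
          .Execute "CREATE INDEX idxYearMonth ON tblExtract (YearMonth);", dbFailOnError
          .Execute "CREATE INDEX idxCountry ON tblExtract (Country);", dbFailOnError
          .Execute "CREATE INDEX idxCompany ON tblExtract (Company);", dbFailOnError
      End With
  End Sub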
 
The first thing to do is normalize.

Countries should be converted to a numeric code with the names stored in a lookup table. You can use any code, but the international telephone code is a good choice.

Similarly, convert the Company names and anything else that is repetitive. This will bring the storage requirement down substantially.

I would convert YearMonth to a date such as the first of the month.
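
As a rough sketch of both points, assuming placeholder names and that all the YYMM values fall in the 2000s:

Code:
  Sub BuildCountryLookup()
      ' Builds a country lookup from the distinct values in the flat table.
      ' A COUNTER key is used here for brevity; a telephone-code column
      ' could be used as the key instead, as suggested above.
      Dim db As DAO.Database
      Set db = CurrentDb
      db.Execute "SELECT DISTINCT Country INTO tblCountryLookup FROM tblExtract;", dbFailOnError
      db.Execute "ALTER TABLE tblCountryLookup ADD COLUMN CountryID COUNTER;", dbFailOnError

      ' Converting a YYMM Long such as 803 (March 2008) to a real date:
      Debug.Print DateSerial(2000 + 803 \ 100, 803 Mod 100, 1)   ' 1 March 2008
  End Sub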
 
Thanks a lot for the responses, guys.

I am in the process of normalizing the database at the moment. However, I wonder if normalizing and then joining from the main table onto the lookups would be slower? This is a reporting database, with no data entry required. Currently the 5M-row db is around 900 MB when compacted, although it does bloat up to around 1.2 GB when used, for some reason.

I am going to use the smallest data type I can get away with for each column, i.e. if a lookup will contain fewer than 255 rows then I'll use BYTE, else I'll use INTEGER for the primary and foreign keys.

Anyway, what I am thinking of is building a front-end form where users can specify a number of parameters. I'll use VBA to rewrite and save a 'sub' query definition using these parameters. Then I'll nest this into another query which links in all of the lookup tables to bring in the definitions.

My idea is that I'll be joining the lookups onto a filtered dataset of, say, 100k records rather than 5 million records.
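
Roughly what I'm picturing, with placeholder query, table and field names (the WHERE clause would change once the lookups replace the text columns):

Code:
  Sub RebuildSubQuery(strCountry As String, lngFromYM As Long, lngToYM As Long)
      ' Rewrites the saved 'sub' query from the form's parameters. The main
      ' report query then joins the lookup tables onto qrySubFilter, so the
      ' joins only ever see the filtered rows rather than all 5M.
      Dim strSQL As String
      strSQL = "SELECT * FROM tblExtract " & _
               "WHERE Country = '" & strCountry & "' " & _
               "AND YearMonth BETWEEN " & lngFromYM & " AND " & lngToYM & ";"
      CurrentDb.QueryDefs("qrySubFilter").SQL = strSQL
  End Sub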

In your experience, do you think this would work, or would it be a complete waste of time?

I would convert YearMonth to a date such as the first of the month.

I'm curious about this comment; currently we store it in YYMM format, so I could easily store this as an INTEGER, which is 2 bytes, as opposed to a DATE, which is 8 bytes.
 
I'm frequently dealing with very large volumes of data, so I can definitely say this is worthwhile doing. Depending on the amount of duplicated text, efficient use of number types, less the additional requirement for indexing, should see savings in the region of 50-70%.

The technique I use is to copy all the data from the source into a temporary table in a temporary db and add the relevant indexing. Then parse through the data, updating each table in your normalised set - e.g. customer, contact, invoice header, invoice line etc. Once done, the temporary db can be deleted.

The principle I use is:

1. Insert new records where the record does not already exist (sketched below).
2. Update existing records where the record has changed (or insert a new record with a timestamp to provide a history).

Clearly invoice records should not pre-exist, so action 2 is not required in this case - although a check for data integrity would not be a bad thing!
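
As a sketch of step 1 (names are placeholders; tblStaging stands for the temporary import table and tblCompany for one of the normalised lookups):

Code:
  Sub AppendNewCompanies()
      ' Step 1: append only the companies that are not already in the
      ' lookup, using an unmatched LEFT JOIN against the existing table.
      CurrentDb.Execute _
          "INSERT INTO tblCompany (CompanyName) " & _
          "SELECT DISTINCT s.Company " & _
          "FROM tblStaging AS s LEFT JOIN tblCompany AS c " & _
          "ON s.Company = c.CompanyName " & _
          "WHERE c.CompanyName IS NULL;", dbFailOnError
  End Sub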
 
I'm curious about this comment; currently we store it in YYMM format, so I could easily store this as an INTEGER, which is 2 bytes, as opposed to a DATE, which is 8 bytes.

Good idea. Maybe I should reconsider my propensity to store anything date related as a Date.

I put any serious databases on SQL Server so I have not really had to worry about size.
 
