Too much data for Access - what are options?

sumdumgai · Sep 29, 2023

Hello all,
We're trying to build a new Access database with .dat data that is downloaded from a remote site. The problem is that there is too much data and the 2GB limit size on the database is being exceeded. Can I please get some recommendations on what to do? The data coming in cannot be scrubbed or filtered. There is a field in the data that contains a yyyymm value for each record. We need to keep a rolling 24 month's of data.
Any suggestions are much appreciated.
Thanks.

NauticalGent · Sep 29, 2023

Check out MySQL

sumdumgai · Sep 29, 2023

Thanks. Is that free software? Is it similar to Microsoft SQL Server 2022, which is free and can handle 10GB?

plog · Sep 29, 2023

sqlite is free and can handle 10GB. If you are dealing with an SQL database they are are going to function similiarly--by using SQL. All interfaces are going to be different though, but not by much.

CJ_London · Sep 29, 2023

Is it similar to Microsoft SQL Server 2022, which is free

to b clear, the express version is free, not the full sql server version

jdraw · Sep 29, 2023

plog said:
sqlite is free and can handle 10GB. If you are dealing with an SQL database they are are going to function similiarly--by using SQL. All interfaces are going to be different though, but not by much.

plog,

Have you used sqlite as backend with Access as frontend? Just curious, I have not seen this type of set up in the forums.
I did find this from devart.

plog · Sep 29, 2023

No, but I know it has ODBC drivers to connect to Access. I used it in a raspberry pi project using python. I used SQLiteStudio to work with the database:

SQLiteStudio

sqlitestudio.pl

The neat thing about SQLlite is that its a standalone file like an Access database. Need a copy--just right click, copy and paste it in windows explorer.

sumdumgai · Sep 29, 2023

Thank you all for the MS SQL suggestion.
What if I stuck with MS Access and loaded the data by year into different databases. Then link the tables in another pseudo database, and run queries there against all three data databases (e.g., 2021, 2022, 2023). Every month, either add a new "month" table to the appropriate database, or create a new (e.g., 2024) database? Thoughts?
Thanks.

isladogs · Sep 29, 2023

Bad idea
You cannot enforce referential integrity between tables in different Access databases.
Once you start down that route you will make additional work for yourself and the outcomes will be worse than if you move the BE data to SQL Server or once of the other alternatives mentioned above

CJ_London · Sep 29, 2023

usually these data 'dumps' are not normalised - if that is the case with your data you might find normalising the data is a way forward. As an example, a file (.csv) a client downloads is typically around 160k a month. Normalising the data adds around 20k to the size of the access BE - and that includes indexing.

sumdumgai · Sep 29, 2023

Thanks CJ. Don't know what you mean by normalizing. The data comes in as records separated by LF and CR's. Can you please explain?

NauticalGent · Sep 29, 2023

Pat Hartman said:
I think SQL Server Express is also limited in size but it it larger than Access.

10gb according to their website

Gasman · Sep 29, 2023

sumdumgai said:
Thanks CJ. Don't know what you mean by normalizing. The data comes in as records separated by LF and CR's. Can you please explain?

Google Search

CJ_London · Sep 29, 2023

sumdumgai said:
Can you please explain?

lets say your records are to do with sales, so you might have as headings

customerCode, customerName, InvoiceNo, invoiceDate, itemName, Quantity, price, value
abc12345............alphaCo.................12345...........12/9/2023.....gadget1......1..................10......10
abc12345............alphaCo.................12345...........12/9/2023.....gadget2......5..................8.........40
abc12345............alphaCo.................12345...........12/9/2023.....gadget3......2..................15......30

the first two columns are repeating for every row of the invoice, as are the second two

the 8 digit customercode takes 26 bytes whilst the customer name (say average name is 18 chars in length) takes 46 bytes - per row

normalising means you store these two fields, just once, in a customer table with a long primary key which takes four bytes - plus the 62 bytes for the text.

but this means you can replace the first two columns with a 4 byte number to link to the primary key in the customer table.

So simple maths. Save 72 bytes per row (less the 4 byte number) means over three rows you have have saved 204 bytes for a one time 'cost' of 66 bytes. And every time you you make a sale to that customer - you save another 66 bytes because you already have the customer details.

A similar principle would apply for the repeating invoice 'header' detail (number and date) and also for the items

sumdumgai · Oct 13, 2023

Just to update you, I've managed to load all of the data (24 months worth) in a database. I first had to remove the table indices and then load the data one month at a time. After all months were loaded, I reset the table indices. The database size is 1.9 GB. There is another table that is loaded with demographic data. I had to put this data into a second database. Both of these databases will be back end databases. A third database will link to the back ends. This is where additional tables, queries, etc. will be created for reporting purposes.

The_Doc_Man · Oct 13, 2023

CJ's comments on normalization are true and appropriate. However, there comes a moment here where you need to LOOK at the data to decide which columns can reasonably become candidates for normalization. Another equally important issue is whether it pays to normalize or to otherwise reformat. A little bit of up-front skull sweat usually pays off BIG time for big data sets.

For example using CJ's data from post 16 and taking an assumption that your data comes in as all text (because I remember reading about your other post with a big CSV data set)...

Reformat examples first:

If your import is purely CSV and you don't do conversions, you probably get everything as text. Look at CJ's field called "Invoice No." which is a 5-digit number. If you store this as a string, it takes up 5 bytes. If it will ALWAYS be purely numeric, though, you can store it as a 4-byte LONG integer. That's a savings of one byte per record. Not so much, you say? If you have 100,000 records, you just saved 100,000 bytes.

If you have a date in string format, you assume dd/mm/yyyy text format which is 10 bytes. If you convert the string date to a DATE field, you have the same information in 8 bytes (the size of a date field). Then considering 100,000 records, you have saved 200,000 bytes. AND date order works OK sorting date fields, but text fields have to be in yyyy/mm/dd format to sort correctly by date.

Normalization:

Note CJ's customer code and customer name examples. If you have 100,000 records but only 1000 distinct customer codes and customer names, make a customer table that includes an autonumber plus the customer code and customer name. If you make the customer table first, then when you import the main data and you only store the unique autonumber in a record (by looking up the customer code from the customer table), the math says you store 1000 customers x (autonumber, 8-character code, and at least 7 character name) = 19,000 bytes (minimum) in the customer table + 100,000 records x 4 bytes in the main table = 419,000 bytes between the two places. If you DID NOT normalize but instead retained the code and name (at least 15 bytes) x 100,000 that would be 1,500,000 bytes - three times as many as if you normalize. And I'm betting customer name could be more than 7 bytes.

This savings in space utilization is WHY you spend time up front planning how you want to store things and WHY you want to include normalization among the tools in the tool box. That up-front analysis and planning saves your skin more often than you would ever realize.

sumdumgai · Oct 13, 2023

Thank you. Data comes in as a .DAT file, almost 8 mil. records I found after using a free text editor that can accommodate the file size. I'm reading the data into an array, splitting the file into lines and then lines into fields, adding only those records that match a year_month value to the array. The array is then used to populate the table, one month at a time. I do the same for the next year_month, 23 more times. By doing the processing one month at a time, the array does not get big enough to choke Windows.

Cotswold · Oct 14, 2023

plog said:
sqlite is free and can handle 10GB. If you are dealing with an SQL database they are are going to function similiarly--by using SQL. All interfaces are going to be different though, but not by much.

SQLite is not a multi-user database. It is however, serverless, self-contained and cross platform. As plog has said it can contain around a dozen gigabytes of data. However, SQLite puts everything into a single file and very large datasets can stress its capability. Many modern filesystems are capable of handling terabytes of data, that doesn't mean they are very good at it. If you need to store several gigabytes or more of data, it might be wise to consider a more performance orientated product.

Cotswold · Oct 14, 2023

sumdumgai said:
Thanks. Is that free software? Is it similar to Microsoft SQL Server 2022, which is free and can handle 10GB?

MariaDB is basically the same as MySQL and written by the original MySQL team with very few language differences. It is also free to use to the best of my knowledge.

cheekybuddha · Oct 14, 2023

Cotswold said:
SQLite is not a multi-user database

That is incorrect.

Though caution should be used in a multi-user situation.

From their "When to use" page:

Situations Where A Client/Server RDBMS May Work Better

High Concurrency

SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.

Too much data for Access - what are options?

Registered User.

Ignore List Poster Boy

Registered User.

Banishment Pending

Super Moderator

Super Moderator

Banishment Pending

Registered User.

Access MVP / VIP

Super Moderator

Registered User.

Ignore List Poster Boy

Enthusiastic Amateur

Super Moderator

Registered User.

Immoderate Moderator

Registered User.

Well-known member

Well-known member

AWF VIP

Similar threads

Users who are viewing this thread