Linking table from external database vs internal database performence

jaryszek · Feb 24, 2019

Hi,

following this topic:

https://www.access-programmers.co.uk/forums/showthread.php?t=303945

What will cause less database bloating while importing data from external sources?

Linking tables within database, importing data and unlinking them?
Or having separate database and linking from it?

Or there is no difference?

What do you use?
Best,
Jacek

theDBguy · Feb 24, 2019

Hi. The file size shouldn’t change much when you link because the data is stored elsewhere.

The_Doc_Man · Feb 24, 2019

What will cause less database bloating while importing data from external sources?

This might be a bit of difficult reading, but it is a common question and deserves a good answer. Hold on to your hat, this is a bumpy ride. I will explain the theory first, then give you the answer in a second post.

Database bloat will occur as the result of the way Access does either of two actions - specifically, the DELETE and UPDATE actions. But I have to digress to explain something about Access structure for you to understand how it works. The information I will discuss IS available online with a suitably worded search, so I'm not completely guessing - but I did not reverse-engineer Access so I am filling in some of this with my experience from other sources.

When you use Access, MSACCESS.EXE takes advantage of the memory management hardware by MAPPING your database file. When we talk about having to bring the database into memory, we are talking about something that in other versions of BASIC is called a "virtual array." What Access does is similar to, but not necessarily identical to a virtual array. The database file is treated as an array that becomes part of the virtual memory of the main Access image and routines. It is mapped starting at address 0x80000000 (= 2 billion, or the halfway point of the 4 GB address range afforded by a 32-bit address). This placement explains why you have a 2 GB limit on database files. And since it is easy to remap memory segments, this is why you can have more than one DB file open at once (like back-end, front-end cases and the idea of a 3rd DB that we have discussed in other threads.) Access just dynamically remaps what it has to remap to address each virtual array. And the fact that you have a limit in how many database files you can have simply indicates the size of the remapping array that access uses to manage multiple files.

All of the pointers inside the database file point within this 2 GB virtual address space. So when we talk about "collections" - those are really lists of pointers that point to the data structures that were collected. Pointers are just addresses within the virtual space and are limited to the upper 2 GB of the 4 GB address space.

OK, so you create a bunch of tables. Well, a table is a collection of records, so each table includes a list of its member records, which can be stored anywhere within the address space. The other collections (like the collections of table definitions, query definitions, form definitions, etc.) are also lists but those are smaller structures that are unlikely to cause bloat. BUT recordsets associated with tables can number into the thousands or even millions of members. And a million of anything is a lot. Here is where "bloat" comes into play.

When you create a record, you draw from the "empty" space of the virtual array (i.e. stuff not allocated yet) and take out enough space for that record. That empty space is the lowest unused address in the virtual address space. It increases from 2 GB towards the 4 GB end of the space. In a way, it is just stacking stuff one on top of the other.

Now you have the record - AND a pointer to it that is associated with the table's collection of records. That pointer identifies where you find the record the next time you want to use it. Not a big issue. Every record is stored the same way. And as long as you have only done record creation/insertion, no problems occur. You just pack the actual records next to each other and they tightly fit adjacent to each other in the address space like a can of sardines is tightly packed. No wasted space - so far.

But here come the two actions that screw up the works. First, you can choose to delete a record. If you do, that tightly packed address space suddenly has a record that you aren't going to use any more. So you would think you have a gap in memory there. But it is the size of one record. You would say, "Well, let me use that record's space for the next record I want to create." BUT in order to do that, Access FIRST has to decide if it will fit. I'm going to belabor the point because that is necessary to understand WHY that action is done the way it is done and not some other way.

If the record had ANY field of type Short Text, that is a variable-length data item that is not stored in a slot in the record. Instead,, the field contains a text descriptor that points to the end of the fixed portion of the record. ALL variable-length fields are packed that way. A record is all fixed-length portions followed by all variable-length portions, with pointers in the fixed area pointing to the variable elements.

If you have a field that is Text(255) but you only store an 80-byte string, the record will be written with only enough bytes for that 80-byte string. Remember I said Access "packed" everything tightly? This is a case where that becomes significant.

So... the question is, why doesn't Access just re-use that space? Access can see this deleted record that it COULD re-use - but the new record to be written might have an 81-byte string in that text field - and it wouldn't fit in a space where the prior record used only 80 bytes.

This means you have two scenarios - one where you go LOOKING for a slot that would fit and the other where you just say "the heck with it, put the next record at the next free spot above the highest currently written record." If you have done a LOT of deletions, that "search for a free spot" would require a massive loop looking for a big enough spot and might seriously slow down the whole process - particularly since Access runs in a VIRTUAL address space, part of which might have been paged out of memory. To search the address space for a suitable slot, you would page-fault the poop out of your system to go finding that open space. (I've personally seen it happen with a different system.)

If the file is being shared, you can't "shuffle" the memory to squeeze down the virtual space because of all the OTHER pointers you would have to move and the other people who might have copies of those pointers in THEIR virtual address space. So that deleted record simply gets "marked for delete" and left in place. That means you now have unused space in the virtual address range. But as long as you don't have 1.99 GB in use, you can keep going.

The OTHER action that will get you is UPDATE queries. Here, I have to guess because I don't know the answer exactly. Updating a record in-place is possible if you don't change the size of the record. Access could easily decide whether the updated record will fit by overlaying the new data. And in such cases, it might do that. Here, the test is either "will fit" or "won't fit" - not a lengthy search but just a yes/no test. And if the update extends the size of that Short Text field I mentioned? The answer is "won't fit" so the UPDATE has to create the new record at the top of the address space and then (invisibly) delete the old record. The original record becomes just another deleted record that cannot be re-used.

This explains two things - bloat and the fact that internally, you cannot rely on the order of appearance of records in tables that have undergone updates. Since updates cannot reclaim the space used and don't shuffle pointers around, the order of appearance for an updated record changes from "middle of the list" to "end of the list." This is why we always suggest having an ORDER BY in any query. Without it, you never know what you will get on the next recordset.MoveNext operation.

That leaves the question of BLOAT - the fact that your database has undergone a lot of deletions and updates and therefore has a lot of deleted records taking up unusable space. This is where COMPACT & REPAIR comes in.

The C&R creates a NEW, EMPTY database. Then it steps through every collection and identifies whether each member is current or marked for deletion. It skips the marked entries and just stores the current entries. It does so in a way that always puts the next copied item at the end of the collection list being copied. When C&R is done, all "marked for delete" items are gone and what is left was allocated from the bottom of that high-address 2GB space towards the top. And all is like a sardine can again. The new copy (the output of the C&R) is renamed to replace the original copy, which is deleted entirely. And as far as you can tell, the DB file is neat and spiffy and ready to go.

The_Doc_Man · Feb 24, 2019

Having provided the explanation, here is the actual answer:

To avoid bloating, do things in a way that avoids deletion and updating in the database file that you want to keep.

Mapping through Excel would avoid bloating your database file because you wouldn't delete anything, just do inserts.

Mapping a temporary and therefore disposable 3rd file would avoid bloating your database because you are going to dispose of the 3rd file when done, and file deletion gets rid of bloating completely - for the disposed file.

jaryszek · Feb 25, 2019

thank you The_Doc_Man for great explanation.

"This is why we always suggest having an ORDER BY in any query." Do you add order feature to each query for example for PK ?

so answer is - 2 methods are correctly. i think easier to implement is linking tables inside one database.

still if i will be deleting umtached records my database will be bloating but slower than i would import whole tables and delete them...i think ;-)

Best wishes,
Jacek

Linking table from external database vs internal database performence

jaryszek

Registered User.

theDBguy

I’m here to help

The_Doc_Man

Immoderate Moderator

The_Doc_Man

Immoderate Moderator

jaryszek

Registered User.

Similar threads

Users who are viewing this thread