Question about field size

@The_Doc_Man - so what would that look like with two strings? And what about the effect on the index for a string?
 
@The_Doc_Man - thanks for that. But the area pointed to by the X pointer(s) gets changed/replaced as the length of the required storage changes, and you get a fragmented string storage area that gets defragmented by a C&R operation. Do new records get inserted after the original string area, or does that work differently also?
 
Whenever you update a record, the rules say that you COPY the new i.e. updated record to a different place (first), then you diddle with the pointers that link the old record (to remove it), then you link in the new record.

NOTE that it HAS to be this way since in a shared DB, when you open a query, you "lock in" the records. You would put this together by looking up "Order of SQL execution" (a.k.a. "Order of SQL operations"). You do the FROM first, so JET/ACE has to make a list of the records to be involved in the query. So up front, you immediately have a pointer to the record. While YOUR query is active on its next phase (the WHERE clause followed by the SELECT clause), if someone comes along and issues the UPDATE and you didn't invoke the "See Changes" option, both records co-exist. The old copy is in your query's list. Somewhere in the process depending on options, you might get the warning about "another user updated..." The NEW copy of that record is in the table's list of records after the completion of the UPDATE.
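That copy-then-relink sequence can be sketched as a toy model. This is a hypothetical singly linked record list in Python, not actual JET/ACE internals; the point it illustrates is that an open query holding a pointer to the old record still sees the pre-update data:

```python
class Record:
    def __init__(self, data):
        self.data = data
        self.next = None

class Table:
    """Toy model: records in a singly linked list; updates copy-then-relink."""
    def __init__(self):
        self.head = None
        self.tail = None

    def append(self, rec):
        if self.tail is None:
            self.head = self.tail = rec
        else:
            self.tail.next = rec
            self.tail = rec

    def update(self, old, new_data):
        # 1. COPY: write the updated record to a fresh location
        new = Record(new_data)
        # 2. Thread the new record at the end of the list
        self.append(new)
        # 3. Unthread the old record; a reader holding a pointer to
        #    `old` (an open query) still sees the pre-update data
        prev, cur = None, self.head
        while cur is not old:
            prev, cur = cur, cur.next
        if prev is None:
            self.head = cur.next
        else:
            prev.next = cur.next
        return old  # the stale copy still exists until compacted

t = Table()
a, b = Record("A"), Record("B")
t.append(a); t.append(b)
stale = t.update(a, "A2")   # update record A; new version goes to the end
rows = []
cur = t.head
while cur:
    rows.append(cur.data)
    cur = cur.next
```

After the update, a fresh walk of the table yields B then A2, while the unthreaded original A is still reachable through the stale pointer — exactly the "both records co-exist" situation described above.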
 
@The_Doc_Man - so what would that look like with two strings? And what about the effect on the index for a string?

I will answer that two-parter out of order. I believe that the index pre-allocates the space at the fixed size as declared, in a hidden table that IS the index, with two significant fields - the value of the index and a pointer to the start of the record to which that index refers. However, I am not at all sure about that. It IS possible to do it with sparse allocation, since string comparison rules allow for uneven-length comparisons. Which would mean that the index ALSO uses short strings. This one, I cannot answer with any certainty.
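If the index really is a hidden two-column table of (key value, record pointer), a minimal model might look like the sketch below. This is pure speculation mirroring the description above, and the pointer values are made up:

```python
import bisect

class Index:
    """Hypothetical model: a sorted list of (key, record_pointer) pairs."""
    def __init__(self):
        self.entries = []  # kept sorted by key

    def insert(self, key, record_ptr):
        # keep the hidden table ordered so lookups can binary-search
        bisect.insort(self.entries, (key, record_ptr))

    def seek(self, key):
        # binary search for the first entry matching `key`
        i = bisect.bisect_left(self.entries, (key,))
        if i < len(self.entries) and self.entries[i][0] == key:
            return self.entries[i][1]
        return None

ix = Index()
ix.insert("Smith", 0x1F40)   # pointer values are invented for the demo
ix.insert("Adams", 0x0A00)
ix.insert("Jones", 0x1200)
```

Seeking "Jones" returns its record pointer; seeking a value that was never inserted returns None. Whether the real index stores fixed-width or short-string keys is the open question from the post above.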

Regarding the FIRST part of your question: With two strings you would have TWO size/pointer fields, with the first pointer to point {X} and the second pointer to point {Y} which would follow the end of the actual end of the string that started at point {X}. Let's do a simpler record with autonumber (which is a LONG), string 1, string 2, and a date.

Code:
|LONG|string1-size&pointer-to-{X}|string2-size&pointer-to-{Y}|DATE|{X}string1-actual|{Y}string2-actual|

In this diagram as in the other one, the vertical bars are for YOUR VISUAL distinction; they are not in the record, because the contents are picked out using the equivalent of a MID function except that it isn't only for text strings, it is for any byte stream. Neither {X} nor {Y} takes up space; they are merely visual aids to show you how & where things align. So the fixed portion of the record would be 4 + 10 + 10 + 8 bytes (but no "|" as actual delimiters), and the variable portion's length would be LEN(string1) + LEN(string2) - again with no "|" delimiters. And the actual LEN() values would be in the size indicators of each 10-byte descriptor.
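Under those assumptions, the layout can be mocked up with Python's `struct` module. The field widths here are guesses: the 10-byte descriptor is arbitrarily split as a 2-byte length plus an 8-byte offset, and the strings are stored as UTF-16-LE (2 bytes per character), none of which is confirmed JET/ACE format:

```python
import struct

def pack_record(autonum, s1, s2, date_float):
    """Sketch of a |LONG|desc1|desc2|DATE|string1|string2| record."""
    b1 = s1.encode("utf-16-le")
    b2 = s2.encode("utf-16-le")
    fixed = 4 + 10 + 10 + 8          # LONG + two descriptors + DATE
    x = fixed                        # {X}: string1 starts right after the fixed part
    y = fixed + len(b1)              # {Y}: string2 follows the end of string1
    return (struct.pack("<i", autonum)          # LONG, 4 bytes
            + struct.pack("<HQ", len(b1), x)    # descriptor 1: size + pointer-to-{X}
            + struct.pack("<HQ", len(b2), y)    # descriptor 2: size + pointer-to-{Y}
            + struct.pack("<d", date_float)     # DATE as an 8-byte double
            + b1 + b2)                          # the variable-length tail

rec = pack_record(1, "abc", "defgh", 45123.5)
```

The fixed portion is 32 bytes, and the strings can be recovered with the MID-style slicing the post describes: the descriptor's offset and size pick the substring straight out of the byte stream.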

This is my opinion based on lots of reading of various articles over 20+ years of Access work, PLUS reading how other Windows API calls work. Even if I'm wrong, I'll say that it should not be much different from this, because even Microsoft wants SOME consistency in their lives. Stated another way, based on having been involved in computing since 1966 in either student or professional capacity, if I implemented Access, this is close to how it would look based on all of the "best practices" books I have read over that time period. And yes, that is 57 years with computers.
 
To amplify Pat's comments about optimization of record placement, I know this for an absolute fact: In older ORACLE databases for mainframes, they have a thing called a "bucket" which holds several disk blocks. The disk's physical geometry becomes a consideration in optimizing layout because they want the buckets to never cross a cylinder or track boundary. So if you have, say, 24 blocks per track/cylinder combo, then bucket sizes can be 4, 6, 8, or 12. (They didn't, at the time, like buckets bigger than 12 blocks.) The point of this is that they work on the contents of a whole bucket at a time. When storing records initially, they never let a record cross a bucket boundary. Which means some limited slack space at the end of the bucket. Later, during updating if they could rewrite the bucket with a longer record and still not cross into another bucket, they would do that. But if the update exhausts the slack space, they have to do what was called a bucket split to accommodate the longer record by making one bucket become two buckets.
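A bucket split of that kind can be sketched like this. It is heavily simplified (real ORACLE behavior involved disk geometry and block addressing); the sketch only models the capacity check and the one-bucket-becomes-two step:

```python
def add_record(buckets, i, rec_blocks, capacity):
    """Add a record of `rec_blocks` blocks to bucket i; split on overflow.

    Records never cross a bucket boundary, so an insert that would
    exceed `capacity` forces one bucket to become two (a bucket split).
    Returns True if a split happened.
    """
    bucket = buckets[i]
    if sum(bucket) + rec_blocks <= capacity:
        bucket.append(rec_blocks)          # fits in the slack space
        return False
    half = len(bucket) // 2                # move the back half to a new bucket
    buckets.insert(i + 1, bucket[half:] + [rec_blocks])
    buckets[i] = bucket[:half]
    return True

buckets = [[4, 4, 3]]                    # one 12-block bucket, 1 block of slack
split1 = add_record(buckets, 0, 1, 12)   # fits in the slack -> no split
split2 = add_record(buckets, 0, 2, 12)   # exhausts the slack -> bucket split
```

The first insert consumes the remaining slack; the second cannot fit, so the bucket splits in two, which is exactly the update-exhausts-slack-space case described above.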

So Pat's comments about how things are stored in buckets is absolutely SPOT ON for the case of older ORACLE databases. I have to admit I have lost track of the newer ones, and of course for an SSD, device geometry has no meaning. But there were times in the "good old days" where folks even wrote things in a specific scatter pattern that was designed to optimize both rotational and seek-arm speeds.
 
I got interrupted.
I just wanted to thank everyone above who presented helpful information about some inner workings of Access. Very helpful in general. :D
 
Many thanks for everyone's comments

Had a few minutes so thought I would do a quick test to see how the db size changed based on different setups

  1. Created a new db, added a table with a single 6 char text field, unindexed and compacted - 356kb
  2. added 1m random 6-character text records - 19,068kb
  3. compact/repaired - 19,056kb
  4. added a non-unique index - took about a second - 30,688kb
  5. compact/repaired - 30,684kb
  6. changed field size from 6 to 200 - took several seconds - 54,740kb
  7. compact/repaired - 30,684kb
  8. removed index - 30,688kb
  9. compact/repaired - 19,052kb
  10. changed field size from 200 to 100 and added back the index - took less than a second - 43,112kb
  11. compact/repaired - 30,684kb
Summary
file sizes at steps 3 and 5 indicate records are added consecutively as would be expected, so not much compacting to do

step 3 - simple maths based on the 18,700kb file size increase indicates a 6 character string takes 18 or 19 bytes. The rule is (I think) 10 bytes + 2 bytes per char = 22 bytes - the maths is pretty basic, but the file size increase is broadly in line with expectation

step 5 - the increase from step 3 to step 5 is 11,628kb, implying the index takes less space than the actual value even though a pointer is also required. The maths implies only the characters are stored in the index, not the 10-byte overhead.
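The arithmetic behind steps 3 and 5, using the KB figures above (the "10 bytes + 2 bytes per char" rule is the poster's assumption, not a documented format):

```python
RECORDS = 1_000_000

# Steps 2/3: table growth for 1m six-character strings
data_growth_kb = 19_068 - 356             # populated file minus empty file
bytes_per_record = data_growth_kb * 1024 / RECORDS
# ~19.2 bytes/record, below the assumed upper bound of
# 10 bytes overhead + 6 chars * 2 bytes = 22 bytes

# Steps 4/5: index growth on top of the compacted table
index_growth_kb = 30_684 - 19_056
bytes_per_entry = index_growth_kb * 1024 / RECORDS
# ~11.9 bytes/entry -- close to 6 chars * 2 bytes = 12 bytes,
# suggesting the index stores the characters but not the 10-byte overhead
```

Both per-item figures line up with the conclusions drawn in the post: the data rows cost a little under the rule-of-thumb 22 bytes, and each index entry costs roughly the raw character bytes alone.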

step 6 - confirms the comments about the index being rebuilt if the field size is changed (increase in size 24,056kb)

step 7 - demonstrates that indexes appear to take up the same space regardless of the declared field size

step 9 - restored to original non-indexed state - same size as for 6 character limit - to be expected

step 10 - increase in size 12,428kb - approx half of the increase for the 200 char index - to be expected.

step 11 - back to the same situation as at step 7

Conclusions
  • If you change the field size on an indexed field, do a compact/repair to recover the space
  • Indexes take up space related to the field value, not the specified size
  • Indexes (in respect of short text) take up less space than the field value (other datatypes not tested)

Not tested
This test is based on 1m records, each with a 6 char value. Not tested to see whether the index is adjusted to the maximum actual field length when there are varying field lengths. This thought comes because the creation of the 100 char index was significantly faster than the creation of the 200 char index. I might have expected around 50% of the time, not 5 or 10%. An alternative speculation is that there is a one-time overhead on first creation of an index.
 
No, the record is rewritten in the same spot if it is the same size or smaller. Only if it is bigger is it written to the end of the table.

This is not true for at least some cases, and for JET/ACE I think it is never true. I don't think it is EVER true in a shared-access case for ANY database that is ANSI SQL 92 (or later) compliant.

If you consider the case of a shared JET/ACE BE file, let user1 run a SELECT query to include record XYZ; then let user2 come in a couple of seconds later and UPDATE record XYZ. The rules as I understand them say that user1 has a snapshot-in-time of the old record XYZ while the SELECT is still open, whereas user2 has created AND RELEASED a new version. If user1 did the SELECT with a "No locks" option on the query, that can happen and BOTH records must co-exist.

At least in this case, the updated record HAS to be written at the end of the table no matter WHAT size it is because the old record is still in use. And if user1 doesn't attempt to do an update through the SELECT query before the automatic requery time found in the File>>Options>>Current Database section but instead just releases the SELECT (after the UPDATE is complete), user1 might never know that the record was updated while he/she was busy. But for that brief time of overlap, both records have to exist. And the one that sticks around after the closure of the SELECT query is the updated record at the end of the table, which was the only open place for it to go.
 
Can't we just test it with a Hex Editor on the Access file and find out? I tried it with XVI32 on a new monolithic Access DB, and changing the value of a 3-character string in a table resulted in the old value not getting a hit at all, and the new value appearing in the file in the identical decimal position.

I tested that 4 times with consistent results, then changed the 3-character string to 4 characters. Sure enough, the old (latest revision) 3-char value was still a hit at the old decimal position and the new value was a hit in a new position. I tested that twice, resulting in the first 4-char test being overwritten, then changed the 4-character value to a 5-character one, then to a 6.

At that point, the Access file contained the latest 3-character value in its first position, the second test's 4-character value in the position of the first 4-character value, the 5-character test still searchable, and a new position for the 6-character value. The actual table row of course shows only the latest 6-char value. I can't say if the latest enlarged revision is at the 'end' of the actual table - I'd assume it must be - but there's so much gobbledygook in the hex editor window that I can't tell.

I didn't test a shorter string. My quick and dirty test was with an AutoNumber PK and a single non-indexed column defined as 64 characters. I put 100,000 otherwise identical rows in and tested changes on a single row, number 50,000 every time. Access and the Hex Editor closed/re-opened fresh for each test.
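That kind of search can also be scripted rather than done by eye in XVI32. Short Text values are stored as UTF-16-LE, so the trick is to search for the little-endian encoding. The sketch below runs against an in-memory buffer so it is self-contained; the commented-out file path is hypothetical:

```python
def find_all(haystack: bytes, text: str):
    """Return every byte offset where `text` appears, UTF-16-LE encoded."""
    needle = text.encode("utf-16-le")
    hits, pos = [], haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits

# With a real database you would read the whole file:
#   data = open("test.accdb", "rb").read()   # path is hypothetical
# Demo buffer: "abc" at offset 16, then filler, then "abcd" at offset 30
data = (b"\x00" * 16 + "abc".encode("utf-16-le")
        + b"\xff" * 8 + "abcd".encode("utf-16-le"))
old_hits = find_all(data, "abc")
```

Re-running the search after each edit and diffing the offset lists would show exactly the behavior reported above: the old value still hitting at its old decimal position, and the enlarged value appearing at a new one.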
 
I imagine if a changed record was written in place you would see the data change. If you write the record at the end of the current file, then the original record might not be cleared. It would (or could) just be left in position with the original data, but there would be no pointer to it. Maybe the space couldn't actually get reused until after a compact operation.
 
I imagine if a changed record was written in place you would see the data change. If you write the record at the end of the current file, then the original record might not be cleared. It would (or could) just be left in position with the original data, but there would be no pointer to it. Maybe the space couldn't actually get reused until after a compact operation.

Dave, I totally agree with that viewpoint. I think you have correctly described what actually goes on. The problem that many of us have is that different engines do different things. I know that ORACLE handles updates differently from JET/ACE. In the ORACLE case, I believe that Pat's view is correct because ORACLE explicitly includes a "slack space" parameter in their installation options and when creating a new DBF container file. With slack space, you CAN overwrite a record during an update because you know how much room for overlap you have. But Access doesn't, to my knowledge, have slack space. It takes the "sparse" approach, i.e. store only the bare minimum to get the job done. Particularly because of the way that Access limits DB space to 2 GB, allowing slack space would be counterproductive.
 
Hmm. I was thinking that if you looked at a database file with a hex editor, you might be able to inspect the physical order. Therefore you might see both the old data which is now no longer pointed to and can now be recovered, and the new record which has just been created, but I'm not really sure.
 
I will agree that with a good hex editor, you could probably decipher something meaningful out of an Access database file.

However, please be aware that in general it is a violation of proprietary software End User License Agreements to reverse-engineer their products. It is unclear whether MS could claim ownership of the database file itself, but they probably COULD take legal action over publishing proprietary information. The potential lawsuits are probably more trouble than it is worth.

Please also note that I have to make that disclaimer in the forum, not privately but publicly, in order to protect forum interests. The forum members can do what they want, but acting in my capacity as a moderator on the Access World Forums, I want to have it on record.

The forum officially doesn't condone reverse-engineering proprietary formats.

Jon didn't put me up to saying this... I learned this rule in the Navy as a contractor. Jon, you can thank me or spank me later.
 
I agree with your official statement. However, if some folks hadn't poked around, we would not have the tools we have that can take advantage of the internal structure. MS certainly never bothered to create them for us.

If it weren't for Access being Microsoft's red-headed step-child, we would HAVE a lot of those tools. But it is a legal mine-field to publish too much despite the potential utility of such publication.
 
It was more just clarifying how the data changed, in the light of our discussions. If you only have 2 records in the database, and then edited the first record, would you now see 3 records, or would you still see 2. Just academic interest.
 
Dave, to the best of my understanding, if you updated a record (and for the sake of argument, did so in a way that made a string field longer by 1 or more bytes), then this is what would happen.

JET/ACE would make a new, free-standing record that is not yet part of any table, then would copy fields one at a time in fielddef order, probably applying the update to the string field of this record in passing. It would put this record at the first free spot in its current virtual memory area.

Next, it would perform a transaction style operation where both of the next two steps make it or neither makes it.
  • Thread the new, updated record at the end of the list of records that are in the table
  • Unthread the old record and close up any opened links needed to rejoin the records that had preceded or followed the now-removed record

At this point, you see the same number of records you started with but they are in a different internal order with the updated record apparently at the end of the list of table records. There is now an unthreaded record "floating" in the memory that was formerly part of the table, contributing to bloat until such time as the next Compact & Repair does its thing.

Is that what you were asking with your question?

As food for thought, considering corruption issues, if you have a power glitch between the point where you were unthreading the old record and rethreading logically adjacent records, this is when a table gets corrupted, because the links between some of its members are now broken.
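That corruption scenario can be illustrated with a toy doubly linked list, where unthreading a record takes two pointer writes; a crash between them leaves the chain inconsistent. A hypothetical model, not actual JET/ACE structures:

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.prev = None
        self.next = None

def link(*nodes):
    """Chain nodes into a doubly linked list and return them."""
    for x, y in zip(nodes, nodes[1:]):
        x.next, y.prev = y, x
    return nodes

def unthread(node, crash_between_writes=False):
    """Remove `node`: two pointer writes that must BOTH happen."""
    if node.prev:
        node.prev.next = node.next   # write 1
    if crash_between_writes:
        return                       # power glitch: write 2 never happens
    if node.next:
        node.next.prev = node.prev   # write 2

a, b, c = link(Node("A"), Node("B"), Node("C"))
unthread(b, crash_between_writes=True)

forward = []
n = a
while n:
    forward.append(n.data); n = n.next
backward = []
n = c
while n:
    backward.append(n.data); n = n.prev
```

After the simulated glitch, a forward scan skips B but a backward scan still finds it: the two directions of the chain disagree, which is the kind of broken-link state a Compact & Repair has to untangle.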
 
@The_Doc_Man

At this point, you see the same number of records you started with but they are in a different internal order with the updated record apparently at the end of the list of table records. There is now an unthreaded record "floating" in the memory that was formerly part of the table, contributing to bloat until such time as the next Compact & Repair does its thing.

Yes, but I assume the now unlinked "floating" record is still in the disc file, and visible to a sequential read of the disc file, but the space occupied by that record might be available to be overwritten. Maybe it's not available to be overwritten and all new records get written to the end of the file, and the space for logically deleted records is only made re-usable by a C&R.
 
The space is not available to be overwritten. Windows-based programs that use their standard dynamic memory routines cannot do that.
 
I don't know if Jet/ACE are multi threaded

There is an old article in the forum where we looked this up. JET/ACE may or may not be multi-threaded, BUT each session's connection IS single-threaded. I recall that I found and linked to an article suggesting that AT BEST the innards of JET/ACE could have a separate internal cleanup thread to police their own working space. I may have to do a deep forum search to find that. However, except for that cleanup thread, Access from end to end IS synchronous unless you use API or addressing tricks to multi-thread it. There is no inherent syntax to multi-thread a native Access database. NOTE: When the back-end is NOT native Access but rather is an active SQL engine of some type, it certainly CAN be multi-threaded. Then it would depend on the vendor.
 
