Optimise Large Array Loading From .xls Workbooks

twoplustwo

Hi guys,

Hope everyone is well :)

I have written a function in Access 2003 to import 3 large workbooks (16 columns, ~120K rows total).

At the moment I open each workbook, write its contents to an array after running a few fields through some parameters, and then write the results to a table in the database.

My current array-writing process is the following:

Code:
'read the worksheet one cell at a time
For iRow = 1 To iRows
  For iCol = 1 To iCols
    varrInputData(iRow, iCol) = wsData.Cells(iRow, iCol).Value
  Next iCol
Next iRow

This seems to take a long while. Writing the array to the db is fast enough.

Is this the optimal way to achieve what I'm trying to do?
 
I don't see the answer to this question so I'll ask it, then answer your question based on an assumption.

How did you declare varrInputData? Did you include explicit dimensions?

I'm thinking that if you free-form the declaration (Dim varrInputData As Variant), what is happening is that the array has to be expanded dynamically as you fill it. If there is a way for you to make this a two-pass algorithm, so that you first determine the final number of required rows and columns and then declare the variant array with explicit dimensions, you will allocate the array monolithically, in one go. Further, if you know the data type is STRING and you know how big to make the strings, you could allocate them as a fixed-size structure and then just populate the structure as you go.

I suspect that what is happening to you is more or less like this:

You allocate the variant. You start populating it. Because the default of the variant is some limited size, you quickly outgrow that size. Also, if you are adding strings, the variant has to dynamically allocate string buffers from the Access heap. So there you are, adding strings right and left, and suddenly you run out of allocated virtual memory. So Windows has to stop your process, do some swap magic, allocate more space in the swap file, swap you out, and bring you back into memory bigger (in virtual terms) than you were before. So you do some more importing ... and run into the same problem. It is the dynamic expansion that (I think) is eating your socks.

The way around that is to pre-allocate everything that CAN be pre-allocated. The point of using a structure (a user-defined Type) is that strings declared inside it can be given FIXED lengths (String * n), so when you allocate an array of such structures it is a single fixed-size action that does not involve the heap in the same way the "unstructured" case does. So if you can determine the maximum required string size and the dimensions of your matrix, you can do a single allocation and thereafter simply do memory-only operations to store your data.
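
Something along these lines is what I mean - a sketch only; the Type name, field names and string lengths are made up rather than taken from your workbooks:

Code:
'illustrative only - the field names and sizes here are assumptions
Public Type tImportRow
    strField1 As String * 50    'fixed-length strings live inside the structure
    strField2 As String * 50
    dblAmount As Double
    dtmEntry As Date
End Type

Public Sub PreAllocateDemo(ByVal lRowCount As Long)
    Dim atRows() As tImportRow
    'one allocation up front - nothing grows while you populate it
    ReDim atRows(1 To lRowCount)
End Sub

The point is that the whole thing is sized once, before the first row ever goes in.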

I don't know if that is going to be helpful, but if that is NOT what is happening, then I have no other suggestions to make.
 
Hi Doc,

In the function that writes the array:

Code:
Dim varrInputData() As Variant
ReDim varrInputData(1 To iRows, 1 To iCols)

Where iRows is the changing number of records per workbook.

So, I loop through each workbook in a folder in a sub-routine and call the function to write the contents to the array. I have a mixture of data types in the data itself - some strings, some integers and a few dates for good measure.

Thanks for the very detailed explanation - interesting (and new territory for me).

I guess an option might be to TransferSpreadsheet into temp tables, run the algo over the data, write to the output table and delete the temp data. The thing is, I have some null values etc. in the data and I'm not sure whether they'd transfer correctly. I wanted something quicker/cleaner than that (considering the need to compact the db after the import, etc.), but it might be the only option.
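
For what it's worth, the sort of thing I have in mind would be roughly this - the table name and path are just placeholders, not my real ones:

Code:
'rough sketch only - "tblTempImport" and the path are placeholders
DoCmd.TransferSpreadsheet acImport, acSpreadsheetTypeExcel9, _
    "tblTempImport", "C:\Imports\Book1.xls", True
'... run the algo against tblTempImport and write to the output table ...
DoCmd.RunSQL "DELETE * FROM tblTempImport"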

PS When you say "dynamic expansion" do you mean the resizing of the array for each workbook or adding each value separately?

Thanks very much again.
 
You could always try declaring a Range object and loading the entire range into your array in one go.

Code:
Dim rng As Range
Dim var() As Variant
'first element = row, second = column
ReDim var(1 To 10, 1 To 10)
Set rng = ActiveSheet.Range("A1:J10")
var = rng.Value 'one bulk assignment pulls the whole range across
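
In your case you could size the range from the sheet instead of hard-coding it - something like this, using wsData and varrInputData from your earlier posts (the last-row lookup is just one way to do it, and it assumes column A is always populated):

Code:
Dim lLastRow As Long
'find the last used row in column A - assumes column A always has data
lLastRow = wsData.Cells(wsData.Rows.Count, 1).End(xlUp).Row
'one call pulls the whole block into the variant array
varrInputData = wsData.Range(wsData.Cells(1, 1), wsData.Cells(lLastRow, 16)).Value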
 
Morning DJ,

The above takes ~3 secs to load 27K records :)

Thanks very much.

Do you have any idea why it's so much faster than the method in the OP?
 

Glad it works for ya :D

I'm just guessing, but I think it has to do with the number of calls made to the Excel COM object. When you use the Sheet.Cells(x, y) method, every hit is a call out to Excel, which behind the scenes has to marshal several internal functions to retrieve the data for that one cell and then return it to you. If you have 27K rows by 16 columns, that's 432,000 individual calls to Excel. Using the range object you are essentially making one call to Excel and telling it to give you all the values at once.
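
Purely for illustration, a rough timing harness along these lines would show the gap - it uses the variable names from your first post and assumes varrInputData is already dimensioned to (1 To iRows, 1 To iCols):

Code:
Dim sngStart As Single

sngStart = Timer
For iRow = 1 To iRows
    For iCol = 1 To iCols
        'one COM call per cell
        varrInputData(iRow, iCol) = wsData.Cells(iRow, iCol).Value
    Next iCol
Next iRow
Debug.Print "Cell by cell: " & Format(Timer - sngStart, "0.00") & " secs"

sngStart = Timer
'one COM call for the whole block
varrInputData = wsData.Range(wsData.Cells(1, 1), wsData.Cells(iRows, iCols)).Value
Debug.Print "Bulk range:   " & Format(Timer - sngStart, "0.00") & " secs"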
 
djkarl, great catch. Process-switch overhead is not trivial. And my comments about resizing the array dynamically are taken care of within that single call, because the sizing is handled inside the function that services it.

There is also the fact that in any VBA loop you are dealing with pseudo-compiled code (a.k.a. semi-compiled, or semi-interpreted). Pulling the whole range across in one call hands the work to code that is compiled "all the way" to machine code, because it is part of the COM interface.

When you say "dynamic expansion" do you mean ...

Dynamic expansion works like this. A variant can become an array automatically - look at the SPLIT function, for example; you don't have to pre-dim its result as an array. But when you start adding elements, the space behind that array was allocated based on sizing assumptions. Once you blow past those assumptions, Access can either barf or try to resize the space allocated to the variable. And because you didn't give it an explicit size, it STILL can only guess how big to go, so it makes another assumption - which works a while longer, until you blow past THAT limit too.

This goes on every so often, with Access having to stop, allocate some more dynamic memory for you, and then resume your processing. If at some point the resizing forces Access to ask Windows for more memory because your initial allocation blew out, that is a different level of dynamic expansion that can ALSO occur. What you want to avoid is letting any of that be dynamic.
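
To make that concrete - a made-up illustration, not your code - compare growing an array element by element with sizing it once:

Code:
'made-up illustration of the two allocation patterns
Dim varGrow() As Variant, varOnce() As Variant
Dim i As Long

'pattern 1: grow as you go - every ReDim Preserve has to copy the whole array
For i = 1 To 10000
    ReDim Preserve varGrow(1 To i)
    varGrow(i) = i
Next i

'pattern 2: size it once up front, then just fill the slots
ReDim varOnce(1 To 10000)
For i = 1 To 10000
    varOnce(i) = i
Next i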

When you are adding strings, a third type of dynamic expansion is required. A variant can be very small if it holds something that resolves to a byte, integer, or long; even a double or date isn't terribly bad. But a variable-length string requires allocation of a string buffer, because that string could get pretty big. Like most variable-length strings in Access, I believe a variant doesn't "commit" its size until it knows the type of value it is about to accept, so it may have to allocate that string buffer as a separate expansion. That will not change the array size, but it DOES take memory from the Access heap storage. That, in turn, affects the virtual memory that Windows has to allocate, so again a dynamic task-resizing might occur.

Just for S&G, you might compare your old and new methods by calling up Task Manager's active task list and watching the size of Access. Then kick off your import process and watch the fireworks. If I'm right, the old method will get huge slowly, whereas pre-allocating and doing the things suggested by djkarl will expand quickly, almost in a single jump. If that is true, then your slowdown was the combination of semi-compiled code and incremental size expansion, which was making Windows work very hard to service your needs.
 
