Fast way to cut a long string

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 03:11
Joined
Feb 28, 2001
Messages
27,303
The most expensive string function is the one that produces the most new strings as variables. Taking a string apart one character at a time and storing each character separately causes the following steps for EACH character: (1) Allocate memory for the string descriptor. (2) Allocate memory for the actual (sub) string to be created. (3) Extract the sub-string into the allocated slot. (4) Update the string-length slot in the descriptor. NOTE that steps 1 and 2 must allocate a new string area each time and cannot re-use previously allocated memory, because each new string is potentially longer than the previous one. (It's a long tale of woe. You don't want to hear it.)

But the worst part isn't the memory allocation. It is the part where you are done with the strings and have to release them. The problem is that, as variables, they are usually stored in the programming structure called the HEAP and occupy that space for the remainder of the session. If you do this enough, you see a message about "HEAP OVERFLOW" or "STACK / HEAP COLLISION" or some other similar message that has changed over the years but still means "you blew out your heap work space." The message you get varies depending on which facility (or even which library) detected the overlap of the stack and heap address ranges.
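Purely as an illustration of the pattern being described (not something you would want to do with a long string), a minimal VBA sketch of per-character splitting might look like this; every pass through the loop forces a fresh string allocation for the one-character result:
Code:
' Hypothetical example: split a string into an array of one-character strings.
Private Sub SplitCharByChar(ByVal strSource As String)
    Dim lngI As Long
    Dim astrChars() As String

    If Len(strSource) = 0 Then Exit Sub
    ReDim astrChars(1 To Len(strSource))
    For lngI = 1 To Len(strSource)
        ' Every Mid$() result here is a freshly allocated string body.
        astrChars(lngI) = Mid$(strSource, lngI, 1)
    Next lngI
End Sub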
 

Galaxiom

Super Moderator
Staff member
Local time
Today, 18:11
Joined
Jan 20, 2009
Messages
12,854
What I do in SQL Server is create a table with the parsing information and Cartesian join it to the string data.

Select the data table key, the field name from the parsing information, and the substring using the Mid function with parameters drawn from the parsing information table. This results in very tall, narrow output with fields for the data key, the field name, and the value: essentially an Entity-Attribute-Value construct.

Then simply Pivot (Cross Tab) to bring the values out into multiple fields.

Access could do the same.

This technique uses the strengths of the engine to process data as sets and is orders of magnitude faster than any RBAR (row-by-agonizing-row) approach that parses the data line by line. I expect the fixed-width import in Access uses a similar technique, and likewise SQL Server's fixed-width SSIS imports.
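A rough sketch of the idea in Access terms, with assumed table and field names (tblRaw holding the raw fixed-width strings, tblParseInfo holding the start positions and lengths): the dynamic SQL is a Cartesian join whose Mid() arguments come from the parsing table, producing the tall EAV output described above.
Code:
' Assumed objects: tblRaw(ID, RawLine) and tblParseInfo(FieldName, StartPos, FieldLen).
Private Sub BuildParseQuery()
    Dim strSQL As String
    Dim rs As DAO.Recordset

    ' Cartesian join: every raw line is paired with every parsing rule,
    ' and Mid() slices out the value for that rule.
    strSQL = "SELECT tblRaw.ID, tblParseInfo.FieldName, " & _
             "Mid(tblRaw.RawLine, tblParseInfo.StartPos, tblParseInfo.FieldLen) AS FieldValue " & _
             "FROM tblRaw, tblParseInfo;"

    Set rs = CurrentDb.OpenRecordset(strSQL)
    If Not rs.EOF Then rs.MoveLast
    Debug.Print "EAV rows produced: " & rs.RecordCount
    rs.Close
End Sub

The pivot step would then be an ordinary crosstab (TRANSFORM ... PIVOT) over that output.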

The parsing data I use for this system includes the output datatype and any conversion formulae needed to arrive at the stored value. This supports a completely generalised system where parsing information is used to generate dynamic SQL, ultimately writing the data to a table or recordset. It is essentially an easy-to-configure, enhanced SSIS substitute.

When I first created it, the server was returning 15,000 records with 16 fields from 128-character-wide strings in less than six seconds. I have not timed it since we upgraded the server last year, but everything else on the new server is far and away faster than on the old one.
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 03:11
Joined
Feb 28, 2001
Messages
27,303
Session meaning the garbage is cleared when the procedure is done, without needing to be released explicitly?
No. When Access exits, not when the procedure exits. The active session cannot garbage collect within itself. However, WINDOWS can reclaim the memory used after the procedure exits. If you want to read more about memory reclamation under Windows, look up "Paging Dynamics" on the web.
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 16:11
Joined
May 7, 2009
Messages
19,246
I made my own test (see the CompareTest sub in Module1),
using records from Table1 and looping for 500 repetitions.
Code:
' The user-defined types below are assumed from the offsets used in the test;
' the actual definitions are in Module1 of the attached dbStructure.accdb.
Private Type typTextLine
    s0 As String * 12
    s1 As String * 12
    s2 As String * 200
    s3 As String * 34
    s4 As String * 10
    s5 As String * 2
End Type

Private Type typAllText
    sText As String * 270
End Type

Private Sub CompareTest()
    Dim src As String, j As Integer

    Dim s0 As String, s1 As String, s2 As String
    Dim s3 As String, s4 As String, s5 As String

    Dim sOut As typTextLine
    Dim sIn As typAllText

    Dim tIn As Variant, tOut As Variant
    Dim db As DAO.Database
    Dim rs As DAO.Recordset

    Set db = CurrentDb
    Set rs = db.OpenRecordset("Table1")

    ' Test 1: slice the memo field with six Mid$() calls per record.
    tIn = Timer
    With rs
        For j = 1 To 500
            .MoveFirst
            Do Until .EOF
                src = !memo
                s0 = Mid$(src, 1, 12)
                s1 = Mid$(src, 13, 12)
                s2 = Mid$(src, 25, 200)
                s3 = Mid$(src, 225, 34)
                s4 = Mid$(src, 259, 10)
                s5 = Mid$(src, 269, 2)
                .MoveNext
            Loop
        Next
    End With
    tOut = Timer
    Debug.Print "2. Mid$ test: " & tOut - tIn & " sec"

    ' Test 2: copy the whole memo into a fixed-length UDT and let LSet
    ' split it into the fields of typTextLine in one move.
    tIn = Timer
    With rs
        For j = 1 To 500
            .MoveFirst
            Do Until .EOF
                src = !memo
                sIn.sText = src
                LSet sOut = sIn
                .MoveNext
            Loop
        Next
    End With
    tOut = Timer
    Debug.Print "1. LSet test: " & tOut - tIn & " sec"
    Debug.Print
End Sub

result:
Code:
2. Mid$ test: 0.8828125 sec
1. LSet test: 0.75 sec
 

Attachments

  • dbStructure.accdb
    1 MB · Views: 76

Galaxiom

Super Moderator
Staff member
Local time
Today, 18:11
Joined
Jan 20, 2009
Messages
12,854
I don't want to use the Mid function because every time you use it the complete string (1500 characters) is needed as input, and then a string is returned from the 13th character with a length of 12 characters.
Then for the second field I need the part from the 25th character over a length of 200 characters.

You can imagine that each time you use the Mid function it starts at the beginning, which is logical. But if I could use a pointer it would already be at the 25th character, which saves time.

Thanks for all the responses. I'll let you know what I come up with.
Relevant to my post #23 in this thread, be aware that the VBA Mid() function and the database engine's Mid() function are not the same animal. They just share the same name. The engine version would be built to process data sets while the VBA version can only deal with one specific string at a time.
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 16:11
Joined
May 7, 2009
Messages
19,246
be aware that the VBA Mid() function and the database engine's Mid() function are not the same animal.
I am very interested in the reference material that points to this.
What functions are "intrinsic" to the SQL language?
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 04:11
Joined
Feb 19, 2002
Messages
43,445
VBA is inefficient doing this expansion.
The reason is that expansion requires a get-memory call. That is extremely costly in the world of programming.

I have to process files produced by a Unix application. They don't bother with silly conventions like records. That means I need to read the entire file as a blob and parse it using VBA. It amounts to millions of characters because the file contains more than a hundred thousand "records". Parts of the records are fixed, so I split off the fixed parts into discrete fields, leave the rest as a memo to be split later, and save each record to a table. Then I use a more detailed process depending on which type of data each record contains. The process takes about 2-3 minutes depending on the record count, so I'm pretty sure that Mid() isn't sluggish.
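For what it's worth, a minimal sketch of pulling such a delimiter-less file into a single VBA string before parsing it (the path argument is whatever your application supplies):
Code:
' Reads the whole file into one string in a single binary read.
Private Function ReadWholeFile(ByVal strPath As String) As String
    Dim intFile As Integer

    intFile = FreeFile
    Open strPath For Binary Access Read As #intFile
    ReadWholeFile = Space$(LOF(intFile))   ' pre-size the string once
    Get #intFile, 1, ReadWholeFile         ' fill it with the raw bytes
    Close #intFile
End Function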
 

ebs17

Well-known member
Local time
Today, 10:11
Joined
Feb 7, 2020
Messages
1,975
I have 90K lines to read and the import functionality doesn't do what I want: skipping a number of lines, etc.
Before starting academic discussions about methods from other programming languages, you should first show how you are importing. It will come as no surprise to those skilled in the art that there are several ways; if you do not fixate on a single method, some problems may never arise in the first place.

Since we are dealing with databases here, the first thing I would try is to treat the import file as a table and implement the import using the standard method (TransferText) or as an append query.

"Skipping a number of lines" ... can be done with a filter if the structure is given, or you could delete disturbing lines in the text file before importing. One cannot avoid looking at the data structure within the text file before choosing a method for analysis and import.

what functions are "intrinsic" to SQL language
You have to make a big distinction as to which SQL dialect you are using.
Jet-SQL has a very narrow language scope. However, this is greatly expanded by the so-called Expression Service, which allows VBA functions, Access functions (Nz, DLookup) and even user-defined functions to be used in the SQL statement. This expression service (probably introduced with Access 2000 / DAO 4.0) only works if the query is executed in Access. So you could do an indirect test: run the query from within Excel and compare which functions no longer work and therefore aren't directly part of the SQL dialect.
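A rough sketch of that indirect test, run from Excel VBA with late binding (the database path is an assumption, and the table/field names are borrowed from the test earlier in this thread): if a function is supplied only by the Access expression service, the ACE provider should reject it here.
Code:
Private Sub TestFunctionOutsideAccess()
    Dim cn As Object

    Set cn = CreateObject("ADODB.Connection")
    cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\test.accdb;"

    On Error Resume Next
    cn.Execute "SELECT Nz(memo, '') AS x FROM Table1;"   ' swap in any function you want to test
    If Err.Number <> 0 Then
        Debug.Print "Rejected outside Access: " & Err.Description
    Else
        Debug.Print "Accepted by the engine itself"
    End If
    On Error GoTo 0

    cn.Close
End Sub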
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 16:11
Joined
May 7, 2009
Messages
19,246
You have to make a big distinction as to which SQL dialect you are using.
I am talking about MS Access.
Don't include VBA functions, just functions "intrinsic" to the SQL query language, like some aggregate functions (Sum(), Avg(), Count(), etc.).
Then what are the others? I don't believe SQL has a Mid() function; it belongs to VBA.
I need documentation, proof, reference material, not just "I think I know..."
 

Minty

AWF VIP
Local time
Today, 09:11
Joined
Jul 26, 2013
Messages
10,372
T-SQL has a large number of in-built string functions.

I'm not sure how many of those would be considered native SQL, though, or would be native to Access's version of SQL.
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 04:11
Joined
Feb 19, 2002
Messages
43,445
Have you investigated the native VBA Open statement?
Of course. The entire file is one big record because there are no intervening CrLf's to delimit individual records. Windows, DOS, and IBM operating systems use CrLf to define individual records. Unix does not.
 

The_Doc_Man

Immoderate Moderator
Staff member
Local time
Today, 03:11
Joined
Feb 28, 2001
Messages
27,303
I have three SQL language references but none of them are specific to ACE SQL. Worse, though, is that many "standard" features are noted as "implementation dependent" - which is the "easy way out" for a given flavor of SQL to have or not have a particular function. I have searched various phrases but this is one of those cases where when searching, "you have to hold your mouth right" to get it to bring up something. I've looked through FMS sites and they don't appear to have anything specific to ACE. I've also tried the "backdoor" method that would let me see what functions the ACE OLEDB driver would allow, but that is remarkably resistant to searches. If I ever find something useful, I'll post it.
 

gemma-the-husky

Super Moderator
Staff member
Local time
Today, 09:11
Joined
Sep 12, 2006
Messages
15,692
While this subject is active, can I add a snippet related to downloading data from a REST API?

The REST API I was using said I could download a PDF of a delivery document by collecting a string and saving that string as a PDF file.

Now, I think this failed in VBA because the REST API string was automatically changed to a Unicode string by VBA, and then the PDF failed to be read correctly. I thought the only way to do this might be to get a C (for example) add-on to download and manipulate the string in a way that did not change it to Unicode. Am I missing something? Is there a way to manipulate a stream of bytes without it becoming Unicode?
 

sonic8

AWF VIP
Local time
Today, 10:11
Joined
Oct 27, 2015
Messages
998
The REST API I was using said I could download a PDF of a delivery document by collecting a string and saving that string as a PDF file.
"by collecting a string" - What is the word "collecting" supposed to mean in this context?
If the API is sending character data (= a string) for a binary file, that character data is most likely Base64 encoded. You need to convert the Base64 string to an array of bytes and then save those bytes to a file.

Is there a way to manipulate a stream of bytes without it becoming unicode?
A stream of bytes, represented as an array of bytes in VBA, will never become Unicode unless something or someone converts it to a string.
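A minimal sketch of that approach (the procedure and its arguments are hypothetical): decode the Base64 text to a Byte array, then write those bytes straight to a file, never passing them through a VBA String.
Code:
Private Sub SaveBase64AsFile(ByVal strBase64 As String, ByVal strPath As String)
    Dim objNode As Object
    Dim bytData() As Byte
    Dim intFile As Integer

    ' Let MSXML do the Base64 -> byte array conversion.
    Set objNode = CreateObject("MSXML2.DOMDocument.6.0").createElement("b64")
    objNode.DataType = "bin.base64"
    objNode.Text = strBase64
    bytData = objNode.nodeTypedValue

    ' Write the raw bytes straight to disk; no String (and so no Unicode) is involved.
    ' (Delete any existing, longer file at strPath first.)
    intFile = FreeFile
    Open strPath For Binary Access Write As #intFile
    Put #intFile, 1, bytData
    Close #intFile
End Sub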
 

cheekybuddha

AWF VIP
Local time
Today, 09:11
Joined
Jul 21, 2014
Messages
2,318
Of course. The entire file is one big record because there are no intervening CrLf's to delimit individual records. Windows, DOS, and IBM operating systems use CrLf to define individual records. Unix does not.
I don't quite follow what you mean here, Pat.

Are you saying it's not applicable because the file has no line delimiters?
 

ebs17

Well-known member
Local time
Today, 10:11
Joined
Feb 7, 2020
Messages
1,975
Pat's file, as presented, had no line breaks, but @Guus2005 has 90K lines and thus line separators.
Those are different things. Maybe it is enough to deal with the original problem, if @Guus2005 still wants to and does the requested work.
 

cheekybuddha

AWF VIP
Local time
Today, 09:11
Joined
Jul 21, 2014
Messages
2,318
Ah, I see. My suggestion was intended for the OP.

But those methods would also be good for Pat's file provided she has a 'map' of the fixed widths, since they both can read byte by byte.

Unix files use LF as their line separator, so you can always split on that too.
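A small sketch of that LF split (the path is just a placeholder), reading the file in one go and splitting on the bare LF:
Code:
Private Sub SplitUnixFile()
    Dim intFile As Integer
    Dim strAll As String
    Dim varLines As Variant

    intFile = FreeFile
    Open "C:\Data\unix_input.txt" For Binary Access Read As #intFile
    strAll = Space$(LOF(intFile))
    Get #intFile, 1, strAll
    Close #intFile

    varLines = Split(strAll, vbLf)   ' Unix line ending: bare LF, no CR
    Debug.Print "Lines read: " & UBound(varLines) + 1
End Sub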
 

Pat Hartman

Super Moderator
Staff member
Local time
Today, 04:11
Joined
Feb 19, 2002
Messages
43,445
Are you saying it's not applicable because the file has no line delimiters?
That is what no CrLf's means. NO record delimiters. The whole file was one humongous record as far as Access was concerned.

Let's not get confused about the actual issue; the fact that I was reading a file is irrelevant. I just wanted to tell you where the huge string came from so I wouldn't have to explain how I ever got to process a million-character string. The point I was making was that I had a HUGE string to parse and, using Mid(), it was not slow. The records were fixed length, so I could just keep an accumulator to keep track of where I was in the string.
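A bare-bones sketch of that accumulator idea; the 128-character record width is only an assumption borrowed from Galaxiom's earlier example, not the actual layout of this file:
Code:
Private Sub ParseFixedRecords(ByVal strAll As String)
    Const RECORD_LEN As Long = 128   ' assumed fixed record width
    Dim lngPos As Long
    Dim strRecord As String

    lngPos = 1                       ' the "accumulator": current position in the big string
    Do While lngPos + RECORD_LEN - 1 <= Len(strAll)
        strRecord = Mid$(strAll, lngPos, RECORD_LEN)
        ' ... split strRecord into its fixed fields and write it to a table ...
        lngPos = lngPos + RECORD_LEN
    Loop
End Sub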
 
