How can I read newly received invoice files without having to re-read the old ones? (1 Viewer)

amorosik

Active member
Local time
Today, 08:02
Joined
Apr 18, 2020
Messages
664
Rather than access code, it's a matter of how to perform a file copy and read procedure without having to reread the files already read
These are XML invoices; they normally arrive in the directory c:\fatture and remain there forever
When the operator decides to acquire them into the management system, he starts the procedure that does the following:
1- copy all invoices from c:\fatture to c:\gest\fatture
2- from c:\gest\fatture, it reads the first file, calculates its hash, and checks whether this value is already in the archive.
3- if it isn't, it pulls the invoice and stores it in the main database.

Given that the c:\fatture directory is untouchable, it must only be read
Since it's the original destination directory, no one can delete, move, or rename any files

Phase 1 is very quick; I used robocopy with the parameter mt=50, which runs multi-threaded and takes next to nothing. It copies 20,000 files totaling 1 GB in a couple of seconds. I'd say it almost saturates the disk's speed.

Phase 2, on the other hand, first passes through all the files and unsigns them if they're signed. Then it starts again from the first and calculates the hash256 to determine whether the content has already been read or not. It also checks whether it already has it in the archive. If it doesn't, it means it's a new file and needs to be processed.

Phase 3 is also quite fast and, in any case, is performed on 10, 20, and 50 invoices that need to be entered into the management system

Well, the system works, but obviously, as time goes by, phase 2 becomes slower, taking up a few minutes
And this isn't good because if there are 20 invoices to acquire, it can't take three minutes, but if it currently needs to calculate The hash of 20K invoices clearly takes a while

Obviously, all this work, done over and over again on the same files, seriously damages the reputation of a good programmer
It's one of those things where "...since it works, let's leave it like this..."
But enough about that; it's time to fix this mess

How would you figure out the difference between the contents of c:\invoices and c:\gest\invoices and start the necessary processing only on the files that make up this difference?
 
Can your app not just check the file timestamps to see which are new since last import?

Even robocopy ought to be able to transfer only new files since last time to make Step 1 even quicker

For Step 2, if you can just unlock new files and hash them to verify they are the same, then import, it should be much quicker than going over all the ones you have done before.

(You said that c:\fatture directory is untouchable so, previously imported files should not change, correct?)
 
Can your app not just check the file timestamps to see which are new since last import?

Even robocopy ought to be able to transfer only new files since last time to make Step 1 even quicker

For Step 2, if you can just unlock new files and hash them to verify they are the same, then import, it should be much quicker than going over all the ones you have done before.

(You said that c:\fatture directory is untouchable so, previously imported files should not change, correct?)

In addition to cheekybuddha's steps, add a table to record the filename(s) imported and the hash. You can then query by hash or filename. If you want auditing, add a column to the imported data table for the hash.
 
This is a RoboCopy command to copy only files with a last access date on or before the given.

c:\pai\ c:\pai\robo *.* /njh /njs /minlad:20251013 > robolog.txt

The logfile output looks like the following


28 c:\pai\
New File 504 AccessUsesEdge.reg
100%
New File 177256 DWGM415.ZIP
73%
100%
New File 12.3 m GonzoV12.accdb
8.0%
16.1%
24.2%
32.3%
40.4%
48.5%
56.6%
64.7%
72.8%
80.9%
89.0%
97.0%
100%
New File 1059 InstallADLicensing.bat
100%
New File 1.6 m MergeM415.PDF
15%
30%
46%
61%
76%
92%
100%
New File 9728 nloader64.dll
100%
New File 372 OldOneDrive.reg
100%
New File 321 OpenVaultExplore.acr
 
You can eliminate the RoboCopy step by using FSO to search the folder for new files.

Quick Google AI function.

Code:
Sub FindFilesByDate()

    Dim fso As Object ' FileSystemObject
    Dim folder As Object ' Folder object
    Dim file As Object ' File object
    Dim targetFolder As String
    Dim searchDate As Date ' The date to compare against

    ' --- Configuration ---
    targetFolder = "C:\Your\Folder\Path" ' Change this to your desired folder path
    searchDate = #1/1/2024# ' Change this to your desired search date (e.g., January 1, 2024)
    ' ---------------------

    ' Create a FileSystemObject
    Set fso = CreateObject("Scripting.FileSystemObject")

    ' Check if the target folder exists
    If fso.FolderExists(targetFolder) Then
        Set folder = fso.GetFolder(targetFolder)

        ' Loop through each file in the folder
        For Each file In folder.Files
            ' Compare the file's last modified date with the search date
            If file.DateLastModified >= searchDate Then
                ' If the file's date is on or after the search date,
                ' you can perform actions here, e.g., print the file name.
                Debug.Print "Found file: " & file.Name & " (Modified: " & file.DateLastModified & ")"
            End If
        Next file
    Else
        MsgBox "Folder not found: " & targetFolder, vbExclamation
    End If

    ' Clean up objects
    Set file = Nothing
    Set folder = Nothing
    Set fso = Nothing

End Sub
 
This method that you describe makes me think back to strategies I have learned, and there is one that I have to bring out. When you are faced with an insoluble problem, change the problem.

You say you can't move, delete, or rename the files. Since you are not allowed by business rules to rename or alter the file contents (and I understand why), can you create a custom file property? I.e. something that ISN'T content, but attribute instead. Because if you could do that, you can create a property called "Imported" which would be YES or NO. Or, since all of the files are of the same general type, see if that particular file type has a usable attribute you CAN set. This file attribute lookup would be cheaper than computing an entire file hash.

RonPaii offers a good solution. If the files have unique names, store the names you have already imported. Particularly if the name is short enough, you can make it indexed, which means that picking out the file name and testing for its presence in the invoice DB would be very fast, too. Your problem is copying the file so you can work on it. But if you can work directly on a read-only file (and the OPEN (file) command DOES allow you to set for READ only), only copy the files you need to copy. Or just work with them directly.

The problem is, as it stands, you don't have a "metric" - as we used to say when I was still in the petrochemical industry. You don't have a marker, a way to know what is new and what is old. Your algorithm has painted you into a corner. Here's the tradeoff. ROBOCOPY is compiled code whereas VBA is pseudo-compiled and therefore emulated. Which means using VBA to separate the old from the new is slower than that copy that SEEMS like it is faster to operate. But if the next step involves computing something driven through VBA code, you just lost the advantage of that fast copy. AND file operations aren't cheap, so I would presume your current method deletes all the files you just copied once you have finished your run. A sequence of copy, scan copied directory, analyze, delete is slower than a scan/analyze pass - particularly when you reach the point that you say your "Phase 2" starts to take minutes. Probably because you are computing hashes one at a time.

Can you create a file in those folders, such as storing a hash file for every file you have imported? Then instead of a fast copy, do a folder scan to see files that DON'T have a stored hash file. The file system object routines include letting you create a folder object for which one of the properties is the collection of files in that folder. You could easily do "FileExists" tests to see if a given invoice XML file has a corresponding stored hash file (or not). What I'm suggesting is providing a metric so that you can verify whether a file has been imported already and thus can be SKIPPED COMPLETELY, thus bypassing that hash-related "identify candidate files" overhead step.
 
Can your app not just check the file timestamps to see which are new since last import?

Even robocopy ought to be able to transfer only new files since last time to make Step 1 even quicker

For Step 2, if you can just unlock new files and hash them to verify they are the same, then import, it should be much quicker than going over all the ones you have done before.

(You said that c:\fatture directory is untouchable so, previously imported files should not change, correct?)

Two files could have the same timestamp, how do you check based on the timestamp?
Yes, you're right, they SHOULD not change
 
This method that you describe makes me think back to strategies I have learned, and there is one that I have to bring out. When you are faced with an insoluble problem, change the problem.

You say you can't move, delete, or rename the files. Since you are not allowed by business rules to rename or alter the file contents (and I understand why), can you create a custom file property? I.e. something that ISN'T content, but attribute instead. Because if you could do that, you can create a property called "Imported" which would be YES or NO. Or, since all of the files are of the same general type, see if that particular file type has a usable attribute you CAN set. This file attribute lookup would be cheaper than computing an entire file hash.

RonPaii offers a good solution. If the files have unique names, store the names you have already imported. Particularly if the name is short enough, you can make it indexed, which means that picking out the file name and testing for its presence in the invoice DB would be very fast, too. Your problem is copying the file so you can work on it. But if you can work directly on a read-only file (and the OPEN (file) command DOES allow you to set for READ only), only copy the files you need to copy. Or just work with them directly.

The problem is, as it stands, you don't have a "metric" - as we used to say when I was still in the petrochemical industry. You don't have a marker, a way to know what is new and what is old. Your algorithm has painted you into a corner. Here's the tradeoff. ROBOCOPY is compiled code whereas VBA is pseudo-compiled and therefore emulated. Which means using VBA to separate the old from the new is slower than that copy that SEEMS like it is faster to operate. But if the next step involves computing something driven through VBA code, you just lost the advantage of that fast copy. AND file operations aren't cheap, so I would presume your current method deletes all the files you just copied once you have finished your run. A sequence of copy, scan copied directory, analyze, delete is slower than a scan/analyze pass - particularly when you reach the point that you say your "Phase 2" starts to take minutes. Probably because you are computing hashes one at a time.

Can you create a file in those folders, such as storing a hash file for every file you have imported? Then instead of a fast copy, do a folder scan to see files that DON'T have a stored hash file. The file system object routines include letting you create a folder object for which one of the properties is the collection of files in that folder. You could easily do "FileExists" tests to see if a given invoice XML file has a corresponding stored hash file (or not). What I'm suggesting is providing a metric so that you can verify whether a file has been imported already and thus can be SKIPPED COMPLETELY, thus bypassing that hash-related "identify candidate files" overhead step.

No, the c:\fatture directory is read-only for me
It's not my stuff, but the program that receives them, which is another business
But after phase 1, the destination directory, c:\gest\fatture, is completely at my disposal and I can do whatever I want with it.
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed
My problem is working incrementally compared to the last time, and to do this, I can't rely on the file names (because they can't be sorted) or on the date/time of the individual files (because today a file with a timestamp from a week ago could arrive).
 
You can eliminate the RoboCopy step by using FSO to search the folder for new files.

Quick Google AI function.

Code:
Sub FindFilesByDate()

    Dim fso As Object ' FileSystemObject
    Dim folder As Object ' Folder object
    Dim file As Object ' File object
    Dim targetFolder As String
    Dim searchDate As Date ' The date to compare against

    ' --- Configuration ---
    targetFolder = "C:\Your\Folder\Path" ' Change this to your desired folder path
    searchDate = #1/1/2024# ' Change this to your desired search date (e.g., January 1, 2024)
    ' ---------------------

    ' Create a FileSystemObject
    Set fso = CreateObject("Scripting.FileSystemObject")

    ' Check if the target folder exists
    If fso.FolderExists(targetFolder) Then
        Set folder = fso.GetFolder(targetFolder)

        ' Loop through each file in the folder
        For Each file In folder.Files
            ' Compare the file's last modified date with the search date
            If file.DateLastModified >= searchDate Then
                ' If the file's date is on or after the search date,
                ' you can perform actions here, e.g., print the file name.
                Debug.Print "Found file: " & file.Name & " (Modified: " & file.DateLastModified & ")"
            End If
        Next file
    Else
        MsgBox "Folder not found: " & targetFolder, vbExclamation
    End If

    ' Clean up objects
    Set file = Nothing
    Set folder = Nothing
    Set fso = Nothing

End Sub

I can't work with the date of individual files.
Today I might receive a file with a creation/modification date that is older than the files already present
 
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed
If you can store the hashes as "indicator" files, half the job of seeking changes is already done.
it's a matter of how to perform a file copy and read procedure without having to reread the files already read
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed
This is the first I'm hearing that these files could change. The middle quote implies that the old files are immutable. Is EACH FILE incrementally updated? That would have been a good thing to emphasize.
 
Two files could have the same timestamp, how do you check based on the timestamp?
Yes, you're right, they SHOULD not change

Storing the Hash of each file will allow quick check of file changes.

Exporting and parsing the RoboCopy log output may be quicker then reading each file in the folder. Each new file line starts with "New File" and ends with the file name. Experimenting with log settings can reduce the size of the log

remove the log header and footer
/njh /njs

Pipe the output to a text file
> robolog.txt

RoboCopy <YourCurrentParams> /njh /njs > robolog.txt
 
The follow RoboCopy line will copy all changed files with XML extension to your output folder and output the log to your output folder.

Code:
RoboCopy c:\fatture to c:\gest\fatture *.XML /XO /ns /nc /np /njh /njs > c:\gest\fatture\RoboLog.txt

RoboLog.txt will look something like the following, listing all new and changed files.

Code:
c:\fatture
             NewInvoice1.XML
             NewInvoice2.XML
...
             NewInvoiceX.XML

You should be able import this list into a temp table to process the new files.
 
When I do imports and exports, I always included a batch number to separate one from the other.
An export to be imported to another system would have the batch number as their reference, or a part of the reference. So easy to see which import. If an import was to an accounts system and imported twice, then a matching batch could be imported to cancel the second import without too much trouble. I have a bank reconciliation system into which I will import OFX data. If I happen to create the OFX for the wrong date range I can delete the batch by batch number and then re-create the OFX for importing. Batch numbers for import/export make life easy.
 
The follow RoboCopy line will copy all changed files with XML extension to your output folder and output the log to your output folder.

Code:
RoboCopy c:\fatture to c:\gest\fatture *.XML /XO /ns /nc /np /njh /njs > c:\gest\fatture\RoboLog.txt

RoboLog.txt will look something like the following, listing all new and changed files.

Code:
c:\fatture
             NewInvoice1.XML
             NewInvoice2.XML
...
             NewInvoiceX.XML

You should be able import this list into a temp table to process the new files.

Yes, I think using Robocopy's parallel processing capability could be a valid reason to use this system
I think I'll keep robocopy's lightning-fast performance and process the log produced for a copy of just the missing files.
 
Last edited:
I haven't read the whole thread. I would tend to either rename a processed file, or move it to a different folder, or even both. Then you only have the new files to consider. In any event, I would still check for duplicate documents in any imported file, as it's quite possible you will receive the same file more than once, by error of some sort. Maybe also store the file name that any imported invoice was loaded from. I like belt and braces always. :D

The object ought to be to process any file once and once only.
 
Hoping this will be helpful to anyone who needs to solve the same problem, I'm writing my experience here for future reference.In the end
I gave up and managed the process using the file name
So, to determine whether a certain file has been imported or not, simply check whether it's present in the c:\gest\fatture directory
For the 'differential copy' from the source directory to the destination, I used the robocopy log
Reading the log returns the files just copied
And with a loop, I scroll through them one at a time and the expected operations are performed
The 'differential copy' of a hundred or so files, starting from a source directory of 10K files, happens practically instantaneously
Then comes the import phase of the individual invoices, and that time is unavoidable
Most of the time previously spent, which involved scanning the files in the source directory one at a time and verifying the hash with those in the archive, has been avoided
The hash verification, however, is only maintained on the files being imported, which are a minority, and only as a precaution to prevent the same invoice from being imported twice
 
Perhaps you could encourage the creator of the files in the source folder to include the hash of the files in the file name when creating or modifying files.

Then you wouldn't need to calculate the hash of the files in the source folder every time, but could take it from the file name.
 
Were this to be me, I'd copy in the contents, process, then mark each file as "Read Only" and "Hidden". That way you can't copy over them and you don't get them in your list next time you run through.
 
If you are able to do "something" to the drop folder, then you can move each file to the read folder after you import it. Of course, you need to have a way of moving it back and deleting the imported data in case you need to reprocess the file for some reason.

This is how I handle bank statements for example. I log the statement and when I append the data to the permanent file, I include the logID so I can physically connect each record to the specific file it came from which makes it easy to reprocess an old file. Things like bank statements have other rules such as they must be processed in sequence and so your log routine needs to handle that AND your back out routine can only backout the newest file. That means if you have to go back three files, you need to back out 3, 2, 1 and reprocess them all if it comes to that.
 

Users who are viewing this thread

Back
Top Bottom