How can I read newly received invoice files without having to re-read the old ones?

amorosik

Rather than Access code, it's a matter of how to perform a file copy-and-read procedure without having to re-read the files that have already been read.
These are XML invoices; they normally arrive in the directory c:\fatture and remain there forever.
When the operator decides to acquire them into the management system, he starts a procedure that does the following:
1- copy all invoices from c:\fatture to c:\gest\fatture
2- read the first file in c:\gest\fatture, calculate its hash, and check whether that value is already in the archive
3- if it isn't, acquire the invoice and store it in the main database

Given that the c:\fatture directory is untouchable, it may only be read.
Since it's the original destination directory, no one can delete, move, or rename any of its files.

Phase 1 is very quick; I use robocopy with the /MT:50 switch, which runs multi-threaded and takes next to nothing: it copies 20,000 files totaling 1 GB in a couple of seconds. I'd say it almost saturates the disk's throughput.
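Roughly, the copy step looks something like the line below (the file mask is just an assumption; the job would keep whatever other options it already uses):

Code:
robocopy c:\fatture c:\gest\fatture *.xml /MT:50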

Phase 2, on the other hand, first passes through all the files and strips the signature from any that are signed. Then it starts again from the first file and calculates the SHA-256 hash of each one to determine whether its content has already been read, checking whether that hash is already in the archive. If it isn't, the file is new and needs to be processed.
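For reference, the per-file hash step can be done along these lines in VBA; this is a minimal sketch that assumes the .NET SHA256Managed class is reachable via COM and uses ADODB.Stream to read the bytes, not necessarily what the existing procedure does:

Code:
' Minimal sketch: SHA-256 of a file, returned as a hex string.
' Assumes mscorlib's SHA256Managed is COM-creatable and ADODB is installed.
Public Function FileSha256(ByVal sPath As String) As String
    Dim oStm As Object, oSha As Object
    Dim abData() As Byte, abHash() As Byte, i As Long

    ' Read the whole file as a byte array
    Set oStm = CreateObject("ADODB.Stream")
    oStm.Type = 1               ' adTypeBinary
    oStm.Open
    oStm.LoadFromFile sPath
    abData = oStm.Read
    oStm.Close

    ' Hash the bytes and format as lower-case hex
    Set oSha = CreateObject("System.Security.Cryptography.SHA256Managed")
    abHash = oSha.ComputeHash_2(abData)
    For i = LBound(abHash) To UBound(abHash)
        FileSha256 = FileSha256 & Right$("0" & Hex$(abHash(i)), 2)
    Next i
    FileSha256 = LCase$(FileSha256)
End Function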

Phase 3 is also quite fast and, in any case, only runs on the 10, 20, or 50 invoices that actually need to be entered into the management system.

Well, the system works, but obviously, as time goes by, phase 2 gets slower and now takes a few minutes.
And this isn't good: if there are 20 invoices to acquire, the run can't take three minutes, yet since it currently has to calculate the hash of 20K invoices, it clearly takes a while.

Obviously, all this work, done over and over again on the same files, seriously damages the reputation of a good programmer
It's one of those things where "...since it works, let's leave it like this..."
But enough about that; it's time to fix this mess

How would you figure out the difference between the contents of c:\fatture and c:\gest\fatture and run the necessary processing only on the files that make up this difference?
 
Can your app not just check the file timestamps to see which are new since last import?

Even robocopy ought to be able to transfer only new files since last time to make Step 1 even quicker

For Step 2, if you can just unlock new files and hash them to verify they are the same, then import, it should be much quicker than going over all the ones you have done before.

(You said that the c:\fatture directory is untouchable, so previously imported files should not change, correct?)
 

In addition to cheekybuddha's steps, add a table to record the filename(s) imported and the hash. You can then query by hash or filename. If you want auditing, add a column to the imported data table for the hash.
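A minimal sketch of that duplicate check, assuming an Access back end and illustrative table/field names (tblImportedInvoices with indexed FileName and FileHash columns):

Code:
' Hypothetical table: tblImportedInvoices (FileName TEXT(255), FileHash TEXT(64) indexed, ImportedOn DATETIME)
Public Function AlreadyImported(ByVal sHash As String) As Boolean
    ' True if this hash has already been recorded by a previous import
    AlreadyImported = Not IsNull(DLookup("FileHash", "tblImportedInvoices", _
                                         "FileHash = '" & sHash & "'"))
End Function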
 
This is a RoboCopy command to copy only files with a last access date on or after the given date.

robocopy c:\pai\ c:\pai\robo *.* /njh /njs /minlad:20251013 > robolog.txt

The logfile output looks like the following:

Code:
28 c:\pai\
New File 504 AccessUsesEdge.reg
100%
New File 177256 DWGM415.ZIP
73%
100%
New File 12.3 m GonzoV12.accdb
8.0%
16.1%
24.2%
32.3%
40.4%
48.5%
56.6%
64.7%
72.8%
80.9%
89.0%
97.0%
100%
New File 1059 InstallADLicensing.bat
100%
New File 1.6 m MergeM415.PDF
15%
30%
46%
61%
76%
92%
100%
New File 9728 nloader64.dll
100%
New File 372 OldOneDrive.reg
100%
New File 321 OpenVaultExplore.acr
 
You can eliminate the RoboCopy step by using FSO to search the folder for new files.

Quick Google AI function.

Code:
Sub FindFilesByDate()

    Dim fso As Object ' FileSystemObject
    Dim folder As Object ' Folder object
    Dim file As Object ' File object
    Dim targetFolder As String
    Dim searchDate As Date ' The date to compare against

    ' --- Configuration ---
    targetFolder = "C:\Your\Folder\Path" ' Change this to your desired folder path
    searchDate = #1/1/2024# ' Change this to your desired search date (e.g., January 1, 2024)
    ' ---------------------

    ' Create a FileSystemObject
    Set fso = CreateObject("Scripting.FileSystemObject")

    ' Check if the target folder exists
    If fso.FolderExists(targetFolder) Then
        Set folder = fso.GetFolder(targetFolder)

        ' Loop through each file in the folder
        For Each file In folder.Files
            ' Compare the file's last modified date with the search date
            If file.DateLastModified >= searchDate Then
                ' If the file's date is on or after the search date,
                ' you can perform actions here, e.g., print the file name.
                Debug.Print "Found file: " & file.Name & " (Modified: " & file.DateLastModified & ")"
            End If
        Next file
    Else
        MsgBox "Folder not found: " & targetFolder, vbExclamation
    End If

    ' Clean up objects
    Set file = Nothing
    Set folder = Nothing
    Set fso = Nothing

End Sub
 
This method that you describe makes me think back to strategies I have learned, and there is one that I have to bring out. When you are faced with an insoluble problem, change the problem.

You say you can't move, delete, or rename the files. Since you are not allowed by business rules to rename or alter the file contents (and I understand why), can you create a custom file property? I.e. something that ISN'T content, but an attribute instead. Because if you could do that, you could create a property called "Imported" which would be YES or NO. Or, since all of the files are of the same general type, see if that particular file type has a usable attribute you CAN set. This file attribute lookup would be cheaper than computing an entire file hash.
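As an illustration only, the standard Archive attribute could play that role if you were allowed to touch it (whether you may is exactly the open question here):

Code:
' Illustrative sketch: use the Archive bit as an "imported" flag.
' GetAttr/SetAttr are plain VBA; clearing the bit needs write access to the file's metadata.
Public Function NeedsImport(ByVal sPath As String) As Boolean
    ' Archive bit still set -> not yet imported
    NeedsImport = (GetAttr(sPath) And vbArchive) <> 0
End Function

Public Sub MarkImported(ByVal sPath As String)
    ' Clear the Archive bit once the invoice has been stored
    SetAttr sPath, GetAttr(sPath) And Not vbArchive
End Sub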

RonPaii offers a good solution. If the files have unique names, store the names you have already imported. Particularly if the name is short enough, you can make it indexed, which means that picking out the file name and testing for its presence in the invoice DB would be very fast, too. Your problem is copying the file so you can work on it. But if you can work directly on a read-only file (and the OPEN (file) command DOES allow you to set for READ only), only copy the files you need to copy. Or just work with them directly.
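For what it's worth, reading straight from the source folder in read-only, shared mode can look like this sketch (encoding handling is simplified; the path comes from the caller):

Code:
' Read an invoice file directly, read-only and shared, without copying it first
Public Function ReadInvoiceText(ByVal sPath As String) As String
    Dim iFile As Integer
    iFile = FreeFile
    Open sPath For Binary Access Read Shared As #iFile
    ReadInvoiceText = Space$(LOF(iFile))   ' pre-size the buffer to the file length
    Get #iFile, 1, ReadInvoiceText
    Close #iFile
End Function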

The problem is, as it stands, you don't have a "metric" - as we used to say when I was still in the petrochemical industry. You don't have a marker, a way to know what is new and what is old. Your algorithm has painted you into a corner. Here's the tradeoff. ROBOCOPY is compiled code whereas VBA is pseudo-compiled and therefore emulated. Which means using VBA to separate the old from the new is slower than that copy that SEEMS like it is faster to operate. But if the next step involves computing something driven through VBA code, you just lost the advantage of that fast copy. AND file operations aren't cheap, so I would presume your current method deletes all the files you just copied once you have finished your run. A sequence of copy, scan copied directory, analyze, delete is slower than a scan/analyze pass - particularly when you reach the point that you say your "Phase 2" starts to take minutes. Probably because you are computing hashes one at a time.

Can you create a file in those folders, such as storing a hash file for every file you have imported? Then instead of a fast copy, do a folder scan to see files that DON'T have a stored hash file. The file system object routines include letting you create a folder object for which one of the properties is the collection of files in that folder. You could easily do "FileExists" tests to see if a given invoice XML file has a corresponding stored hash file (or not). What I'm suggesting is providing a metric so that you can verify whether a file has been imported already and thus can be SKIPPED COMPLETELY, thus bypassing that hash-related "identify candidate files" overhead step.
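A rough sketch of that marker-file scan with FSO; the folder path and the ".hash" naming convention are assumptions, not a prescription:

Code:
' List invoices that do not yet have a companion ".hash" marker file
Sub ListUnimportedInvoices()
    Dim fso As Object, f As Object
    Set fso = CreateObject("Scripting.FileSystemObject")
    For Each f In fso.GetFolder("c:\fatture").Files
        If LCase$(fso.GetExtensionName(f.Name)) = "xml" Then
            ' Marker convention: Invoice123.xml -> Invoice123.xml.hash
            If Not fso.FileExists(f.Path & ".hash") Then
                Debug.Print "Not yet imported: " & f.Name
            End If
        End If
    Next f
End Sub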
 
Can your app not just check the file timestamps to see which are new since last import?
(You said that the c:\fatture directory is untouchable, so previously imported files should not change, correct?)

Two files could have the same timestamp; how do you check based on the timestamp?
Yes, you're right, they SHOULD not change
 
Can you create a file in those folders, such as storing a hash file for every file you have imported?

No, the c:\fatture directory is read-only for me
It isn't my stuff; it belongs to the program that receives the invoices, which comes from another company.
But after phase 1, the destination directory, c:\gest\fatture, is completely at my disposal and I can do whatever I want with it.
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed
My problem is working incrementally compared to the last time, and to do this, I can't rely on the file names (because they can't be sorted) or on the date/time of the individual files (because today a file with a timestamp from a week ago could arrive).
 
You can eliminate the RoboCopy step by using FSO to search the folder for new files.

I can't work with the date of individual files.
Today I might receive a file with a creation/modification date that is older than the files already present
 
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed

If you can store the hashes as "indicator" files, half the job of seeking changes is already done.

it's a matter of how to perform a file copy-and-read procedure without having to re-read the files that have already been read
Yes, I could work with the file names, but I'd prefer to also check the contents to make sure they haven't changed

This is the first I'm hearing that these files could change. The middle quote implies that the old files are immutable. Is EACH FILE incrementally updated? That would have been a good thing to emphasize.
 
Two files could have the same timestamp; how do you check based on the timestamp?
Yes, you're right, they SHOULD not change

Storing the hash of each file will allow a quick check for file changes.

Exporting and parsing the RoboCopy log output may be quicker than reading each file in the folder. Each new-file line starts with "New File" and ends with the file name. Experimenting with log settings can reduce the size of the log:

Remove the log header and footer: /njh /njs
Redirect the output to a text file: > robolog.txt

RoboCopy <YourCurrentParams> /njh /njs > robolog.txt
 
The following RoboCopy line will copy all new and changed files with the XML extension to your destination folder and write the log to that folder.

Code:
RoboCopy c:\fatture c:\gest\fatture *.XML /XO /ns /nc /np /njh /njs > c:\gest\fatture\RoboLog.txt

RoboLog.txt will look something like the following, listing all new and changed files.

Code:
c:\fatture
             NewInvoice1.XML
             NewInvoice2.XML
...
             NewInvoiceX.XML

You should be able to import this list into a temp table to process the new files.
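A rough sketch of that import step, assuming a local table named tblNewFiles with a single FileName field (both names are illustrative) and that each data line of the log is simply an indented file name:

Code:
' Load the file names listed in RoboLog.txt into a temp table
Sub ImportRoboLog()
    Dim fso As Object, ts As Object
    Dim db As DAO.Database, sLine As String
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("c:\gest\fatture\RoboLog.txt", 1)   ' 1 = ForReading
    Set db = CurrentDb
    Do While Not ts.AtEndOfStream
        sLine = Trim$(ts.ReadLine)
        ' Keep only the lines that are file names; skip the folder header and blanks
        If LCase$(Right$(sLine, 4)) = ".xml" Then
            db.Execute "INSERT INTO tblNewFiles (FileName) VALUES ('" & _
                       Replace(sLine, "'", "''") & "')", dbFailOnError
        End If
    Loop
    ts.Close
End Sub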
 
