amorosik
This is less a question about Access code and more about how to run a file copy and read procedure without rereading files that have already been read.
These are XML invoices. They normally arrive in the directory c:\fatture and stay there permanently.
When the operator decides to import them into the management system, he starts a procedure that does the following:
1- copies all invoices from c:\fatture to c:\gest\fatture
2- reads each file in c:\gest\fatture in turn, starting from the first, calculates its hash, and checks whether that value is already in the archive
3- if it isn't, extracts the invoice data and stores it in the main database
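To make steps 2 and 3 concrete, here is a minimal Python sketch of the hash check as described. It is only an illustration: the function names, the `*.xml` filter, and the in-memory `known_hashes` set (standing in for the archive lookup) are all assumptions, not the actual procedure.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def find_new_invoices(work_dir: Path, known_hashes: set[str]) -> list[Path]:
    """Hash every XML file in work_dir and return those not yet archived.

    known_hashes is a hypothetical stand-in for the archive; it is updated
    in place so that duplicate content in the same run is reported once.
    """
    new_files = []
    for path in sorted(work_dir.glob("*.xml")):
        digest = sha256_of(path)
        if digest not in known_hashes:
            known_hashes.add(digest)
            new_files.append(path)
    return new_files
```

Note that this sketch still hashes every file on every run, which is exactly the cost that grows with the size of the directory.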
The c:\fatture directory is untouchable and must only be read.
Since it's the original destination directory, no one may delete, move, or rename any files in it.
Phase 1 is very quick. I used robocopy with the /MT:50 option, which runs multi-threaded and takes next to no time. It copies 20,000 files totaling 1 GB in a couple of seconds. I'd say it almost saturates the disk's speed.
Phase 2, on the other hand, first passes through all the files and removes the signature from any that are signed. Then it starts again from the first file and calculates the SHA-256 hash of each one to determine whether that content has already been read, by checking whether the hash is already in the archive. If it isn't, the file is new and needs to be processed.
Phase 3 is also quite fast and, in any case, only runs on the 10, 20, or 50 invoices that actually need to be entered into the management system.
The system works, but as time goes by phase 2 gets slower and now takes a few minutes.
That isn't acceptable: importing 20 invoices shouldn't take three minutes, but as long as the procedure has to calculate the hash of 20K invoices, it clearly takes a while.
Obviously, all this work, done over and over again on the same files, seriously damages the reputation of a good programmer.
It's one of those things where "...since it works, let's leave it like this..."
But enough about that; it's time to fix this mess
How would you work out the difference between the contents of c:\fatture and c:\gest\fatture, and run the necessary processing only on the files that make up that difference?
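One way to frame the question is as a set difference on file names, computed before the copy in phase 1. The Python sketch below is only a starting point under stated assumptions: it assumes a file name uniquely identifies an invoice and that files in the source directory are never modified in place (the post only says they are never deleted, moved, or renamed). The function names are hypothetical.

```python
import os


def names_in(directory: str) -> set[str]:
    """Return the names of the regular files directly inside directory."""
    with os.scandir(directory) as entries:
        return {entry.name for entry in entries if entry.is_file()}


def files_to_process(source_dir: str, work_dir: str) -> list[str]:
    """Names present in source_dir but not yet copied to work_dir.

    Assumption: a name already present in work_dir was handled in an
    earlier run, so only the returned names need to be copied, unsigned,
    and hashed.
    """
    return sorted(names_in(source_dir) - names_in(work_dir))
```

For example, `files_to_process(r"c:\fatture", r"c:\gest\fatture")` would list only the invoices that arrived since the last run. The content hash check can still be applied to that short list to catch the same invoice arriving under a different name.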