A Database to organise digitised archive information

hogarn

New member
Local time
Yesterday, 23:47
Joined
Feb 9, 2012
Messages
3
I am attempting to create a database to use as a tool for searching through and organising a large number of images of documents as part of an archive digitisation project. The actual physical artefacts, the folders stored in the repository, are being scanned. Each physical folder’s contents, having been scanned, are then placed in a folder on the computer which matches their class-mark in the real world. However, there are two problems with the physical organisation of the archive which are at this stage replicated in the digital version:

1) The organisation of the archive is often incoherent. Correspondence is out of order, or spread across a variety of files. This is in part down to the archive having been through a whole series of different filing systems, and in part due to neglect.
2) The inventory which exists is frequently wildly inaccurate.
The archive contains a range of information, which can be broken into the following categories:
1) Correspondence – between individuals and officials/departments.
2) Minutes of various sorts – often relating to correspondence.
3) Memoranda/notebooks – relating to a diverse range of topics.

The database will need to perform several key functions, from the very simple query to much more complicated processes. However, before getting to that, I will explain the intended mode of data collection, and outline which categories of data, relevant to a potential restructuring of the archive, can be culled from the images.
Due to the frequent occurrence of handwritten items, OCRing would not be a suitable method to digitise the material. Therefore, each image will have to be consulted in turn. Whilst time consuming, this is the only way to ensure accuracy.

The categories of data which will form the basis of the columns in the database’s table are as follows:
- Date sent
- Date received
- Reference (in the form of a code frequently appearing on items)
- Refers to
- Date of items referred to
- Author’s Name
- Author’s Position
- Recipient’s Name
- Recipient’s Position
- Title
- Subject
- Notes
- Class-mark
Included in the table will be a hyperlink to each file. Before the database is complete it simply will not be possible to determine whether a physical reorganisation of the archive will be useful/worthwhile. The digital files will therefore also remain in their current file architecture.

It should also be obvious that many of the above fields will not be relevant for each material type, but in the case of a memorandum on goods prices for instance, date sent simply becomes date on memo and so on. The data entry will endeavour to provide as complete a record as possible, but this will not be possible in every case, leaving blanks in the table.

The database is intended to fulfil several functions:
1) By a simple class-mark query, provide information on all items with that particular class-mark, to facilitate the creation of a new inventory.
2) Permit the retrieval of items through a query on any of the other fields, or a combination of them: for example, all correspondence sent or received by John Smith over the course of a decade, or between John Smith and Peter Brown over the course of a year.
3) Make it possible to use the ‘Reference’/‘Refers to’ and, if possible, the ‘Date sent’/‘Date received’ or ‘Date of items referred to’ fields to reconstruct series of correspondence chronologically. This may not be the same as the process in point two, as two authors may be conversing on a wide range of different topics over any given period, and often simultaneously. It is hoped that the ‘Title’/‘Subject’ fields may be put to use here.
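
As a rough illustration of functions 1 and 2, here is a minimal sketch in Python with SQLite, which stands in for whatever database package is eventually used. The table name `items`, the column names and the sample rows are all invented for the example, and dates are assumed to be entered as ISO text (YYYY-MM-DD) so that they sort correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE items (
    pkid INTEGER PRIMARY KEY, class_mark TEXT, date_sent TEXT,
    author TEXT, recipient TEXT, title TEXT)""")
# Invented sample rows, for illustration only.
conn.executemany(
    "INSERT INTO items (class_mark, date_sent, author, recipient, title) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("ABC 1-2-3", "1921-03-04", "John Smith", "Peter Brown", "Goods prices"),
        ("ABC 1-2-3", "1921-05-10", "Peter Brown", "John Smith", "Re: goods prices"),
        ("ABC 1-2-4", "1934-01-15", "John Smith", "Mary Jones", "Staffing"),
    ],
)

def inventory(conn, class_mark):
    """Function 1: everything filed under one class-mark, in date order."""
    return conn.execute(
        "SELECT pkid, date_sent, title FROM items "
        "WHERE class_mark = ? ORDER BY date_sent",
        (class_mark,),
    ).fetchall()

def between(conn, person_a, person_b, start, end):
    """Function 2: items exchanged between two people within a date range."""
    return conn.execute(
        """SELECT pkid, date_sent, author, recipient FROM items
           WHERE ((author = ? AND recipient = ?) OR (author = ? AND recipient = ?))
             AND date_sent BETWEEN ? AND ?
           ORDER BY date_sent""",
        (person_a, person_b, person_b, person_a, start, end),
    ).fetchall()

print(inventory(conn, "ABC 1-2-3"))
print(between(conn, "John Smith", "Peter Brown", "1921-01-01", "1921-12-31"))
```

Blank fields are stored as NULL and simply fail to match, so incomplete records do not break either query; they just drop out of date-restricted results.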

It is highly likely that not all items in a correspondence series are extant, but coming as close as possible to the original order is highly important. It may also be the case that a combination of steps 2 and 3 will allow blanks to be filled in a series of correspondence, where the correct references do not appear on the images themselves. There may be cases where minutes and memoranda also have a place in the chain, and through a similar process it is hoped that tables can be reorganised to reflect this, although this may have to be a manual process.

The primary key which can be assigned to each entry would presumably permit the entire table to be returned to its original state through a sort. By assigning a similar number to each item once the reorganisation is complete, it will presumably also be possible to switch between the two.
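
A minimal sketch of that idea, assuming a second column (here called `series_order`, a made-up name) that is filled in once the reorganisation is done. Python and SQLite are used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (pkid INTEGER PRIMARY KEY, title TEXT, series_order INTEGER)"
)
# Rows are entered in the order the images are found in the folder.
conn.executemany(
    "INSERT INTO items (title) VALUES (?)",
    [("Reply",), ("Original letter",), ("Minute on reply",)],
)
# Later, the reconstructed order is recorded without touching pkid.
conn.executemany(
    "UPDATE items SET series_order = ? WHERE pkid = ?",
    [(2, 1), (1, 2), (3, 3)],
)

def titles(conn, order_column):
    """List titles sorted by either the original or the reconstructed order."""
    if order_column not in ("pkid", "series_order"):
        raise ValueError("unknown order column")
    return [row[0] for row in conn.execute(
        f"SELECT title FROM items ORDER BY {order_column}")]

print(titles(conn, "pkid"))          # order of data entry
print(titles(conn, "series_order"))  # reconstructed order
```

Because the original key is never overwritten, switching between the two views is only a change of sort column.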

The desired end result is to have an ordered and searchable database of the material in the archive, which shows chains of correspondence and related materials. Taking a particular item, one should be able to not only link to following and preceding items, but get an overview of the entire series.
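
One possible way to get the overview of an entire series is to treat each ‘Refers to’ value as a link to the item carrying that ‘Reference’, then collect everything connected to the starting item. The sketch below is only an assumption about how that could work: the dictionary keys and the sample references are invented, and real reference codes will be messier than this.

```python
from collections import defaultdict, deque

def correspondence_series(items, start_pkid):
    """Return every item linked to start_pkid by reference, oldest first."""
    ref_to_pkid = {it["reference"]: it["pkid"] for it in items if it["reference"]}
    links = defaultdict(set)
    for it in items:
        target = ref_to_pkid.get(it["refers_to"])
        if target is not None and target != it["pkid"]:
            # Record the link in both directions so the walk can go
            # backwards to earlier items and forwards to replies.
            links[it["pkid"]].add(target)
            links[target].add(it["pkid"])
    seen, queue = {start_pkid}, deque([start_pkid])
    while queue:
        current = queue.popleft()
        for neighbour in links[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    by_pkid = {it["pkid"]: it for it in items}
    # Undated items are placed at the end for manual checking.
    return sorted(
        (by_pkid[p] for p in seen),
        key=lambda it: (it["date_sent"] is None, it["date_sent"] or "", it["pkid"]),
    )

# Invented sample data.
items = [
    {"pkid": 1, "reference": "A/1", "refers_to": None, "date_sent": "1921-03-04"},
    {"pkid": 2, "reference": "A/2", "refers_to": "A/1", "date_sent": "1921-03-20"},
    {"pkid": 3, "reference": "B/7", "refers_to": None, "date_sent": "1921-03-10"},
    {"pkid": 4, "reference": None, "refers_to": "A/2", "date_sent": None},
    {"pkid": 5, "reference": "A/3", "refers_to": "A/2", "date_sent": "1921-04-02"},
]
series = correspondence_series(items, 2)
print([it["pkid"] for it in series])
```

Items whose references are missing from the image will not be picked up by this walk; those are the gaps that a name-and-date query would have to help fill by hand.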

At this stage there are only two questions which truly need to be asked. Is all this possible, and further, how complex would it be to achieve?

Secondly, as far as data inputting, which is ready to commence, goes, will it be better to use separate tables for each folder, or simply one master table? The former approach would make creating the inventory much easier, but would it not make constructing the necessary queries more difficult?
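
For what it is worth, a single master table with a class-mark column can still produce a per-folder inventory, because rows can be grouped by class-mark at query time. A small sketch under that assumption (Python and SQLite for illustration; table, columns and rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (pkid INTEGER PRIMARY KEY, class_mark TEXT, "
    "date_sent TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO items (class_mark, date_sent, title) VALUES (?, ?, ?)",
    [
        ("ABC 1-2-3", "1921-03-04", "Letter on goods prices"),
        ("ABC 1-2-3", "1921-05-10", "Reply on goods prices"),
        ("ABC 1-2-4", "1934-01-15", "Staffing memorandum"),
    ],
)

def folder_summary(conn):
    """One row per folder: class-mark, item count, earliest and latest date."""
    return conn.execute(
        """SELECT class_mark, COUNT(*), MIN(date_sent), MAX(date_sent)
           FROM items GROUP BY class_mark ORDER BY class_mark"""
    ).fetchall()

for row in folder_summary(conn):
    print(row)
```

With one table per folder, the same summary, and any search across folders, would need every table to be queried and the results stitched together.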

Any guidance you can provide will be much appreciated.

J.
 

Lightwave

Ad astra
Local time
Today, 00:47
Joined
Sep 27, 2004
Messages
1,521
Sounds like a useful project, J.

Yes, this is very much possible. However, it sounds like a big data input job, and there is really no getting round that, especially if optical character recognition is not possible, but it sounds like you are resigned to this.

Firstly, I have not done anything quite on this scale, and I recommend you take more than one person's advice if this is truly as big a job as it sounds. That caveat in place, my suggestion is as follows.

Start by making the following table as the main table. Let's call it T001DocumentList.

Fields as follows:
- PKID (primary key number), set to autonumber; this will simply increment as you add records
- DocumentID
- Date sent
- Date received
- Reference (in the form of a code frequently appearing on items)
- Refers to
- Date of items referred to
- Author’s Name
- Author’s Position
- Recipient’s Name
- Recipient’s Position
- Title
- Subject
- Notes
- Class-mark
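
For anyone who wants to see the table written down, here is a sketch of it as a SQLite schema created from Python. This is not the Access table itself: the column names are just the field names above with the spaces removed, and storing dates as ISO text is an assumption made for the example.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS T001DocumentList (
    PKID               INTEGER PRIMARY KEY AUTOINCREMENT,  -- managed by the database
    DocumentID         TEXT NOT NULL UNIQUE,               -- number or name of the scanned file
    DateSent           TEXT,                               -- ISO dates (YYYY-MM-DD) sort as text
    DateReceived       TEXT,
    Reference          TEXT,
    RefersTo           TEXT,
    DateReferredTo     TEXT,
    AuthorName         TEXT,
    AuthorPosition     TEXT,
    RecipientName      TEXT,
    RecipientPosition  TEXT,
    Title              TEXT,
    Subject            TEXT,
    Notes              TEXT,
    ClassMark          TEXT
);
"""

def create_database(path=":memory:"):
    """Open (or create) the database and make sure the main table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

conn = create_database()
# Only DocumentID is required; every other field may be left blank.
conn.execute(
    "INSERT INTO T001DocumentList (DocumentID, ClassMark) VALUES (?, ?)",
    ("1", "ABC 1-2-3"),
)
row = conn.execute(
    "SELECT PKID, DocumentID, DateSent FROM T001DocumentList"
).fetchone()
print(row)  # (1, '1', None)
```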

Now I would make as nice a form as possible for this table, as it sounds like someone is going to be spending a very long time inputting the information. Getting a nice form that users enjoy working with is quite important in these ultra-intensive data input tasks. Big buttons and big, clearly labelled fields help to speed the process up.

So assume you've done that. Now start scanning.
Scanning can be done in any order, but several things are important.
Use a stamp or a pen to mark each document with a number. If the documents are particularly important, you might want to think about a non-destructive form of labelling.
That number should increment by one and should never repeat. This is the DocumentID.
It is VERY important that every time you scan a document, you give the scanned file a name matching the number written on the document.
Continue to the end, whereupon you should have a pile of documents on the shelves marked
1, 2, 3, 4, 5, etc.

and you should have a series of documents in a file somewhere named
1.pdf
2.pdf
3.pdf
4.pdf
5.pdf
etc….
(assuming it scans to pdf but whatever format the principle is the same)

I can see some practical complications with this step. For instance, it might not be practical to name the documents as they are being scanned, but I can see some real advantages for subsequent storage on shelves. All paper documents can be lined up on a shelf in number order, and you can place new documents at the end. If you want the 1000th document, you simply walk down to roughly the middle of the shelf and there it should be. You will never really need to do any sorting of the physical documents, as all sorting can be done on the computer in seconds, which should mean staff hire is really cheap. Anyone can count, right?

There is probably software that you can get that will incrementally name documents 1,2,3,4,5 on scanning.
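
If no such software is to hand, the numbering can be worked out by a short script after scanning. The sketch below only builds a list of old and new names and renames nothing; the function name and the sample file names are made up, and it assumes the scanner's own file names sort in scanning order.

```python
from pathlib import PurePath

def numbering_plan(filenames, start=1):
    """Pair each scanned file name with an incremental name such as 1.pdf."""
    plan = []
    # Plain alphabetical sorting puts "scan_10" before "scan_2", so the
    # scanner should be set to zero-pad its own names (scan_0002 and so on).
    for number, name in enumerate(sorted(filenames), start=start):
        plan.append((name, f"{number}{PurePath(name).suffix.lower()}"))
    return plan

print(numbering_plan(["scan_b.PDF", "scan_a.pdf"]))
# A later batch carries on from the next unused number.
print(numbering_plan(["new_scan.pdf"], start=2001))
```

Checking the plan by eye before renaming anything is the safer route, since a wrong number here would break the link between paper and file.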

So what I would do next is this.

Get your Excel spreadsheet out and place two columns in it; name one PKID and the other DocumentID.
Populate PKID with 1 to however many documents you have, and similarly populate DocumentID with the document numbers (1 to 2000 in this example).

Now, it's important that there is a DocumentID for every document. If, for instance, you lost the document with the number 1000 on it, you should have 1999 records, but the DocumentID should count from 1 to 2000, OMITTING 1000.

Take this Excel sheet and import it into the database, and you should have 2000 blank records.
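
The same blank records could also be generated directly rather than going through a spreadsheet. A sketch, with Python and SQLite standing in for Access and a cut-down version of the table; the function name and the `missing` parameter are inventions for the example.

```python
import sqlite3

def create_blank_records(conn, last_id, missing=()):
    """Create one empty record per DocumentID from 1 to last_id, skipping lost ones."""
    conn.execute("""CREATE TABLE IF NOT EXISTS T001DocumentList (
        PKID INTEGER PRIMARY KEY AUTOINCREMENT,
        DocumentID INTEGER NOT NULL UNIQUE,
        Title TEXT,
        Notes TEXT)""")
    skip = set(missing)
    rows = [(n,) for n in range(1, last_id + 1) if n not in skip]
    conn.executemany("INSERT INTO T001DocumentList (DocumentID) VALUES (?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
created = create_blank_records(conn, 2000, missing=[1000])
print(created)  # 1999
```

Note that PKID and DocumentID drift apart after a gap: here the record with PKID 1000 holds DocumentID 1001, which is why the two fields are kept separate.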

At this point I reckon you are maybe halfway through the input. The next part may or may not take longer; it depends on how easy it is to get the information from each document, and how long it took you to do that very probably mammoth scanning job. The good thing about the next part is that you could set it up so that a small army of people work on it, and it should get proportionally quicker the more people you throw at the job.

On your form you will have a SHOW document button.

Behind the SHOW document button will be a very simple bit of code.

It just says: go to directory X and open file [DocumentID] & ".pdf"

Now, directory X will vary for you, BUT the central theme is that all files are placed in the same directory: one single directory. Creating new directories with new names is subjective, cumbersome and arbitrary. If you have all the important categories in the database, any "sorting" of documents can be done digitally. Why recreate on a computer all the hassle you are trying to get rid of?
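
The logic behind the SHOW document button can be sketched as follows. This is a Python stand-in for the idea, not the code in the form itself, and the directory `C:/Archive/Scans` is a made-up placeholder for directory X.

```python
import os
import subprocess
import sys
from pathlib import Path

# Hypothetical location of the single directory holding every scan.
DOCUMENT_DIR = Path("C:/Archive/Scans")

def document_path(document_id, directory=DOCUMENT_DIR, extension=".pdf"):
    """Build the file path for a DocumentID, e.g. 42 -> <directory>/42.pdf."""
    return Path(directory) / f"{document_id}{extension}"

def show_document(document_id, directory=DOCUMENT_DIR):
    """Open the scan in whatever program the operating system uses for it."""
    path = document_path(document_id, directory)
    if not path.is_file():
        raise FileNotFoundError(path)
    if sys.platform.startswith("win"):
        os.startfile(str(path))  # Windows only
    elif sys.platform == "darwin":
        subprocess.run(["open", str(path)], check=False)
    else:
        subprocess.run(["xdg-open", str(path)], check=False)

print(document_path(42))
```

Since the path is computed from the DocumentID each time, nothing about file locations needs to be typed into individual records as long as everything sits in the one directory.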

Now comes the classification

Sit your understudy down at a computer, get the handcuffs out, and tell them to go through each record: click on the button, view the document, and fill out the appropriate fields on the form from the scanned image on screen. This is where you could throw multiple classifiers at the problem. If this is a really big job, you might want to give them extra-wide screens and think about the design of the form, so that it is easy to see both the scanned image and the form on the screen at the same time. This will make a big difference to the data inputters.

Continue until said understudies run out of records.


Hey presto, you should now have a complete database of referenced documents.

Now, with regard to queries, the following should be borne in mind. If you want to search on something, you need to have a field in the database that can distinguish it. If it ain't there, you won't be able to search on it. Other than that, it sounds like you've thought about the field names a bit and they are complete. It's not a disaster if you miss something, but it may mean that you have to rehire the understudy to go back and fill in a field that was added later.

Now, going forward, you simply scan any new documents, give them a new DocumentID, and create a new record in the database with the same DocumentID. The paper document goes on the end of what is very possibly an incredibly long shelf, and the digital document just goes in the single folder.

Someone wants to see the paper version? No problem. Sit them down in front of the database, tell them to determine the DocumentID, and then get said understudy to walk through the vault until he finds the Ark of the Covenant marked 4,320,456.

Now, in the above example I've described the process as consecutive, i.e. scanning THEN classifying. You could organise it so there is a slight overlap. Maybe have a team of individuals on scanning, but if they are getting through it quickly and you are running out of scanners, you could divert some labour to the classification. It doesn't really matter.

Good luck
 

Lightwave

Oh yes, and another good thing about the above method is that it semi-protects the intellectual property in your information.

If for instance a hacker gets in and gets all of your documents...

Fine, they have stolen that information, and technically they could get information from all the documents. However, without a database they are going to have to sit down and open up every single document to identify what the hell it is. If, for instance, you have 2000 documents, that could take them years if they are doing it by themselves! As you can see from the above explanation, getting the scanned documents is only halfway to having a useful data source.

I always thought iTunes missed a bit of a trick there with their music. If they had given all their tracks incremental numbers and then hidden the numbers in the IPlayer, people would never really be able to go through the directory to find the individual songs that they wanted copied. And importing them into a new player would just give a number of record headers marked 1, 2, 3, 4, 5, so playlists would be shot...
 

hogarn

Thanks for the feedback. In particular, the show script will be of great use. Presumably it would be simple to use a dual screen set-up to have the file on one, and the form on another?

As far as the DocumentID field goes, I don't physically have access to the archive, and the material has already been digitised, which means altering the file architecture is unfortunately not an option, nor is marking up the documents. The folders reflect the physical structure of the archive, and that's not something I can alter.

Therefore, I can use a renaming tool to add the class-mark information, for argument's sake ABC 1-2-3, to each file, resulting in ABC1-2-3-File1.jpg/.pdf or similar, but that's as far as it goes.
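
To make the naming rule concrete, here is a small sketch of it in Python. It assumes each folder is named after its class-mark and that spaces are simply dropped from the prefix, as in the example above; the function names are made up, and `rename_plan` only lists the proposed changes without renaming anything.

```python
from pathlib import Path

def prefixed_name(class_mark, filename):
    """Prefix a file name with its class-mark, e.g. ABC1-2-3-File1.jpg."""
    prefix = class_mark.replace(" ", "")
    if filename.startswith(prefix + "-"):
        return filename  # already carries the prefix, leave it alone
    return f"{prefix}-{filename}"

def rename_plan(archive_root):
    """List (old path, new path) pairs for every file one level below the root."""
    plan = []
    for folder in sorted(p for p in Path(archive_root).iterdir() if p.is_dir()):
        for file in sorted(p for p in folder.iterdir() if p.is_file()):
            new_name = prefixed_name(folder.name, file.name)
            if new_name != file.name:
                plan.append((file, file.with_name(new_name)))
    return plan

print(prefixed_name("ABC 1-2-3", "File1.jpg"))  # ABC1-2-3-File1.jpg
```

Because the prefixed names are unique across the whole archive, they could serve directly as the DocumentID even though the folders stay where they are.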

Unfortunately, I also have no underlings, so this will be down to me.

Presumably the PKID is the distinguishing field for each, as other entries will have duplications, such as names and dates. I will need to search across multiple fields, but assuming each entry has a PKID, it should be possible to make some complex queries.

J.
 

Lightwave

Yes, I thought there might be some limiting factors.

Yes, dual screens are very useful for that kind of thing.

With regard to the limitations you specify, the principle remains the same.

Take a list of the digitised documents and place those in the DocumentID field.

The script remains the same, except that the DocumentID field will consist of letters, symbols and numbers. Not quite as elegant to the eye, but the computer won't care.

I'll get that code to you. I'm busy at the moment but can do it Sunday night. I'll sit down and write a simple database that will show proof of principle.

It's not quite clear from your post, but I suspect that all the physical folders have been recreated on the computer as digital folders. This does add complexity, particularly if those folders have been inconsistently named (which they invariably are in categorising systems). My preference would be to place all of the digital files in one directory, but again your hands may be tied on that one, and admin may come down hard on you for it. If you can't do this, you will need to store an element of the folder in the database, which is unfortunate, but rest assured your organisation won't be the first to continually perform a task that is really not needed!
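
Storing an element of the folder could look something like the sketch below, which walks the existing folders, records each file's folder alongside its name, and rebuilds the full path when a document needs to be shown. It is a Python and SQLite illustration only; the `Folder` column and both function names are assumptions, and here the DocumentID is taken to be the full file name including its extension.

```python
import sqlite3
from pathlib import Path

def index_archive(conn, archive_root):
    """Record every file under archive_root as (Folder, DocumentID)."""
    root = Path(archive_root)
    conn.execute("""CREATE TABLE IF NOT EXISTS T001DocumentList (
        PKID INTEGER PRIMARY KEY AUTOINCREMENT,
        Folder TEXT NOT NULL,
        DocumentID TEXT NOT NULL,
        UNIQUE (Folder, DocumentID))""")
    for file in sorted(root.rglob("*")):
        if file.is_file():
            folder = file.parent.relative_to(root).as_posix()
            # INSERT OR IGNORE makes it safe to run again after new scans arrive.
            conn.execute(
                "INSERT OR IGNORE INTO T001DocumentList (Folder, DocumentID) "
                "VALUES (?, ?)",
                (folder, file.name),
            )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM T001DocumentList").fetchone()[0]

def stored_path(archive_root, folder, document_id):
    """Rebuild the location of a scan from the values held in the record."""
    return Path(archive_root) / folder / document_id

print(stored_path("archive", "ABC 1-2-3", "File1.jpg"))
```

The folder is stored relative to the archive root, so the whole archive can be moved to another drive by changing one setting rather than every record.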

PKID will be managed by the computer completely; you shouldn't need to worry about it. It is always good practice to have tables with non-repeating primary keys, and they can be used further down the line to link records as and when you are asked to develop the system.
 

Lightwave

OK, I've created the database, but it's Sunday night and this site won't let me upload it. I'll try again either tomorrow at work or later.

M
 

Lightwave

OK, here's that file.

Create a text document called Document1.txt and place it in the root of the C: drive (C:\).

Download the database (Access 2003 format).

Open it, then open form 1 and take it from there.

Mark
 

Attachments

  • DocumentList.mdb
    316 KB · Views: 155
