What Database to store RTF texts, images & PDF fragments (Research database)

pandzio · Mar 27, 2016

I'd like to manage, database-style, a hard drive full of various types of files connected to my field of interest: ancient sources and the modern works that analyse and cross-reference them. So it is a lot of images of old manuscripts, retyped old texts in ancient languages, scholarly books and articles (scanned, in PDFs or from the Web) and my own notes.
With my admittedly limited knowledge, it is impossible for me to have them all tagged, searchable, filterable and cross-referenced.
I have some folders of images tagged, but tagging doesn't work for GIFs, PNGs and PDFs in Windows 10. Even with the Shell that lets me tag PDFs (but not PNGs, for some reason), the Windows search box ignores the tags on these files.
I have some of the data (parts of the books and web articles) made into an Excel spreadsheet with images and filters, but converting all of my material (optical character recognition, corrections, solving the footnote issues, placing images as JPGs instead of uncompressed bitmaps, and so on) seems like an unrealistic, endless nightmare. Not to mention that Excel is not good at handling a lot of formatted text and image data. Often a file crashes or loses all its images "as attempt to cure a one image issue", so I assume it is not a safe way to keep my data.
But first of all, it does not cover all of my files, only what I manually put in there.

I was thinking of MS Access, but I am afraid it would still cost me a lot of my life just getting the data into Access, and I still would not be able to dynamically filter and instantly view the relevant parts of my images, texts, etc.

Maybe it should be a file-manager-like database that would be able to tag and work with chunks of files.

You know: this paragraph of this PDF file is about that ancient writer and that other one too, and mentions these three scholars' opinions from these three works. And this footnote from this Word file mentions these ones among them... And this image is an example of this mentioned astronomical technique. And this part is the second paragraph on the scanned image version of this page of that book (which has all its pages scanned in this folder) ...

Any thoughts and ideas on how one would manage to work with such diverse files and "database" them? There are apps that let me do a few things to a few file formats, but not all of them. And putting all of that into a database (especially the images) seems like a stupid and endless job.

Ideally I would like the materials to be filtered and to see several images/pages at a glance, without needing to double-click each one only to find it is not the one I wanted. I would be okay even with full PDF pages and whole image files; I am willing to give up having the non-relevant parts of images cropped away and the paragraph of interest cropped out of the PDF page. That would be nice, but realistically I assume it is too much to ask, and I will be happy even without it.

At the moment I have no practical knowledge of databases or programming; I have just learned some things in Excel and some basic theory about Access.

MarkK · Mar 27, 2016

What a database can do is model a system of relationships between things. So a collector has many cars, or a class has many students, or a customer has many orders. In a database, your essential unit of meaningfulness is the one-to-many relationship, implemented as follows, using two tables. . .

tParent
ParentID (Primary Key)
ParentData

tChild
ChildID (Primary Key)
ParentID (Foreign Key, defines the parent of this instance)
ChildData

So using that structure, one "parent thing" can have many related "child things," but notice that there is no "data" at this point, this is just the shape of the box that will hold the data.
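As a rough illustration of the two-table structure above, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only an assumption chosen for the demo (the thread does not settle on a product), and the sample values "collector", "car A" and "car B" are made up.

```python
import sqlite3

def build_parent_child():
    """Create tParent and tChild in an in-memory database and add sample rows."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is switched on per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE tParent (
            ParentID   INTEGER PRIMARY KEY,
            ParentData TEXT
        );
        CREATE TABLE tChild (
            ChildID   INTEGER PRIMARY KEY,
            ParentID  INTEGER NOT NULL REFERENCES tParent(ParentID),
            ChildData TEXT
        );
    """)
    # One parent row (made-up sample data) ...
    cur = conn.execute("INSERT INTO tParent (ParentData) VALUES (?)", ("collector",))
    parent_id = cur.lastrowid
    # ... with many child rows pointing back at it through ParentID.
    conn.executemany(
        "INSERT INTO tChild (ParentID, ChildData) VALUES (?, ?)",
        [(parent_id, "car A"), (parent_id, "car B")],
    )
    return conn

def children_of(conn, parent_id):
    """Return the ChildData of every child row belonging to one parent."""
    rows = conn.execute(
        "SELECT ChildData FROM tChild WHERE ParentID = ? ORDER BY ChildID",
        (parent_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling children_of(build_parent_child(), 1) lists the children of the first parent. The foreign key is what makes the relationship a rule rather than a convention: a child row that names a non-existent parent is rejected.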

I think your challenge, if you choose to go with a database, will be to design the tables. What are the 'objects' that actually exist that you need to keep track of, and how are they related to each other? Like, let's say you have a jpg, which is a photo of a famous manuscript, and you are concerned with a footnote in the photo. Holy smokes, that's a complex chain of related objects.
1) File on disk - an actual thing
2) Famous Manuscript - a historic reality with authors, pub dates, etc
3) Manuscript Text - the actual, searchable, text of the historical document
4) Footnote Text - a meaningful defined piece of 3) above
5) Tags - Units of meaning or value attached to other objects
And the more detail you want to get OUT of the system, the more detail you have to put IN. This looks like only a narrow piece of what your overall target is.
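To make item 5 concrete, here is one possible way tags could be attached to pieces of text, sketched with Python's sqlite3 module. Everything in it is an assumption for illustration: the table names tFragment, tTag and tFragmentTag, the free-form Locator column, and the helper functions are hypothetical and not part of any design agreed in this thread. Since one fragment can carry many tags and one tag can sit on many fragments, the sketch uses a junction table, which is just two one-to-many relationships back to back.

```python
import sqlite3

def build_tag_demo():
    """Create hypothetical fragment, tag and junction tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE tFragment (
            FragmentID   INTEGER PRIMARY KEY,
            Locator      TEXT,  -- free-form, e.g. 'page 12, footnote 3'
            FragmentText TEXT
        );
        CREATE TABLE tTag (
            TagID   INTEGER PRIMARY KEY,
            TagName TEXT NOT NULL UNIQUE
        );
        CREATE TABLE tFragmentTag (
            FragmentID INTEGER NOT NULL REFERENCES tFragment(FragmentID),
            TagID      INTEGER NOT NULL REFERENCES tTag(TagID),
            PRIMARY KEY (FragmentID, TagID)
        );
    """)
    return conn

def tag_fragment(conn, fragment_id, tag_name):
    """Attach a tag to a fragment, creating the tag if it does not exist yet."""
    conn.execute("INSERT OR IGNORE INTO tTag (TagName) VALUES (?)", (tag_name,))
    tag_id = conn.execute(
        "SELECT TagID FROM tTag WHERE TagName = ?", (tag_name,)
    ).fetchone()[0]
    # OR IGNORE makes tagging the same fragment twice harmless.
    conn.execute(
        "INSERT OR IGNORE INTO tFragmentTag (FragmentID, TagID) VALUES (?, ?)",
        (fragment_id, tag_id),
    )

def fragments_with_tag(conn, tag_name):
    """Return the locators of all fragments carrying the given tag."""
    rows = conn.execute(
        """
        SELECT f.Locator
        FROM tFragment AS f
        JOIN tFragmentTag AS ft ON ft.FragmentID = f.FragmentID
        JOIN tTag AS t ON t.TagID = ft.TagID
        WHERE t.TagName = ?
        ORDER BY f.FragmentID
        """,
        (tag_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Filtering by tag then becomes a single query, but note that every fragment and every tag still has to be entered by hand, which is exactly the OUT/IN trade-off described above.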

But to explore this, let's start building the tables. One will be tFile, which will be a file on disk. Whether you store the file IN the database or you just store the path, doesn't matter at this stage.

tFile
FileID (PK)
Path
Name
ActualFileData
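A minimal sketch of tFile in SQLite (via Python's sqlite3 module, again only an assumed choice of product) shows that both storage options fit the same table. The column types and the helper names create_tfile and register_file are assumptions; the embed flag is a hypothetical switch between storing just the path and also copying the file's bytes into ActualFileData.

```python
import sqlite3
from pathlib import Path

def create_tfile(conn):
    """Create the tFile table; ActualFileData stays NULL when only the path is kept."""
    conn.execute("""
        CREATE TABLE tFile (
            FileID         INTEGER PRIMARY KEY,
            Path           TEXT NOT NULL,
            Name           TEXT NOT NULL,
            ActualFileData BLOB
        )
    """)

def register_file(conn, file_path, embed=False):
    """Record one file on disk and return its new FileID.

    With embed=False only the folder and file name are stored.
    With embed=True the file's bytes are also copied into the database.
    """
    p = Path(file_path)
    data = p.read_bytes() if embed else None
    cur = conn.execute(
        "INSERT INTO tFile (Path, Name, ActualFileData) VALUES (?, ?, ?)",
        (str(p.parent), p.name, data),
    )
    return cur.lastrowid
```

Keeping only the path keeps the database small but breaks if files are moved or renamed; embedding the bytes makes the database self-contained but much larger, which matters for full-page scans.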

What other "objects" are there? And in this process we'll fairly quickly get an idea of whether we can design a "box" to store the shape of the data you have.

pandzio · Mar 30, 2016

Hi, MarkK. You understand my planned database structure quite correctly. But this was only to illustrate my question, which is of a more general nature (as I anticipated I would be asked "what do you want to do?").

However, in case that was not clearly highlighted: I was asking for advice on the CHOICE of database programme (Access, Oracle, ... what have you) for a database whose core data are Memo fields (of varying length, which is important) and images (often big ones, like scans of full pages that need to be readable). My main work with them would be handling (input, editing, filtering, sorting) multiple rows at the same time, with no clicking of "next entry" and "previous entry" buttons, more like in Excel where you scroll up and down through your data.

And:
1) Are there any differences in how different apps handle auto-fitting the row height to the length of the memo (long RTF-formatted text) in the modes that allow changing data (not the read-and-filter-only reports)?
2) And similar options for images. With them it is harder for me to say what I want; I am more interested in learning what is possible than in choosing which approach and options suit me.

MarkK · Mar 30, 2016

pandzio said:
You understand my planned database structure quite correctly.

LOL, only what I'm saying is that I don't understand your planned database structure at all! I think you want to make free-form, ad hoc connections between previously unrelated things, and in pursuit of that I doubt that row height is where you'll get stuck.

Design your entity-relationship diagram (ERD) first, and see if you can even make a functional model of the problem. What entities do you have? How are they related to each other? Modeling your data problem looks very, very hard to me, and I think your biggest problem, by far, is not the differences between database products, but whether the problem is suitable for database representation at all.

pandzio · Mar 30, 2016

Yeah, I know, to the guys who actually work with Access I must be one of those beginners who want their first project to be a jack of all trades. Sorry.

My examples are preliminary and indeed terrible. Can we ignore them?

I know for sure my database will rely on an RTF memo field and an image field. The rest I will think through several times to make it work with correct normalisation and relationships.

I am wondering whether I should stick with Access, which handles these two data types a bit cumbersomely, or whether another database environment lets you do fancier things with them.
Perhaps someone could give me a hint about what I should Google? I have been unsuccessful in wording a search that would return results about it...
