Import whole lotta MS Word 2000 docs into Database

yahazim

OK. Thank you in advance for product suggestions, methods, ideas, etc. Real concrete working info of course would be the best.

----

I'm on a solution development team. We've captured many development documents based on single form templates, so they all look the same.

I want to import all those documents into a database as cleanly as possible. I would really like to extract all the information in one fell swoop and populate the associated tables, normalizing everything in the process.

One solution I can think of is to write a SQL statement that searches for specific, repeatable strings that point to the unique info I want. EXAMPLE: "First Name: " (extract everything past that string, including the space, and put the extracted text in the database).
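For what it's worth, the label idea can be sketched with plain VBA string functions rather than SQL (that swap is my assumption, not something the post asks for). The function name ValueAfterLabel is made up for illustration, and the sketch assumes each label and its value sit on one line that ends with a carriage return, which is how Word ends a paragraph.

```vba
' Return the text that follows a label such as "First Name: " on the same line.
' Returns "" when the label is not present. (Hypothetical helper.)
Public Function ValueAfterLabel(ByVal sText As String, ByVal sLabel As String) As String
    Dim lStart As Long
    Dim lEnd As Long

    lStart = InStr(1, sText, sLabel, vbTextCompare)
    If lStart = 0 Then Exit Function           ' label not found

    lStart = lStart + Len(sLabel)              ' first character after the label
    lEnd = InStr(lStart, sText, vbCr)          ' end of the line, if there is one
    If lEnd = 0 Then lEnd = Len(sText) + 1     ' label was on the last line

    ' Trim$ drops stray spaces around the value.
    ValueAfterLabel = Trim$(Mid$(sText, lStart, lEnd - lStart))
End Function
```

Called as ValueAfterLabel("First Name: Jim" & vbCr & "Last Name: X", "First Name: "), this should hand back just the name. If the same label appears more than once, only the first occurrence is used.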

Is there a simple way to use XML to do this?

Thank you SO MUCH in advance.

Jim
 

The_Doc_Man

Never worked with XML but I've done something not too dissimilar with Word.

I'm not saying my method will work, mind you, but the CONCEPT of what I did might help in some way.

First, know that Word, being a Component Object Model application, can be opened as an Application Object. (See help files on this topic.)
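A minimal sketch of that first step, assuming late binding so that no reference to the Word object library is needed. The procedure name is only for illustration.

```vba
' Start Word as an Application object, then shut it down again.
Public Sub OpenWordDemo()
    Dim oWord As Object

    Set oWord = CreateObject("Word.Application")
    oWord.Visible = False        ' keep Word hidden while the code runs

    ' ... work with oWord.Documents here ...

    oWord.Quit                   ' close Word when finished
    Set oWord = Nothing
End Sub
```

With a reference to the Word library set, the variable could be declared As Word.Application instead, which gives compile-time checking and IntelliSense.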

Once you create the application, you can use the FileSearch object (Application.FileSearch) to scan a directory and identify files that match a file-name pattern. (See help files on finding files from VBA.) The FoundFiles list (all files matching the pattern) will merely drive a program loop.
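One way to build the list of files that drives the loop is sketched below. Note that it uses VBA's built-in Dir function instead of an Office file-search object; that is a substitution on my part, chosen because Dir needs no extra object. The function name and the "*.doc" pattern are assumptions.

```vba
' Collect the full path of every .doc file in a folder. (Hypothetical helper.)
Public Function ListDocFiles(ByVal sFolder As String) As Collection
    Dim colFiles As New Collection
    Dim sName As String

    If Right$(sFolder, 1) <> "\" Then sFolder = sFolder & "\"

    sName = Dir(sFolder & "*.doc")      ' first match, or "" if there is none
    Do While Len(sName) > 0
        colFiles.Add sFolder & sName    ' store the full path
        sName = Dir()                   ' next match
    Loop

    Set ListDocFiles = colFiles
End Function
```

The caller can then loop with For Each over the returned collection and open one document per pass.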

You can use the application object to open each qualifying file. The file becomes a document in Word's Documents collection. (In fact, it becomes ActiveDocument, which is one of those shortcut words like Me or ActiveSheet.)

Once the file is open, its contents are exposed. If the information were stored in tables (which was my case), you could direct-access the tables. For instance, the number of tables is Tables.Count. The number of rows in table n is Tables(n).Rows.Count. The text contents of the third column of the second row of the first table in the active document is

ActiveDocument.Tables(1).Rows(2).Cells(3).Range.Text

or something very close to that. Look up Word help on VBA topics to see the collections that you get when you expose COM content of a Word document. You also can see paragraphs, words, and some other items. NOTE: If your document has two hard returns in a row, that is TWO paragraphs, one of which is empty.
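Here is a rough sketch of walking every table in an open document, assuming oDoc is a Word Document object that the caller has already opened. One thing worth knowing: the text of a cell comes back with two trailing characters, a paragraph mark and an end-of-cell marker (Chr(13) & Chr(7)), so the sketch strips them.

```vba
' Print the contents of every cell of every table in a document.
Public Sub DumpTables(ByVal oDoc As Object)
    Dim t As Long, r As Long, c As Long
    Dim sText As String

    For t = 1 To oDoc.Tables.Count
        For r = 1 To oDoc.Tables(t).Rows.Count
            For c = 1 To oDoc.Tables(t).Rows(r).Cells.Count
                sText = oDoc.Tables(t).Rows(r).Cells(c).Range.Text
                ' Drop the trailing Chr(13) & Chr(7) that ends each cell.
                If Len(sText) >= 2 Then sText = Left$(sText, Len(sText) - 2)
                Debug.Print t, r, c, sText
            Next c
        Next r
    Next t
End Sub
```

This assumes plain rectangular tables. If a table has vertically merged cells, Word refuses to hand out individual rows and the Rows(r) access raises an error, so such tables would need different handling.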

Anyway, I built some VBA code to open a recordset, look at the data I saw in the open document, and extract the literal text into the fields of my recordset. Then I just updated the recordset and moved on to the next chunk of data. When I was done, I closed the document (but left the application open, which is faster up to a point). When I ran out of documents, THEN I closed the application. Note, however, that if your system is memory-tight, you might have to close the application more often because of memory glut issues that are a by-product of using Windows. When you close the document, be sure to use the "Don't Save" option (SaveChanges:=wdDoNotSaveChanges) in the Document.Close method.
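Put together, the loop might look roughly like the sketch below. This is not the actual code from my site, just an illustration under stated assumptions: it runs inside Access with a DAO reference set, the table tblImport and its text fields SourceFile and FirstName are made-up names, the value of interest is assumed to sit in row 2, column 3 of the first table, and Dir is used to walk the folder.

```vba
' Import one value from each .doc file in a folder into an Access table.
' tblImport, SourceFile and FirstName are hypothetical names.
Public Sub ImportFolder(ByVal sFolder As String)
    Const wdDoNotSaveChanges As Long = 0      ' Word constant, declared for late binding
    Dim oWord As Object, oDoc As Object
    Dim rs As DAO.Recordset
    Dim sName As String, sText As String

    If Right$(sFolder, 1) <> "\" Then sFolder = sFolder & "\"
    Set oWord = CreateObject("Word.Application")
    Set rs = CurrentDb.OpenRecordset("tblImport", dbOpenDynaset)

    sName = Dir(sFolder & "*.doc")
    Do While Len(sName) > 0
        Set oDoc = oWord.Documents.Open(FileName:=sFolder & sName, ReadOnly:=True)

        ' Read one cell and strip its trailing Chr(13) & Chr(7).
        sText = oDoc.Tables(1).Rows(2).Cells(3).Range.Text
        If Len(sText) >= 2 Then sText = Left$(sText, Len(sText) - 2)

        rs.AddNew
        rs!SourceFile = sName
        rs!FirstName = sText
        rs.Update

        oDoc.Close SaveChanges:=wdDoNotSaveChanges   ' close the document, keep Word open
        sName = Dir()
    Loop

    rs.Close
    oWord.Quit                                       ' close Word only after the last file
    Set oWord = Nothing
End Sub
```

There is no error handling here, so a document without a table would stop the run and leave a hidden copy of Word in memory. A real version would want an On Error handler that closes the document and quits Word before exiting.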

As to the parsing of what is inside the file, that is going to depend on your individual circumstances. In my case, the solution was originally designed to be tabular because we knew about Word tables and the COM contents. In other words, the Access tables came first, THEN the Word tables. We had control of which was the chicken and which was the egg.

This database I mention is a real-world solution, still in operation at my site. However, I cannot post it because I am at a military site and the database contains some things that are a bit sensitive. So you'll have to live with my description of how I approached the problem.
 
