Using VBA to save PDF as Text

Ancalima

OK, what I am trying to do is use VBA code to save an Acrobat version 7.0 PDF as a text file so it can be pulled into a database using Monarch. The actual saving of the file as text works, and here is the code I have used to accomplish that:

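' Late-bound Acrobat objects, so no library reference is needed
Dim AcroXApp As Object
Dim AcroXAVDoc As Object
Dim AcroXPDDoc As Object
Dim jsObj As Object

' Start Acrobat and keep its window hidden
Set AcroXApp = CreateObject("AcroExch.App")
AcroXApp.Hide

' Open the PDF (PDF_PATH and filename are defined elsewhere)
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open PDF_PATH & filename, "Acrobat"
AcroXAVDoc.BringToFront

' Get the JavaScript object of the underlying document
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject

' Save as plain text; the second argument is the conversion ID
jsObj.SaveAs OUTPUT_PATH & OutputFile, "com.adobe.acrobat.plain-text"

' Close the document and shut Acrobat down
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit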

As I said, the code itself runs perfectly and saves the PDF as a plain text file. The problem is that the original PDF is in a neat column format, but the resulting text file loses all the whitespace and leaves only a single space between column fields. This is insufficient for my needs because some of the fields contain spaces, so I cannot use a space-delimited method to import them.
My current workaround is that, rather than saving the file as text, I save it as an HTML version 3.20 file by changing "com.adobe.acrobat.plain-text" to "com.adobe.acrobat.html-3-20", and I use Monarch to analyze that file. This mostly works, but there have been many bugs to work out because of inconsistencies in the way Acrobat saves the file as HTML.
The application I've built does work for the most part, but I am concerned that it will break if it receives some odd data, because as built it is somewhat fragile. As it is, I need 7 Monarch models just to gather all of the data correctly.
I was wondering if there is any way to save the PDF file as a text file but still retain the column format and whitespace from the original PDF. This would alleviate almost all of my problems. Also, I use Acrobat version 6.0 Standard, Monarch Pro 6.00 and MS Access 2000. I appreciate any help anyone could give me.
 
OK, this keeps popping up time and again,
so here goes one single answer for all:
Adobe Acrobat is a format that was created mainly to protect texts from being ripped off, and so it is widely relied on as the primary choice for ebooks. Heck, Acrobat is ebooks.
Acrobat acquires its published material through a virtual printer driver, which captures the print jobs sent to it (which, for those who don't know, rely heavily on bitmapped images). Note that a printer can also receive font details and letters for fast printing (which printer makers like to describe as compression).
So what do we understand from this?
Yes, Acrobat stores most of the acquired material (tables, art, bullets, headers, footers) in image format, while text is analysed as what the printer people call a compressed feed and is stored in text format along with its font.
Now one would ask: if the page layout is in image format, how can Acrobat align the acquired text with it? Simple, it's called layering. Acrobat records the coordinates (dimensions) where each text layer starts and stores them along with the text.
Reversing the cycle isn't a new idea. Adobe people thought of it themselves, and several attempts were made (both by Adobe and by third parties) to get a properly formatted, editable file, but the results aren't really satisfying.
Just google PDF to DOC and PDF to XLS converters and try any of them. Throw a beefy page at it and watch how it converts almost 60% of the content into images.
So are we saying that it's impossible to efficiently convert a PDF into a decently editable format?
The answer is: not through Acrobat.
Most of the converters that come close to success reveal in their marketing documents that they proudly rely on OCR technology to recognise the document layout, combined with hard arithmetic just to position the font layers correctly within the layout captured by OCR.
Now if you can take the lead from here you will probably get there, but not through Adobe (as of the date of this post; please research what they have offered since).
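
The point above about text being stored together with its page coordinates suggests one possible way to rebuild columns: if the words of a line and their left x positions can be obtained somehow, each word can be placed at a character column derived from its position. The following is only a sketch under that assumption. LayoutLine, its parameters and the ptsPerChar value are hypothetical names, not part of Acrobat. Acrobat's JavaScript API does document word-level calls such as getPageNthWord and getPageNthWordQuads, but whether they can be driven from VBA in a given Acrobat version has not been tested here.

```vba
' Place each word at a character column derived from its x position,
' so that gaps on the page become runs of spaces in the text line.
'   words()    : the words of one line, left to right (hypothetical input)
'   leftX()    : left edge of each word, in points (hypothetical input)
'   ptsPerChar : assumed width of one character cell, in points
Public Function LayoutLine(words() As String, leftX() As Double, _
                           ByVal ptsPerChar As Double) As String
    Dim i As Long
    Dim col As Long
    Dim result As String

    For i = LBound(words) To UBound(words)
        col = Int(leftX(i) / ptsPerChar)
        If Len(result) < col Then
            ' Pad with spaces up to the target column
            result = result & Space$(col - Len(result))
        ElseIf Len(result) > 0 Then
            ' Target column already passed: keep at least one space
            result = result & " "
        End If
        result = result & words(i)
    Next i

    LayoutLine = result
End Function
```

With a suitable ptsPerChar, every field would start at the same character position on each line, so the output could be read as a fixed-width file instead of a space-delimited one.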
 
Thanks!

This is insufficient for my needs because some of the fields contain spaces, so I cannot use a space-delimited method to import them.

I actually didn't have a neat table in my PDFs, so I had to write a script that crawls through the text looking for : and . characters. Maybe you can figure out a similar method for your format. The function IsNumeric might distinguish your numeric columns from the others.
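
As a rough illustration of the IsNumeric idea, here is a minimal sketch that turns a single-spaced line into tab-delimited fields. It treats every numeric token as its own field and merges runs of non-numeric tokens into one field. SplitByNumeric is a hypothetical helper, not the script mentioned above, and the rule is only a heuristic: it cannot separate two text columns that sit next to each other, and IsNumeric follows the regional settings and also accepts things like currency strings.

```vba
' Rejoin a space-separated line into tab-delimited fields.
' Numeric tokens become separate fields; consecutive non-numeric
' tokens are merged back into a single text field.
Public Function SplitByNumeric(ByVal textLine As String) As String
    Dim tokens() As String
    Dim i As Long
    Dim result As String
    Dim prevWasText As Boolean

    ' Nothing to do for an empty line (returns an empty string)
    If Len(Trim$(textLine)) = 0 Then Exit Function

    tokens = Split(Trim$(textLine), " ")
    For i = LBound(tokens) To UBound(tokens)
        If Len(tokens(i)) > 0 Then              ' skip empties from double spaces
            If IsNumeric(tokens(i)) Then
                If Len(result) > 0 Then result = result & vbTab
                result = result & tokens(i)
                prevWasText = False
            Else
                If prevWasText Then
                    result = result & " "       ' same text field continues
                ElseIf Len(result) > 0 Then
                    result = result & vbTab     ' a new text field starts
                End If
                result = result & tokens(i)
                prevWasText = True
            End If
        End If
    Next i

    SplitByNumeric = result
End Function
```

For example, the line "Smith John 1200 Main Street 45" would come back as four fields: "Smith John", "1200", "Main Street" and "45".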

Good luck!

One question: which library do I have to select in VBA? I'm getting an error when creating the ActiveX object.
An answer: probably because I don't have Adobe Acrobat. Is there any workaround for this?
 
When I use the code to open the PDF I am getting

"Run-time error '429':
ActiveX component can't create object"

when it runs line

Set AcroXApp = CreateObject("AcroExch.App")

Any idea why?

Thanks for the help.
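
Error 429 on that line generally means the "AcroExch.App" class is not registered on the machine. That class is installed by the full Adobe Acrobat product, so a missing or reader-only installation is a likely cause, though this is a guess about this particular setup. Because the code uses CreateObject (late binding), no library reference has to be ticked in the VBA editor. A minimal sketch of a check that could run before the conversion, with AcrobatIsAvailable as a hypothetical helper name:

```vba
' Returns True if the "AcroExch.App" automation class can be created.
' CreateObject raises run-time error 429 when the class is not
' registered, so the error is trapped here instead of stopping the code.
Public Function AcrobatIsAvailable() As Boolean
    Dim app As Object

    On Error Resume Next
    Set app = CreateObject("AcroExch.App")
    On Error GoTo 0

    AcrobatIsAvailable = Not (app Is Nothing)

    If Not app Is Nothing Then
        app.Exit            ' close the instance started by the check
        Set app = Nothing
    End If
End Function
```

The calling code could then show a clear message, for example "Adobe Acrobat is required", when the function returns False instead of failing with the raw run-time error.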
 
