Solved Search pdf for specific string (1 Viewer)

drifter96

New member
Local time
Today, 05:15
Joined
Aug 30, 2021
Messages
13
Hello,

I am trying to open a pdf, search the text for a specific string and capture the following characters. The pdfs do not have any tags.

Example: Fax Number: XXX-XXX-XXXX

Open PDF>
Search for: Fax Number:
Capture XXX-XXX-XXXX

Thank in advance
 
I did that first and all I found was to do it using either Excel or Word to do the search. I was hoping to find a solution that can be done through Access.
 
I was hoping to find a solution that can be done through Access.
The only way to do this using Access is to automate Word or Excel. Access cannot directly read a .pdf. I don't even know that Word and Excel can either although the others seem to think they can.

You can create a loop using FSO (File System Object) that reads through a folder and selects the .pdf's. For each one, you can use OLE to open the file using Word or Excel and then run a search macro that you have created to find the data and return a value.
 
Thank you Pat,

I don't like your answer :) but I was thinking that was my only option. I've had to do that in the past with txt files and Word. I was just hoping things had changed and it could be done with pdfs and Access.
 
From the link I posted it seemed you need Adobe Acrobat Pro?
FoxitPDF might be able to do it with an API.
Another link was mentioned in amongst those links, but I will leave you to look for it.

Word will open a pdf, but it takes it's time, at least it does for me in 2019?
Word would appear to be the simplest option?
 
Hello,

I am trying to open a pdf, search the text for a specific string and capture the following characters. The pdfs do not have any tags.

Example: Fax Number: XXX-XXX-XXXX

Open PDF>
Search for: Fax Number:
Capture XXX-XXX-XXXX

Thank in advance
Whenever I've tried to read the contents of PDF files in VBA it seems like it was horrifically complicated. The problem is worse if you don't know the source of the PDF.
I build an Invoice approval system here. A scraper routine pulls files out of Outlook messages and saves them into a folder. I have a computer set to be running a copy of OmniPage (which is a bit dated, but the price is right). OmniPage runs through all the files in the folder and performs OCR on them. It manages to reduce the file size without reducing the file quality. As an added bonus, it can easily be set to do the OCR and produce two files, one is the searchable PDF, the other is just the scraped text in ASCII. My routine scans the ASCII to determine who the Invoice is from, and can usually tell what job it's for, who ordered it, and what the total cost is.
I actually scan for phone numbers as you're trying to do here, to determine what company sent us the file.
 
I was just hoping things had changed and it could be done with pdfs and Access.
pdf is a proprietary image format. Be happy that Word reads it because Adobe is very expensive and if you can't get this to work with OLE automation, then you need to buy Adobe for every user that needs the functionality.

Not sure why you couldn't use access to read a text file. Text files are not encoded and so Access has no trouble reading them. Of course, you need some idea of the format in order to interpret the data correctly.
 
You absolutely can do this - just use VBA to open the PDF as a text file, then search for the text. Some PDF's will have a lot of garbldey gook but also have the actual text as well. Some will not. 10 lines of code max
 
read it in as a text file using scripting file system object or open/input

NOTE - after I posted this, i have to concede there are fewer pdfs than I thought that reveal their actual text this way. I have gotten lucky on some, but it may be rarer than the first impression I gave. Still, worth a try, only takes a few seconds.
 
As Isaac mentions, won't necessarily work for all pdf's but I use an exe called pdftotext

I use it to open pdf invoices and scan for specific data like invoice numbers, dates, quantities, totals etc.

For each supplier you set up a template to identify where the data can be found - typically before some text, or after or between. Once set up, the app scans an 'unprocessed' folder, imports the data and archives the pdf to a different folder.

See post #5 in this link

here is an example of what a template looks like
1722468652407.png


Suppliers do change layouts occasionally so the importer validates each value and if for whatever reason it returns an invalid value, the import of that file is aborted. User then needs to go into the above to recheck the aborted files and modify the template accordingly
 
The post is marked as solved, But I don't think anyone cares if I add my 2 cents.

Several solutions were suggested, Using external pdf to text converters, using word or Excel etc.
Here's mine.

The following code reads the contents of a pdf file and prints it to immediate window.
Works only if you have either Adobe Acrobat pro or personal installed.
Add a reference to Adobe Acrobat and run it.

SQL:
Function getTextFromPDF() As String
 
    Dim AVDoc As New AcroAVDoc
    Dim PDDoc As New AcroPDDoc
    Dim Cnt As Integer
    Dim pg As AcroPDPage
    Dim Sel As AcroPDTextSelect
    Dim Highlight As AcroHiliteList
    Dim pgNum As Long
    Dim strText As String
 
    Const MaxWordsPerPage As Integer = 10000
    Const FileName As String = "D:\AMS.pdf"
 
    strText = ""
    If (AVDoc.Open(FileName, "")) Then
        Set PDDoc = AVDoc.GetPDDoc
        For pgNum = 0 To PDDoc.GetNumPages() - 1
            Set pg = PDDoc.AcquirePage(pgNum)
            Set Highlight = New AcroHiliteList
            Highlight.Add 0, MaxWordsPerPage
            Set Sel = pg.CreatePageHilite(Highlight)

            If Not Sel Is Nothing Then
                For Cnt = 0 To Sel.GetNumText - 1
                    strText = strText & Sel.GetText(Cnt)
                Next Cnt
            End If
        Next pgNum
        AVDoc.Close 1
    End If

    Debug.Print strText

End Function

Note:
  • Be careful if you read a pdf with a lot of pages. It takes 48 seconds to read a 182 page pdf in my database.
  • You may want to add a DoEvents to be able to cancel/pause the function if it takes a long time.
  • There are more simple ways for reading a pdf from VBA. For example embedding JavaScript in a function, or using AcroAVPageView.GetText() method. but these methods need Adobe Acrobat pro. With my Personal license, they fail.
  • Since the above code reads the contents of a pdf word by word, you can add some tests and exit the function after finding what you were looking for. No need to wait for reading the whole file.
 
Last edited:
Using Adobe is being dependent on a third party app and a very expensive one at that.

I would suggest that if you have to scan documents this way that the scanning be automated so that as files get dropped into a folder, the data is scanned and exported to a text file which the Access app can then consume.

The only app I have currently that needs this functionality is one that reads bank statements to identify which refund checks were cashed. The bank statements come via email and are automatically routed to a specific folder. Then on the 15th (or whatever) the scanning tool runs and reads each file in the folder and in my case, converts it to Excel. For this purpose, a text file would also work. Then when the user wants to run the reconciliation, she opens Access and Access imports all the spreadsheets. The code checks the file names and dates to ensure that no files have been missed and no file is imported multiple times. The app includes an option to back out an import in case there was a problem so it could be reimported.
 
Last edited:
Using Adobe is being dependent on a third party app and a very expensive one at that.

I would suggest that if you have to scan documents this way that the scanning be automated so that as files get dropped into a folder, the data is scanned and exported to a text file which the Access app can then consume.

The only app I have currently that needs this functionality is one that reads bank statements to identify which refund checks were cashed. The bank statements come via email and are automatically routed to a specific folder. Then on the 15th (or whatever) the scanning tool runs and reads each file in the folder and in my case, converts it to Excel. For this purpose, a text file would also work. Then when the user wants to run the reconciliation, she opens Access and Access imports all the spreadsheets. The code checks the file names and dates to ensure that no files have been missed and no file is imported multiple times. The app includes an option to back out an import in case there was a problem so it could be reimported.
You're describing a process we use here. The only difference is that we use OmniPage to OCR the PDF files. One time purchase of like $150 with unlimited file use.
 

Users who are viewing this thread

Back
Top Bottom