Extract postal addresses from PDF mailing labels into DB

craigachan

I'm looking to extract the addresses from a set of mailing labels (PDF) into my database. The labels are scanned copies of Avery labels, 3 across. Does anyone have any idea how to do this?

[attached screenshot of the scanned label sheet]


I have about 1,500 addresses to extract. Any help would be appreciated, even if I have to purchase OCR software, but I can't seem to find one.

Thanks for your help.
 
I have taken a PDF, cut and pasted it into a text file, then read the text file and string-handled it into a database.

For my family tree, I've done that a few times when taking PDFs from different parishes containing centuries of church records of births, marriages and deaths. In a PDF you cannot do any useful, rapid or grouped searches, but once the data is in a table you can of course do anything to find and filter it easily. One problem with this type of document is that the last-name spelling can change from one Rev to another: they would often guess the spelling from the name as spoken in whatever the accent of the district was. That's easy to get around once it is in Access, but virtually impossible with a PDF. In fact, the spelling of our names often comes down to the guy in some church hundreds of years ago deciding how he thought it should be. Not many people could read, so they couldn't check it! That little exercise helped to take my male and female lines back into the 1500s.
 
Here is one you can buy that will do it on your computer, should you need to do it multiple times.
 
With Power Query (Excel) you can read many kinds of data sources and then transform them.
So you could create a usable table and then add it to your database.
 
It's interesting when you see how post offices actually structure an address.

A line of address is probably ok for most of us, but they go deeper than that.
 
First step get the PDF into Google drive.

Open the PDF as a Google document

This automatically extracts the text from the PDF.

Next copy and paste the text into ChatGPT.

I'm pretty sure it would be able to sort it out for you. I don't know if chatty has any limits on large amounts of text, but I'm sure it would be worth a try!
 
You can probably copy and paste the pdf text into Excel directly and use some simple string manipulation to extract what you want in a format suitable for the DB.

I do this with web scrapes for pricing data for a client. It takes about 10 seconds to extract around 2,000 prices and other related data.
 
First step get the PDF into Google drive.

Open the PDF as a Google document

This is very handy to know. Once extracted, I would paste into Excel and build 4 records at a time. But then I'm not good friends with chatty ;)
 
Thank you all for your suggestions. I have no problem converting the PDF to text; my problem is with the next step. With 1,500+ postal addresses, cutting and pasting each one would be very cumbersome. I can get the list into a .txt file that looks like this:

[attached screenshot of the extracted text file]


As you can see, the records break every 2 or 3 lines, but how do I get them into the following 3-line format:

Address Name
AddressStreet
AddressCityStateZip

I think I can get the states and zips separated into fields. But there are 2 other issues: addresses with a PO Box add a line, and some cities have 2 names (e.g. Sun City West) while others have 1 name (e.g. Scottsdale). If I can isolate the AddressStreet/PO Box and also the AddressCity, I can then import and create 3 fields for my database.

Any suggestions on how to do this en masse?

Thanks for all of your suggestions.
 
You can split the address at the commas with Split(), or locate them with InStr().
You can test whether "PO" is the first two characters.
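A quick sketch of that idea in Python, if it helps (the function name and sample line are mine, not from the thread; in VBA the equivalents would be Split(), InStr() and Left()):

```python
def split_line(line):
    """Split at the last comma and flag lines whose first two characters are 'PO'."""
    is_po_box = line.strip().upper().startswith("PO")
    left, _, right = line.rpartition(",")
    return left.strip(), right.strip(), is_po_box
```

For example, `split_line("PO Box 9, Sun City West, AZ 85375")` returns `("PO Box 9, Sun City West", "AZ 85375", True)`.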

TBH, with only 1,500 I would paste into Excel and adjust by hand, if this is just a one-off?
 
You might be able to identify the start of a record (the address name) based on there being only one space in the line, although it would be better if your unspecified extraction process retained a blank line between records.

But it seems an awful lot of effort if this is a one-off.
 
If all the addresses are in Arizona, I have an easy way to split the addresses. Add all the cities to a table.

Code:
You have an outer loop and an inner loop. In this case:

Do Until rs.EOF
    ' assume the first line is a name
    rs.MoveNext
    ' use LIKE to see if any of the city names is in the string for the
    ' second line; if no city name is found, assume this is a PO Box or
    ' some other data
        rs.MoveNext
        ' use LIKE to match the city; if found, split the string at the
        ' last comma - the city, state, zip is to the right of the comma
        ' and the rest of the address is to the left
        ' if not found - EXIT the loop and fix the file manually
    rs.MoveNext
Loop

You will need to review the saved records carefully and manually, since code to do this the way Google does is very complex.
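For what it's worth, that outer loop can be sketched in Python as below. The city list, record layout and sample data are my own assumptions (a real run would pull the cities from the Access table); it uses the matched city name as the split point rather than the last comma:

```python
def find_city(line, cities):
    """Try the longest names first so 'Sun City West' beats 'Sun City'."""
    for city in sorted(cities, key=len, reverse=True):
        if city.lower() in line.lower():
            return city
    return None

def parse_records(lines, cities):
    """Each record: a name line, an optional PO Box line, then a line with a known city."""
    records, i = [], 0
    while i < len(lines):
        name = lines[i]; i += 1            # assume the first line is a name
        line = lines[i]; i += 1
        extra = ""
        if find_city(line, cities) is None:
            extra, line = line, lines[i]   # no city found: PO Box or other data
            i += 1
        city = find_city(line, cities)
        if city is None:
            raise ValueError(f"no city in {line!r}: fix the file manually")
        pos = line.lower().find(city.lower())
        records.append((name, extra, line[:pos].rstrip(" ,"), line[pos:]))
    return records
```

Anything that raises the ValueError would go into a "fix by hand" pile, as Pat suggests.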
 
Those addresses are all close to me. One is only 1 zip code over. Maybe we should have coffee sometime.

The Locale Names as used by USPS can be downloaded here: ZIP Codes by Area and District codes | PostalPro (usps.com)
That would be a good way to identify, for example, "Sun City West" as a locale, and therefore the trimmed string before it is the address.
With a bit of InStr, Left, Mid, Right and Trim you'd get quite a long way. Maybe Split too.
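If it helps, here is a rough Python version of that final split, assuming the city/state/zip has already been isolated onto one line (the regex is my own guess at the format: it peels the two-letter state and 5-digit ZIP off the end, so multi-word locales like Sun City West fall out naturally):

```python
import re

def split_city_state_zip(text):
    """Peel the 2-letter state and ZIP off the end; whatever is left is the city."""
    m = re.match(r"^(.*?),?\s+([A-Z]{2})\s+(\d{5})(?:-\d{4})?$", text.strip())
    return (m.group(1).strip(), m.group(2), m.group(3)) if m else None
```

For example, `split_city_state_zip("Sun City West, AZ 85375")` returns `("Sun City West", "AZ", "85375")`, and a line that doesn't fit the pattern returns None so it can be flagged for manual review.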
 
Thank you all for your suggestions. It sounds like if I get a list of all the city names in AZ and run the AddressCity against it, I can separate everything else. I'll give it a try and see what happens. I'll post back my solution if it works. Thanks everyone.
 
