Solved Scraping a table from Website

jdraw

Super Moderator
Staff member
John,
Just for clarity, the status at the moment is that you and HE have a strategy that is progressing and will get you what you need.
Is your client who needs this data a "customer" of whoever has the website?
I have been in situations where all sorts of manipulation/transformation/"hand-raulic" mechanisms were employed to get at some data. But in reality, asking the powers that be for the data in a specific (usable) format got immediate action and streamlined "extraction" and communications. Just sayin'
 

NauticalGent

Ignore List Poster Boy
Morning Jack,

*deep breath*

Full disclosure - the "customer" is my wife's company, Performance Food Group (PFG). If you buy food in North America, or pretty much the developed world for that matter, PFG was most likely involved in the distribution process.

She works in the Human Resources department, and more specifically the information systems branch of the HR department, or HRIS. If an employee or regional HR representative has an issue with one of PFG's websites, she and her team are the ones you call.

PFG hired a consulting company, UKG, to develop a site that handles their complaints/trouble calls, or what they refer to as Case Management. UKG built it, PFG bought it, and now it is theirs. Any modifications, such as an API or a better export process, would require a separate contract and more $$$.

Skip ahead a few months, and the boss of my wife's boss wanted metrics: how many cases have/have not been resolved, where the majority of the cases are originating from, what the issues ARE, how much time it is taking to resolve them, etc.

"No problem!" they said, "We have this nifty export feature - we'll just do an export, get it into Excel and Bob's your uncle!" Well, Uncle Bob's export sucks. I made a post about it few months ago and I wont even get into how bad it is right now. Her boss set about employing her Excel mastery to produce something of value.

Several weeks later, she threw in the towel and asked my wife to engage. Missy (wife) had already been hacking away at it and had asked me to take a look. I did, and after about two weeks of cussing and discussing, I was able to give her about a 90% solution that required human interaction for the other 10%. I suppose I could have eventually worked it out, but I was tired of dealing with it, and my db is so bad I won't share it here because of the amount of chastising I would deservedly receive.

Missy was able to get her people the reports they needed within a week of doing an export, and that was good enough...until it wasn't. Her upper management was not in love with the reports (no charts, line graphs, etc.) and insisted on bringing in the Power BI team to generate reports worthy of their status. She told them that would be OK once they figured out how to get clean data. When asked to explain, she had me join a Teams meeting with the elite so I could explain the issues and the processes I use to clean the data. I again tried to impress upon them the need for an API and/or access to the server - after their eyes glazed over, I was excused and they adjourned.

They reconvened, told Missy that an API/direct access was a no-go, and that they wanted me to sit down with their Power BI team and explain the process so that they could emulate it. She explained to them that even if I did that, the remaining clean-up would still take a dedicated team hours to accomplish, depending on the number of dirty records. They asked her if I could do something about the remaining 10% and said they would be willing to compensate me for my time.

*whew*

Which brings us back to the beginning of this thread...
 

jdraw

Super Moderator
Staff member
Quite a saga. I've also seen many managers/bosses who have no/zip knowledge of the value of the information they are responsible for, nor any understanding of data management. I once worked in an area where one of the intermediate bosses came to me with, "I hear you know something about Access. Can you give me a little help?"
Sure, I said - what's the project/subject matter? He said, "I'm going to build a database to monitor quality assurance on several of our websites." Why, I asked - there is a whole team dedicated to QA of those sites that reports to you.
"Really," he said, "I didn't know exactly what they did." I also told him that QA reports were available and who the responsible manager was. His career in Access never really got started, at least not at the office. I think he did some personal stuff at home. I retired 16 years ago and he retired shortly after.

As an aside, I got a joke from a friend the other day that may fit into this post, about remembering and getting older.

"When I say, "The other day," I could be referring to any time between yesterday and 15 years ago."
 

Steve R.

Retired
"No problem!" they said, "We have this nifty export feature - we'll just do an export, get it into Excel and Bob's your uncle!" Well, Uncle Bob's export sucks. I made a post about it few months ago and I wont even get into how bad it is right now. Her boss set about employing her Excel mastery to produce something of value.

NauticalGent said:
They reconvened, told Missy that an API/direct access was a no-go, and that they wanted me to sit down with their Power BI team and explain the process so that they could emulate it. She explained to them that even if I did that, the remaining clean-up would still take a dedicated team hours to accomplish, depending on the number of dirty records.

First, I won't be able to provide any programming assistance. Sorry about that. :(
But, based on the limited information in the two paragraphs above, it would seem that a two-step approach may be a "solution", if you aren't already doing it. This also assumes that I understand what you are attempting to accomplish.

First step: scrape the data as displayed, then import it into an Excel file (or simply a text file) on your local PC without any modification.
Second step: review the data in the Excel spreadsheet (or text file) to work out how best to identify and filter out the "dirty records" through automation. This may require the development of a program to accomplish that task. A potential problem: can "dirty records" be identified automatically, without human inspection?
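
To make that second step concrete, here is a minimal Excel VBA sketch of what such a filtering program might look like. The sheet name ("RawExport"), the column positions, and the rules for what counts as "dirty" are all assumptions for illustration only - adjust them to whatever the real export actually contains.

Code:
' Flag "dirty" records after the raw export has been dumped, unmodified,
' onto a worksheet named "RawExport" (assumed name).
Public Sub FlagDirtyRecords()
    Dim ws As Worksheet
    Dim lastRow As Long, r As Long
    Dim dirty As Boolean

    Set ws = ThisWorkbook.Worksheets("RawExport")        ' assumed sheet name
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row

    For r = 2 To lastRow                                 ' row 1 assumed to be headers
        ' Assumed rules: column A must hold a case number, column D a valid date.
        dirty = (Trim$(ws.Cells(r, 1).Text) = vbNullString) _
                Or Not IsDate(ws.Cells(r, 4).Value)
        ws.Cells(r, 10).Value = IIf(dirty, "DIRTY", "OK") ' flag written to column J
    Next r
End Sub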
 

NauticalGent

Ignore List Poster Boy
Good morning folks. I am enthusiastically marking this thread "Solved".

First and foremost, a HUGE shout-out to @Edgar_. I remember when he joined AWF back in July; it seemed he came out of nowhere but immediately made an impact, and I knew that AWF had gained another "Heavy Weight".

When I started this thread, he responded via PM with a single sentence: "Hey there, John, I saw you need a hand, I could take a look..." He wasn't the only one, but since he was first, and he had helped me in the past with this same issue, I responded and we set to work the next morning.

What he did was simply amazing. I will hit the high-spots and even share the db with an example of the JSON returned from the site if anyone is interested.

As I said earlier, the site actually uses a REST API to interface with the back end that holds the raw data. He was able to see this in the HTML, and once he gained access, he made the following observations:

1. The site allows you to "pull" 100 records in JSON format at one time.
2. You only have an hour to get your business done once your credentials (token) have been activated.

Once the JSON is retrieved, he used a JSON converter (VBA-JSON v2.3.1 (c) Tim Hall - https://github.com/VBA-tools/VBA-JSON) to parse the data into a table. As of this morning, with 22,183 records, it took 1 hour and 18 minutes to fetch and process the data. Fortunately, we anticipated this early on, and to get around the 1-hour limit, we decided to break it up into two sections: Fetch and Process.
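
For anyone curious what the fetch side looks like in VBA, here is a minimal sketch assuming the API is called over HTTP with a bearer token and limit/offset paging. The endpoint URL, the paging parameter names, and the "items"/"caseId"/"status" field names are all made up for illustration; only JsonConverter.ParseJson comes from the real VBA-JSON library (imported as the JsonConverter module, with a reference to Microsoft Scripting Runtime).

Code:
' Fetch one page of (up to) 100 records and parse it with VBA-JSON.
' URL, paging parameters and JSON field names are hypothetical.
Public Sub FetchPage(ByVal offset As Long, ByVal token As String)
    Dim http As Object
    Dim url As String
    Dim parsed As Object                        ' Dictionary returned by ParseJson
    Dim item As Object

    url = "https://example.invalid/api/cases?limit=100&offset=" & offset
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False                 ' synchronous request
    http.setRequestHeader "Authorization", "Bearer " & token
    http.send

    If http.Status = 200 Then
        Set parsed = JsonConverter.ParseJson(http.responseText)
        For Each item In parsed("items")        ' "items" collection is an assumption
            Debug.Print item("caseId"), item("status")
        Next item
    End If
End Sub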

The fetch routine goes and gets the JSON and stores it in a table with only two fields, an AutoNumber (PK) and a LongText field; for the record, the average length of each JSON field is about 350k characters. The whole routine takes about 30 minutes. The second process then uses the JSON converter and populates the appropriate table.
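
A rough sketch of that two-stage layout, for anyone who wants to try something similar. The table and field names (tblJsonRaw, RawJson, tblCases, CaseID, CaseStatus) and the "items" key are assumptions; it relies on VBA-JSON and the default DAO reference in Access.

Code:
' Stage 1: the fetch stores each raw page of JSON, untouched, in a staging table.
Public Sub StoreRawJson(ByVal jsonText As String)
    Dim rs As DAO.Recordset
    Set rs = CurrentDb.OpenRecordset("tblJsonRaw", dbOpenDynaset)
    rs.AddNew
    rs!RawJson = jsonText                        ' LongText field holding the whole page
    rs.Update
    rs.Close
End Sub

' Stage 2: a separate pass parses the staged JSON and fills the real table.
Public Sub ProcessRawJson()
    Dim rsIn As DAO.Recordset, rsOut As DAO.Recordset
    Dim parsed As Object, item As Object

    Set rsIn = CurrentDb.OpenRecordset("SELECT RawJson FROM tblJsonRaw", dbOpenSnapshot)
    Set rsOut = CurrentDb.OpenRecordset("tblCases", dbOpenDynaset)

    Do Until rsIn.EOF
        Set parsed = JsonConverter.ParseJson(rsIn!RawJson & "")
        For Each item In parsed("items")         ' "items" key is an assumption
            rsOut.AddNew
            rsOut!CaseID = item("caseId")        ' hypothetical field names
            rsOut!CaseStatus = item("status")
            rsOut.Update
        Next item
        rsIn.MoveNext
    Loop
    rsIn.Close: rsOut.Close
End Sub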

We know that eventually the fetch process will exceed the one-hour limit, and he has offered to build in the ability to assign a new token (how do folks know how to do this?!?). I may take him up on that, but there are other, more efficient ways to deal with the data that comes over, e.g. there is no need to fetch archived records (17k at present) each and every time.

Missy (wife) had a Teams meeting with her bosses who had brought in a couple of IT folks to help us get clean data. It didn't last long since Edgar had already cracked that nut. After the meeting, her bosses' boss responded with a word: "FANTASTIC!"

Fantastic is an understatement. While we were working on this, I did my best to assist. Edgar was patient and painstakingly explained the JSON process to me while I muddled through it. Quite frankly, he would have had it done a lot sooner had I not "helped". Even now he hits me up on WhatsApp with excited texts about process improvements.

@Edgar_ you are a Rock Star and as promised, what they "pay" me will be passed along and you have my "mark" in perpetuity.

Thank you...
 

Isaac

Lifelong Learner
NauticalGent said:
I have been able to see table information from a site that requires a password. I can inspect the table and see the columns and the data...but no matter what I try, I cannot extract the data.

Several methods I have found using Dr. Google have failed and about 4 hours of ChatGPT have gotten me a bunch of code that does not work. It even gave me code to loop through the elements to find ANY table and it returns a "No Tables Found" message - by design of course.

I KNOW there is a table but something is keeping me from getting there...

The good news is that the company has agreed to compensate me for my time - which I will pass along to whoever wants to get me over this hump. PM me if you are interested.

One thing to try when all else fails: the simple "Get Data" from a webpage in Excel.
If you are lucky, and this is one of the very rare times that it actually works, you can then automate the refreshing of it with VBA.

I did this once when I wanted to be alerted about new Craigslist ads. I used "Get Data" from a website in Excel manually, then set up VBA to automate it.
If the refreshed data ever exceeded X number of rows, I'd know there was a new Craigslist ad for the search terms.
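
Something along these lines - just a sketch, assuming the query output lands in a table on a sheet called "WebData" and that 25 is the baseline row count; both of those, and the message, are made up for illustration:

Code:
' Refresh a manually created "Get Data From Web" query and check the row count.
Public Sub CheckForNewRows()
    Const THRESHOLD As Long = 25                  ' hypothetical baseline row count
    Dim ws As Worksheet
    Dim lo As ListObject
    Dim rowCount As Long

    Set ws = ThisWorkbook.Worksheets("WebData")   ' assumed sheet holding the query output
    Set lo = ws.ListObjects(1)                    ' the table created by Get Data

    lo.QueryTable.Refresh BackgroundQuery:=False  ' pull the page again
    rowCount = lo.DataBodyRange.Rows.Count

    If rowCount > THRESHOLD Then
        MsgBox "New rows since last check: " & (rowCount - THRESHOLD)
    End If
End Sub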
 

NauticalGent

Ignore List Poster Boy
Isaac said:
One thing to try when all else fails: the simple "Get Data" from a webpage in Excel.
If you are lucky, and this is one of the very rare times that it actually works, you can then automate the refreshing of it with VBA.
Tried that as well - the issue was that it only pulled the first 25 records, for the same reason(s) mentioned in post #18. This project was a lot of fun and I have learned a lot. Edgar has even offered to show me how to find these things I mentioned once you get "under the hood".
 

jdraw

Super Moderator
Staff member
Good stuff John. You should be in line for a raise or promotion!
Glad Edgar was able to solve the issue.
I'm sure it would be a great addition to Sample Databases or similar.
I'd like to see it and I'm just a dabbler.
Congrats Edgar(y) and John(y)
 
