Solved Scraping a table from Website (1 Viewer)

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
I have been able to see table information from a site that requires a password. I can inspect the table and see the columns and the data...but no matter what I try, I cannot extract the data.

Several methods I have found using Dr. Google have failed and about 4 hours of ChatGPT have gotten me a bunch of code that does not work. It even gave me code to loop through the elements to find ANY table and it returns a "No Tables Found" message - by design of course.

I KNOW there is a table but something is keeping me from getting there...

The good news is that the company has agreed to compensate me for my time - which I will pass along to whoever wants to get me over this hump. PM me if you are interested.
 

jdraw

Super Moderator
Staff member
Local time
Today, 04:23
Joined
Jan 23, 2006
Messages
15,379
Hi John,

Your current experience with ChatGPT sounds ominously the same as many of my experiences involving miniZinc and Access VBE. Lots and lots of hallucinations and code samples that don't work, but promote methods/properties that don't exist. Nice to see I have company.;)
 

Josef P.

Well-known member
Local time
Today, 10:23
Joined
Feb 2, 2023
Messages
826
Just as an idea: would a workaround via Excel be possible?
Example: https://www.w3schools.com/html/html_tables.asp
Code:
let
    Source = Web.BrowserContents("https://www.w3schools.com/html/html_tables.asp"),
    #"Extracted Table From Html" = Html.Table(Source, {{"Column1", "TABLE[id='customers'] > * > TR > :nth-child(1)"}, {"Column2", "TABLE[id='customers'] > * > TR > :nth-child(2)"}, {"Column3", "TABLE[id='customers'] > * > TR > :nth-child(3)"}}, [RowSelector="TABLE[id='customers'] > * > TR"]),
    #"Promoted Headers" = Table.PromoteHeaders(#"Extracted Table From Html", [PromoteAllScalars=true])
in
    #"Promoted Headers"
 

Gasman

Enthusiastic Amateur
Local time
Today, 09:23
Joined
Sep 21, 2011
Messages
14,301
I think @Edgar_ was helping another user getting data from a web page.
I cannot find the thread ATM though. :(
 

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
I think @Edgar_ was helping another user getting data from a web page.
I cannot find the thread ATM though. :(
As fate would have it, Edgar responded rather quickly so he and I will be getting after it sometime this morning. Once we (he) has cracked this nut, I will post the details here so others may benefit.
 

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
i think scraping is very easy as in my currency conversion?
Exchange Rate/ currency conversion | Access World Forums (access-programmers.co.uk).

scraping codes are "everywhere" on the net.
Thanks Arnel, your code is what I tried first. But the issue is, all the code I have tried, including yours, fails to find a table on the site. Here is a HTML snippet from the page:
HTML:
<div class="ignite-data-table-wrapper has-content">
            <table>
              <thead>
                
  <!---->

  <!---->

  <tr>
 
        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="id" data-test-requests-list-sorting="request_number" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-0" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        ID
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="name" data-test-requests-list-sorting="name" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-1" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Subject
      
  </div></ukg-th>

        <ukg-th data-test-requests-list-column="category" class="sc-ukg-th-h sc-ukg-th-s data-cell hydrated ukg-data-table-col-2" tabindex="-1" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Category
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="employee" data-test-requests-list-sorting="requested_for" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-3" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Requested for
      
  </div></ukg-th>

        <ukg-th data-test-requests-list-column="assignees" class="sc-ukg-th-h sc-ukg-th-s data-cell hydrated ukg-data-table-col-4" tabindex="-1" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Assignees
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="creationDate" data-test-requests-list-sorting="created_at" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-5" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Created on
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="dueDate" data-test-requests-list-sorting="due_date" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-6" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Due date
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="priority" data-test-requests-list-sorting="priority" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort hydrated ukg-data-table-col-7" tabindex="0" scope="col" role="columnheader"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Priority
      
  </div></ukg-th>

        <ukg-th aria-sort="none" sortable="true" data-test-requests-list-column="status" data-test-requests-list-sorting="status" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-sort ukg-is-sticky--right hydrated ukg-data-table-col-8 ukg-is-last-sticky" tabindex="0" scope="col" role="columnheader" style="right: 115.422px;"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
        Status
      
  </div></ukg-th>


      <ukg-th sticky="right" class="sc-ukg-th-h sc-ukg-th-s data-cell ukg-is-sticky--right hydrated ukg-data-table-col-9" tabindex="-1" scope="col" role="columnheader" style="right: 0px;"><!----><div class="column-container align-left sc-ukg-th sc-ukg-th-s">
    
  </div></ukg-th>

 
</tr>

  <!---->

  <!---->

  <!---->

              </thead>
              <tbody>
                  
  <!---->

  <!---->

  <!---->

  <!---->

  <!---->

 
    <tr id="ember44-row-0" data-test-ignite-paginated-table-row-index="0" data-test-ignite-paginated-table-row="21688">
 
        <td data-test-requests-list-number="" class="ukg-data-table-col-0">
  <span>
    21688
  </span>
</td>

        <td class="ukg-data-table-col-1">
  <a id="ember45" class="ember-view" data-dd-name="Request list > Request name link (efaccc88-cf7c-4b7b-9c3b-b6f7058bb869)" data-dd-privacy="mask" data-test-requests-list-name="" href="/v2/requests/efaccc88-cf7c-4b7b-9c3b-b6f7058bb869">
    Payroll - Direct Deposit
  </a>

I KNOW there is a table there, but so far everything I have tried as failed to find it...
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 16:23
Joined
May 7, 2009
Messages
19,243
as you can "obviously" see.
there are:
ID, SUBJECT, CATEGORY, REQUESTED FOR, ASSIGNEE , ETC..
on that table.
you just parse the paired < /> bracket
and you can extract the info you need.
anyway, goodluck.
 

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
Yes, I can see they are there, it is getting Access (or any app that uses VBA) to see it too. I know there is a way, but I sure as hell do not know it.
 

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
Just as an idea: would a workaround via Excel be possible?
Absolutely! The company just want to get clean data so they cans use Power BI to make reports. The folks for sold them the website did not provide an API and the .csv export feature is horrible.

That is the most comprehensive tutorial on HTML I have ever seen. I have used W3Schools before and they never fail to impress. Just a quick glance through the pages explained a lot of the tags and what they mean. Appreciate you sharing it.
 

AccessBlaster

Registered User.
Local time
Today, 01:23
Joined
May 22, 2010
Messages
5,953
Some websites actively try to defeat scraping, hopefully this isn't the case.
 

cheekybuddha

AWF VIP
Local time
Today, 09:23
Joined
Jul 21, 2014
Messages
2,280
Place the html file you sent me in the same folder as this db.

Then open Immediate Window (Ctrl+G) and type ?DoIt and hit Enter.

Should give you a start.
 

Attachments

  • ng.accdb
    308 KB · Views: 44

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 16:23
Joined
May 7, 2009
Messages
19,243
here is your extractor (in Excel).
told you, the code is just on the net.
 

Attachments

  • NauticalAgent.zip
    20.9 KB · Views: 41

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
Good morning folks,

From some of the responses here, PM's and even emails, it seems I have ruffled a few feathers. I really didn't understand where the ire was coming from until I had some time to think on it.

From what I have given you, all the code I have been provided DOES work. But, like is used to tell my customers while I was on active duty (Logistics/Supply): "I gave you what you asked for, not what you need."

When I tried to plug in my site, the code provided would either fail silently or error on various lines. What was even more curious to me was that EVERY bit of code, when processing the extracted HTML would only return 25 records when there are over 19k.

Since the site is password protected and uses two-tier security - in this case a one-time pin - giving everyone access was not possible. I was able to give one person the "keys to the kingdom" and he, with is very impressive knowledge of website structure, was able to figure out what the issue(s) are.

I will try to explain it here, but when I get it wrong, understand that I am doing trouble shooting by braille here and I am only making a vain attempt to explain what has been explained to me. I know I am going to muck this up, but what the hell...

I assumed that web-scrapping covered ALL methods of getting info from the web, which is why I used that term in the title. The site I am using is more a less a front-end to another site that accesses/stores the raw-data. While I was talking to Edgar, he kept using the term API which confused me because I was assured by the company that there were in fact no API's and that the company who developed the front-end site would not make one. No idea why, but they were quite clear that it was a no-go. Simply getting a connection to their server was also a non-starter since the ability to export the data was provided. In other words, IT made the data accessible; job done. However, the export is hideous and to "clean" the data takes days due to the amount of human interaction.

He corrected me and said these were REST API's - which was clear as dog-shit mud until I did a google search after he explained it to me. This explained why I was only getting the 25 records and why web-scraping would not work. To programmatically "click" the "Next" button to load the next 25 records, process them and repeat would have been painful.

He has found a way to leverage this REST API and get up to 100 records in JSON format at a time, but is still working on how to get a better product.

I hope this clears things up, if not, ask and I will try to explain it, or even better, Edgar will. I like learning new things, but this has me feeling 'purt near stupid...
 

cheekybuddha

AWF VIP
Local time
Today, 09:23
Joined
Jul 21, 2014
Messages
2,280
I suspected you might need some pagination, hence my question to you via PM!

If you can hit the REST API's directly and not have to faff about with the HTML at all life is a lot simpler (and probably much easier w.r.t. the login/2fa process too!)
 

NauticalGent

Ignore List Poster Boy
Local time
Today, 04:23
Joined
Apr 27, 2015
Messages
6,341
I suspected you might need some pagination, hence my question to you via PM!

If you can hit the REST API's directly and not have to faff about with the HTML at all life is a lot simpler (and probably much easier w.r.t. the login/2fa process too!)
He came up with a bookmarklet process to bypass the login that is just simply brilliant!
 

Users who are viewing this thread

Top Bottom