Scraping some text off an Amazon webpage... (1 Viewer)

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
So I seek to have VBA go fetch an amazon product listing & extract the bit of text that tells me if Amazon themselves are the seller of the product.

I found this ...

https://www.wiseowl.co.uk/blog/s393/scrape-website-html.htm

...which is a great start/primer, but I'm still struggling!

I wish to pull in a bit of data from a webpage that would look like this...

https://www.amazon.co.uk/dp/B001KOTNG2

specifically the bit of text that says "Dispatched from and sold by Amazon"

So here's what I've got...
Code:
[B]Sub ImportAmazonProductData()
Dim ie As InternetExplorer
Dim html As HTMLDocument
Set ie = New InternetExplorer
ie.Visible = False
ie.navigate "https://www.amazon.co.uk/dp/B001KOTNG2"
Do While ie.ReadyState <> READYSTATE_COMPLETE
DoEvents
Loop
Set html = ie.Document
Set ie = Nothing
[/B]

' this is the parsing bit (fail!)....
Set SoldBySection = html.getElementById("merchant-info")
Set SoldBy = SoldBySection.Children

For Each Seller In SoldBy
Debug.Print Seller
Next

End Sub

The bit that's bolded seems to work, but it's the parsing of the returned data that I'm struggling with.

hen I view the Amazon's webpage source code, the data I seek seems to be underneath this section/element...

<div id="merchant-info" class="a-section a-spacing-mini">

...I realise this is probably a big ask, but my brain a bit fogged - can anyone lend a helping hand (it's likely to be a quick fix for those that are bit tidy with parsing data in VBA! - I'm not!)
 

arnelgp

..forever waiting... waiting for jellybean!
Local time
Today, 10:33
Joined
May 7, 2009
Messages
14,111
Can you post the captured html
 

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
Can you post the captured html

Actually, I think that might be the problem, when I tried to debug.print the returned html, there wasn't any ...since posting, I've now had a good night's sleep - so I'm off to delve further & perhaps take another approach - thanks!
 

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
I also got no output using that approach
However, I already had similar code used in a Web Version Checker database:
https://www.access-programmers.co.uk/forums/showthread.php?t=293970

So I've adapted that to create a more generalised web scraper
It definitely works for your example giving the output 'Gift wrap available.'



See attached database

NOTE:
1. Whilst this works for your specific example, I need to debug this properly by testing on other web pages and with other search strings
I am aware that Write conflicts occur when the dataset is changed.
Will post an update once I've fixed issues


2. Also, a few months ago, you started a thread about parsing JSON files into Access.
I've since done a lot of work on this idea.

If you are still interested, see this thread: JSON parser database
 

Attachments

  • Capture.PNG
    Capture.PNG
    20.5 KB · Views: 358
  • WebScraper.zip
    560.1 KB · Views: 78
Last edited:

Lightwave

Ad astra
Local time
Today, 03:33
Joined
Sep 27, 2004
Messages
1,452
Colin that looks quite impressive I hope to try it at some point
 

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
Colin that looks quite impressive I hope to try it at some point

Thanks but its just version 1 and definitely not finished.
I need to make lots of changes before it can be used for any webpage without errors.
I hope to do more on this in the next few days
 

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
I'm not at my home PC at the moment but this looks very promising - I really appreciate the time you've put into this & look forward to attempting to integrate it into my larger overall body of code when I get home!

(I actually need to Access traverse down a list of about 150 products - I have the ASINs (Amazon name for a product idetifier) in a database, so I just need traverse down a table containing the ASINs to change the main URL each time & that should be it!)

2. Also, a few months ago, you started a thread about parsing JSON files into Access.I've since done a lot of work on this idea.

You have a good memory.... since then, other (more pressing) tasks took precedent , but I'll definitely becoming back to JSON again - great work!
 

jdraw

Super Moderator
Staff member
Local time
Yesterday, 22:33
Joined
Jan 23, 2006
Messages
13,862
Pesky,

You haven't been around for a while. When you told us you were leaving Access because you couldn't
get Access to work with JSO
N, I asked for general assistance from the forum. Colin took the task to heart and created and tested a variety of algorithms and code to develop his parser. A few colleagues have offered some comments and testing, and have found the parser to be quite good.
I realize you have been very busy (as most 1 man shops hope to be), but I'd like to know your status re JSON.
 

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
I'm not at my home PC at the moment but this looks very promising - I really appreciate the time you've put into this & look forward to attempting to integrate it into my larger overall body of code when I get home!

(I actually need to Access traverse down a list of about 150 products - I have the ASINs (Amazon name for a product idetifier) in a database, so I just need traverse down a table containing the ASINs to change the main URL each time & that should be it!)

I'm not quite clear what you are trying to get from each page.
The final part of the page URL is the ASIN
e.g. URL = https://www.amazon.co.uk/dp/B001KOTNG2; ASIN = B001KOTNG2

The ASIN is mentioned numerous times within the page source e.g.



NOTE: As the page is so large, I'm getting a lot of system resource exceeded errors.
Hopefully I can get around this but it could be a major issue

If you can upload the relevant parts of your db containing the ASINs & URLs together with examples of what you need from each page, I'll use that as part of my testing when I do some more on this web scraper.

The easiest way might be to email it to me using the link in my signature line

You have a good memory.... since then, other (more pressing) tasks took precedent , but I'll definitely becoming back to JSON again - great work!

I remember as I have spent countless hours on this & hope eventually to make it a commercial product - with a niche audience?
 

Attachments

  • Capture1.PNG
    Capture1.PNG
    75.6 KB · Views: 326
Last edited:

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
To clarify - I just want to scrape a bit of text on an Amazon product webpage to see if Amazon are the seller of a product (if Amazon aren't selling the product themselves, then another seller's name will appear instead - so the text I am hunting for is "Dispatched from & Sold by Amazon"), hopefully this will clarify....



So I will simply have a table of ASIN, traverse down the list, construct the URL with the ASIN on the end, then scrape the text, update a table with the result (true or false), then move down the table to the next ASIN

So for that hammer the result would look like this

ASIN..................Result
B001KOTNG2........True
B001KOTNH1

(the dots above I inserted are just to help with the formatting on here)
 

jdraw

Super Moderator
Staff member
Local time
Yesterday, 22:33
Joined
Jan 23, 2006
Messages
13,862
I found this link https://www.amazon.co.uk/dp/B0073H1CIC where the item is not dispatched and sold by Amazon.
I ran Colin's database and it found the item and provided the SupplierName as per the attached jpg.

Pesky,

Can you provide a sample --say 5-10 of your ASINs -- for testing?
Also, please see my earlier post re status of your JSON endeavours.

Also, also I think if you format some text with tabs and/or spaces, your format will be sustained if you use code tags around that text.

Code:
ASIN            ABCD
ASIN            xyzw   some other text

Good luck.
 

Attachments

  • RiddersWebScrape01.jpg
    RiddersWebScrape01.jpg
    59.1 KB · Views: 116

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
I found this link https://www.amazon.co.uk/dp/B0073H1CIC where the item is not dispatched and sold by Amazon.
I ran Colin's database and it found the item and provided the SupplierName as per the attached jpg.

Pesky,

Can you provide a sample --say 5-10 of your ASINs -- for testing?
Also, please see my earlier post re status of your JSON endeavours.

Also, also I think if you format some text with tabs and/or spaces, your format will be sustained if you use code tags around that text.

Code:
ASIN            ABCD
ASIN            xyzw   some other text

Good luck.

Not my particular ASINs, but 5 randomly selected ASINs ...

https://www.amazon.co.uk/dp/B0073H1CIC (not sold by Amazon)
https://www.amazon.co.uk/dp/B00UQ8D5OE (not sold by Amazon)
https://www.amazon.co.uk/dp/B01LWQBKMO (sold by Amazon)
https://www.amazon.co.uk/dp/B0001IW8TW (sold by Amazon)
https://www.amazon.co.uk/dp/B0009VX1ZG (sold by Amazon)


(though please note, Amazon dynamically allocates who the seller is, so it's sort of like a snapshot for that moment - it can change quite often)

...therefore I seek this type of result
Code:
[B]ASIN                  Result  [/B]
B0073H1CIC            False
B00UQ8D5OE            False
B01LWQBKMO            True
B0001IW8TW            True
B0009VX1ZG            True

I've just had a chance to run Colin's database - very impressive, but being candid, I couldn't wrap my head around the code! (to recap - I'm not really a programmer...just a kludger....who normally gets some form of outcome through brute force....not coding elegance!)
 
Last edited:

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
Pesky,

You haven't been around for a while. When you told us you were leaving Access because you couldn't
get Access to work with JSO
N, I asked for general assistance from the forum. Colin took the task to heart and created and tested a variety of algorithms and code to develop his parser. A few colleagues have offered some comments and testing, and have found the parser to be quite good.
I realize you have been very busy (as most 1 man shops hope to be), but I'd like to know your status re JSON.

Yes I remember it well...at that time there was no meaningful JSON VBA parsing option available ...it looked like I was going to move website cart software & all their related APIs were JSON orientated (& therefore I was considering going with a different programming language) ....well in the end I didn't migrate cart software (not because I don't want to , but as you've pointed out as a 1 man operation, the tasks that have the highest urgency get my attention first(!)...so moving cart software (to a system with JSON APIs) went on the back burner. If I'm truthful, I'm still way too busy to contemplate seeing what has been achieved with JSON, but it's great that Colin has picked up the gauntlet so to speak :) & I'll certainly be wanting to take a close look at what's been achieved in a couple of month''s time
 

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
Not at my computer at the moment but...

1. Have you tested those 5 URLs to see what you get?
Obviously it won't be true or false as I wasn't testing for that originally.

2. I can explain the code if you wish. However, if all you want is the true/ false output, does it matter if you can't fully understand the code?

3. Do all Amazon uk URLs have the same structure? /dp/ followed by ASIN?

4. There are many uses for web scraping but in this particular case, I'm not sure I see the point. If you have been to the web page (to get its URL) can't you just note who the seller is when you save the URL?
 

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
Not at my computer at the moment but...

1. Have you tested those 5 URLs to see what you get?
Obviously it won't be true or false as I wasn't testing for that originally.

2. I can explain the code if you wish. However, if all you want is the true/ false output, does it matter if you can't fully understand the code?

3. Do all Amazon uk URLs have the same structure? /dp/ followed by ASIN?

4. There are many uses for web scraping but in this particular case, I'm not sure I see the point. If you have been to the web page (to get its URL) can't you just note who the seller is when you save the URL?

Hello Colin,

I went back & took a look at your code...it looks (to me) like you are saving the webpage html to a text file, then parsing the text file using filesystem object to pull out the data of interest? Clearly that approach works, but I'm wondering whether there's a quicker way, lightwave pasted a link which might be more efficient https://www.pulseinfomatics.com/vba-...-htmldocument/ - just got to wrap my head around the terms used wrt html structure (I've sort of partially grasped XML, but never had to parse html prior)

In answer to your questions...

1. I tried, but wasn't successful (though that was probably down to me & not understanding what I needed to do!)

2. I'll need a loose understanding of the approach (& the code), if I'm to adapt/integrate into my own code.

3. Yes, just use the root URL https://www.amazon.co.uk/dp/ & append the ASIN onto the end of it, therefore these product ID (ASINs) would be tagged onto the end to then interrogate each product page B0073H1CIC B00UQ8D5OE B01LWQBKMO B0001IW8TW B0009VX1ZG

4. I've just started supplying Amazon with products to sell ...things aren't going as smooth as I'd hoped, so I need to 'Police' their movements a little! (to ensure that Amazon are selling the items across all their 5 EU sites) - I have 150 products, there are 5 Amazon marketplaces (UK, Germany, France, Spain & Italy), therefore that's 750 product pages to check - assuming I can check 4 pages per minute manually - that would be 3 hours work (& draining/exhausting!!). Like I say the pages aren't static but continually changing ...which compounds the amount of time needed to check. Far better to run a bit of VBA last thing before heading off to bed - even if it take Access a couple of hours - I'd rather it carried the heavy data crunching burden vs. me!

Normally I'd use an API & pull the data out of their returned XML, but surprisingly, for all Amazon offer numerous APIs, they don't appear have an API that shows who's selling a product (ASIN)....hence needing to scrape the data from webpages.
 
Last edited:

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
Lightwave
Many thanks for the link
I may contact the author to discuss further but one line caught my attention

it took about 35 minutes
to retrieve the profiles for 60 companies.
Now I don't know how complex the source data was but to my mind that's ridiculously slow.

For the past few months I've been retrieving various data from websites, parsing it and importing into normalised Access tables. This can usually be done in about a second once the data format is understood.

For example see these links:
https://www.access-programmers.co.uk/forums/showthread.php?t=295378
https://www.access-programmers.co.u...grammers.co.uk/forums/showthread.php?t=295789
 

isladogs

CID VIP
Local time
Today, 03:33
Joined
Jan 14, 2017
Messages
15,411
Hello Colin,

I went back & took a look at your code...it looks (to me) like you are saving the webpage html to a text file, then parsing the text file using filesystem object to pull out the data of interest? Clearly that approach works, but I'm wondering whether there's a quicker way, lightwave pasted a link which might be more efficient https://www.pulseinfomatics.com/vba-...-htmldocument/ - just got to wrap my head around the terms used wrt html structure (I've sort of partially grasped XML, but never had to parse html prior)

Yes you understood the method I'm using correctly.
Having read the link provided by Lightwave, my initial impression is that the other approach is much less efficient. See my previous reply to Lightwave

1. I tried, but wasn't successful (though that was probably down to me & not understanding what I needed to do!)

Just enter the URL into my db and click Run

Normally I'd use an API & pull the data out of their returned XML, but surprisingly, for all Amazon offer numerous APIs, they don't appear have an API that shows who's selling a product (ASIN)....hence needing to scrape the data from webpages.

I would be amazed if Amazon don't provide the API you describe.
If not, ask them & I expect it will be added within a reasonable time frame
Out of interest can you give me a link to their APIs so I can investigate what they do provide
 

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
So I found this useful tutorial...

http://analystcave.com/excel-scrape-html-element-id/

which looked to be a good contender, so I modified it a little...

Code:
Public Function GetElementById()
    Dim url As String
    Dim id As String
    Dim SoldByAmazon As Boolean
    Dim XMLHTTP As Object, html As Object, objResult As Object
    Set XMLHTTP = CreateObject("MSXML2.serverXMLHTTP")
    XMLHTTP.Open "GET", url, False
    XMLHTTP.setRequestHeader "Content-Type", "text/xml"
    XMLHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:25.0) Gecko/20100101 Firefox/25.0"
    url = "https://www.amazon.co.uk/dp/B001KOTNG2"
    id = "merchant-info"
    XMLHTTP.send
    Set html = CreateObject("htmlfile")
    html.body.innerHTML = XMLHTTP.responseText
    Set objResult = html.GetElementById(id)
    GetElementById = objResult.innerHTML
    GetElementById = Mid(GetElementById, 29, 6)
    If GetElementById = "Amazon" Then
    SoldByAmazon = True
    Else
    SoldByAmazon = False
    End If
   
End Function

& it works - yay! (i.e. if Amazon are the seller, the variable "SoldByAmazon" is true.... so now have something (simple) that I can modify to use as a function.

Thanks for all your contributions ...and by all means keep this discussion going ...as web scraping is something that many seem to be interested in :)
 

peskywinnets

Registered User.
Local time
Today, 03:33
Joined
Feb 4, 2014
Messages
567
I would be amazed if Amazon don't provide the API you describe.

Prepare to be amazed(!) - this is their closest API - close indeed, but no cigar...

https://docs.developer.amazonservices.com/en_UK/products/Products_GetCompetitivePricingForASIN.html

...it doesn't yield the explicit information I need.

If not, ask them & I expect it will be added within a reasonable time frame

Believe me...I've dealt with Amazon for years..they are a beast that move slowly unless there are thousands of people screaming for a result/action. Their API team are hard to get at *unless* you are a big player & prepared to cross their hands with silver.

Out of interest can you give me a link to their APIs so I can investigate what they do provide

There are two main 'libraries'...

https://developer.amazonservices.co.uk/gp/mws/docs.html (this is where all the bread & butter work is done)

http://docs.aws.amazon.com/AWSECommerceService/latest/DG/CHAP_ApiReference.html
 

Users who are viewing this thread

Top Bottom