Jsoup scraping image URL results in data:image/gif;base64,

I’m starting to learn Jsoup and want to scrape the Tesco webstore. Here is a link:

https://www.tesco.com/groceries/en-GB/shop/fresh-food/all

I want to get the image URL of a product. When I inspect the page in Google Chrome, I see something like this:

<img src="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg"
     alt="Tesco British Unsalted Butter 250G"
     class="product-image"
     srcset="https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_90x90.jpg 768w, https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg 4000w">

But my code:

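import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = null;
try {
    doc = Jsoup.connect("https://www.tesco.com/groceries/en-GB/shop/home-and-ents/all?page=20").get();
} catch (IOException e) {
    e.printStackTrace();
}
// doc stays null if the request failed, so guard before using it
if (doc != null) {
    System.out.println(doc.getElementsByClass("product-image-wrapper").get(0));
}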

results in:

<a href="/groceries/en-GB/products/295626079" aria-hidden="true" class="product-image-wrapper" tabindex="-1">
 <div class="product-image__container">
  <img src="" alt="Sterling Blue Superkings 100 Pack" class="product-image">
 </div></a>
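
An empty src like the one above, or the data:image/gif;base64, placeholder from the title, can be detected in code before deciding whether a JS-capable tool is needed. The following is only a minimal sketch using the standard library; the class name ImageSrcCheck and the method isPlaceholderSrc are hypothetical helpers, not part of Jsoup.

```java
import java.util.Locale;

public class ImageSrcCheck {

    /**
     * Returns true when the src value is missing, blank, or an inline
     * data: URI, i.e. when it does not point at a real image file.
     */
    public static boolean isPlaceholderSrc(String src) {
        if (src == null) {
            return true;
        }
        String normalized = src.trim().toLowerCase(Locale.ROOT);
        return normalized.isEmpty() || normalized.startsWith("data:");
    }

    public static void main(String[] args) {
        // Values taken from the question: empty src, the placeholder, and a real URL.
        System.out.println(isPlaceholderSrc(""));
        System.out.println(isPlaceholderSrc("data:image/gif;base64,"));
        System.out.println(isPlaceholderSrc(
                "https://img.tesco.com/Groceries/pi/321/5054775188321/IDShot_225x225.jpg"));
    }
}
```

With Jsoup, the value to check would be something like element.attr("src") for each img.product-image element.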

I think the problem is that the image URLs are filled in by JavaScript, which Jsoup does not execute. Is there any way to get the URL as I see it in Chrome, or should I use a more powerful tool such as HtmlUnit or Selenium?

Answer

I ended up switching to Selenium. It may be slower, but at least I’m making progress. I also tried HtmlUnit, but it seems to handle JavaScript poorly.
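
For anyone wanting a starting point, a Selenium version could look roughly like the sketch below. It is not the exact code used here: it assumes Selenium 4 with a local Chrome installation, takes the img.product-image selector from the markup in the question, and picks a 15 second timeout arbitrarily. Whether the site serves the same markup to an automated browser is also an assumption.

```java
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.WebDriverWait;

public class TescoImageScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.tesco.com/groceries/en-GB/shop/home-and-ents/all?page=20");

            // Wait until the first product image has a real src,
            // i.e. not empty and not an inline data: placeholder.
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(15));
            String src = wait.until(d -> {
                // Selector assumed from the markup shown in the question.
                String value = d.findElement(By.cssSelector("img.product-image"))
                                .getDomAttribute("src");
                boolean usable = value != null && !value.isEmpty() && !value.startsWith("data:");
                return usable ? value : null; // null tells the wait to keep polling
            });

            System.out.println(src);
        } finally {
            // Always close the browser, even if the wait times out.
            driver.quit();
        }
    }
}
```

If the images only load as they scroll into view, the page may also need to be scrolled before the wait succeeds.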
