Same selector for different types of data in web scraping using Puppeteer

I’m a novice web developer and started coding recently.

I’m only familiar with HTML/CSS/JS and Node.

I’m currently working on a page scraper project using Puppeteer.

PROBLEM: In markup like the following, the same selector matches different types of data
(in this case, a[rel="tag"]).

<span class="clip-link">

  <h4>Stars:</h4>
  <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
  <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>
  <h4>Singers:</h4>
  <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
  <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>

OR

<span class="clip-link">

  <h4>Stars:</h4>
  <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
  <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
  <a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
  <h4>Singers:</h4>
  <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
  <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
</span>

The only consistent difference between these tags is in their URL, in the path segment just after the domain name.


QUESTION

How do I select and categorize these tags based on the URL difference (".com/ACTORS/" or ".com/SINGERS/") and then get the innerText of each element, so I can store them like this:

actors = ["Darshan Raval","Priyanka Chopra"]
singers = ["Hardy Sandhu","Amit Trivedi"]

OR

actors = ["Darshan Raval","Priyanka Chopra","Amir Khan"]
singers = ["Hardy Sandhu","Amit Trivedi"]

The number of “Stars” and “Singers” differs from page to page, so I can’t rely on a fixed array length.

Answer

You can try something like this:

import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();

const html = `
  <!doctype html>
  <html>
    <head><meta charset='UTF-8'><title>Test</title></head>
    <body>
      <span class="clip-link">
        <h4>Stars:</h4>
        <a href="https://www.media.com/ACTORS/darshan-raval/" rel="tag">Darshan Raval</a>,
        <a href="https://www.media.com/ACTORS/priyanka-chopra/" rel="tag">Priyanka Chopra</a>,
        <a href="https://www.media.com/ACTORS/amir-khan/" rel="tag">Amir Khan</a>
        <h4>Singers:</h4>
        <a href="https://www.media.com/SINGERS/hardy-sandhu/" rel="tag">Hardy Sandhu</a>,
        <a href="https://www.media.com/SINGERS/amit-trivedi/" rel="tag">Amit Trivedi</a>,
      </span>
    </body>
  </html>`;

try {
  const [page] = await browser.pages();

  await page.goto(`data:text/html,${html}`);

  const data = await page.evaluate(() => {
    const tags = [...document.querySelectorAll('a[rel="tag"]')];
    return tags.reduce((persons, tag) => {
      const type = tag.pathname.split('/')[1];
      persons[type] ??= [];
      persons[type].push(tag.innerText);
      return persons;
    }, {});
  });
  console.log(data);
} catch (err) {
  console.error(err);
} finally {
  await browser.close();
}

Output:

{
  ACTORS: [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ],
  SINGERS: [ 'Hardy Sandhu', 'Amit Trivedi' ]
}
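
Note that the keys above come straight from the URL, so they are uppercase (ACTORS, SINGERS), while the question asked for actors and singers. As a rough sketch of the same grouping idea run outside the browser, the snippet below lowercases the first path segment and uses it as the key. The names groupByFirstPathSegment and links are hypothetical and only for illustration, and it assumes the hrefs and texts have already been collected as plain objects; it uses Node's global URL class in place of the anchor element's pathname property.

```javascript
// Hypothetical helper (not part of Puppeteer): groups link texts by the
// first path segment of each href, lowercased.
function groupByFirstPathSegment(links) {
  const groups = {};
  for (const { href, text } of links) {
    // For ".../ACTORS/darshan-raval/" the pathname is "/ACTORS/darshan-raval/",
    // so splitting on "/" gives ["", "ACTORS", "darshan-raval", ""].
    const segment = new URL(href).pathname.split('/')[1];
    const key = segment.toLowerCase();
    groups[key] ??= [];
    groups[key].push(text.trim());
  }
  return groups;
}

// Sample data mirroring the markup in the question (assumed to be already
// extracted from the page as { href, text } pairs).
const links = [
  { href: 'https://www.media.com/ACTORS/darshan-raval/', text: 'Darshan Raval' },
  { href: 'https://www.media.com/ACTORS/priyanka-chopra/', text: 'Priyanka Chopra' },
  { href: 'https://www.media.com/ACTORS/amir-khan/', text: 'Amir Khan' },
  { href: 'https://www.media.com/SINGERS/hardy-sandhu/', text: 'Hardy Sandhu' },
  { href: 'https://www.media.com/SINGERS/amit-trivedi/', text: 'Amit Trivedi' },
];

// Default to empty arrays in case a page has no links of one type.
const { actors = [], singers = [] } = groupByFirstPathSegment(links);
console.log(actors);  // [ 'Darshan Raval', 'Priyanka Chopra', 'Amir Khan' ]
console.log(singers); // [ 'Hardy Sandhu', 'Amit Trivedi' ]
```

Because the arrays are built by pushing whatever matches, the number of stars or singers on a given page does not matter. The same lowercasing step could be added to the reduce callback inside page.evaluate if you prefer to keep everything in the browser context.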
