Grab all hrefs with Puppeteer

const href = await page.evaluate(() => {
  let array = Array.from(document.querySelectorAll("table tr td a").href);
  return array.map((array) => array.innerText);
});

I have been trying to use this block of JS to do this, but it is not working. It just keeps returning undefined.

But when I do document.querySelector("table tr td a").href; it works but it only gives the first one (there are multiple)!

How do I do it properly?

Answer

When debugging code that Puppeteer serializes and injects into the driven browser, keep in mind:

  • the code is run in a browser context
  • state isn’t shared with the Node environment except through serialized parameters and return values (you can’t access Node functions or ElementHandles in the browser)
  • page console.logs aren’t visible in the Node environment by default

With these things in mind, you can attach a listener for the page's console event so that console.log output from your browser code shows up in Node. In many cases you can also simply run the code by hand, outside of Puppeteer, in a console on the scraped page to validate that it works, as shown below.
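A minimal sketch of forwarding browser console output to Node, assuming `page` is an already-open Puppeteer Page. The helper name `forwardBrowserConsole` and the "[browser]" prefix are made up for illustration; `page.on("console", ...)` and `msg.text()` are the Puppeteer calls it relies on:

```javascript
// Hypothetical helper (not part of Puppeteer): relay every console message
// emitted in the page to a Node-side sink, which defaults to console.log.
function forwardBrowserConsole(page, sink = console.log) {
  page.on("console", (msg) => sink(`[browser] ${msg.text()}`));
}
```

Call it once right after creating the page, before any page.evaluate calls, so that console.log statements inside the evaluated function are printed in the Node terminal.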

The issue here is the difference between querySelector (returns the first element matching a CSS selector) and querySelectorAll (returns a NodeList of all elements matching a selector). .href is not a property of a NodeList, so querySelectorAll(...).href evaluates to undefined. The property access needs to be applied to each element in the NodeList, for example with map, which is available after converting the NodeList to an array.
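Applied to the code from the question, the fix could look roughly like the sketch below. It assumes `page` is an already-open Puppeteer Page; the function names `collectHrefs` and `getAllHrefs` are illustrative. Note that it maps each element to `href` rather than `innerText`, since the goal is the link targets:

```javascript
// Runs in the browser context: Puppeteer serializes this function and
// calls it inside the page, passing `selector` as a serialized argument.
function collectHrefs(selector) {
  // Convert the NodeList to an array first, then read href per element.
  const anchors = Array.from(document.querySelectorAll(selector));
  return anchors.map((a) => a.href);
}

// Runs in Node: `page` is assumed to be an open Puppeteer Page.
async function getAllHrefs(page, selector = "table tr td a") {
  return page.evaluate(collectHrefs, selector);
}
```

Puppeteer's page.$$eval(selector, (els) => els.map((e) => e.href)) should be a shorter route to the same result, since it runs querySelectorAll for you and passes the matches to the callback as an array.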

Once you’ve given yourself the ability to console.log, it’s easy to debug this by simply printing the return values of every function call to see which properties exist.

The code below illustrates the difference between these two functions and works in Puppeteer just as it does in a stack snippet:

// print first element's href
console.log(document.querySelector("table tr td a").href);

// print all elements' hrefs
console.log([...document.querySelectorAll("table tr td a")].map(e => e.href));

<table>
  <tr>
    <td>
      <a href="foo.html">foo</a>
      <a href="bar.html">bar</a>
      <a href="baz.html">baz</a>
    </td>
  </tr>
</table>
