Getting the right tag for web scraping using rvest

I’m having some trouble finding the right tag to scrape the text I want from a web page. A sample of the HTML is below. I want to scrape the text “Melbourne Storm has achieved 4 Tries Brisbane Broncos has achieved 2 Tries”.

The R code I have been using is below, and I just can’t seem to get the text I want.

library(rvest)

url <- 'https://www.nrl.com/draw/nrl-premiership/2019/round-1/storm-v-broncos/'
RawTable <- read_html(url)
RawTable <- html_nodes(RawTable, '.u-visually-hidden')
RawTable <- html_text(RawTable)
RawTable <- data.frame(RawTable)

HTML Code:

<div class="Match-centre-summary o-shadowed-box u-spacing-mb-small">
      <span class="u-visually-hidden">Melbourne Storm has achieved 4 Tries Brisbane Broncos has achieved 2 Tries</span>

Answer

Typically, special tools such as RSelenium are required for web pages like this one. Looking at this page, however, the data you want appears to be stored as JSON in an attribute, which the browser then renders.

In this case one can retrieve the attribute’s data using rvest and then convert the JSON data into a list and/or a dataframe.

library(rvest)
library(dplyr)
library(jsonlite)

url <- 'https://www.nrl.com/draw/nrl-premiership/2019/round-1/storm-v-broncos/'
page <- read_html(url)

contentnodes <- page %>%
  html_nodes("div.l-content.pre-quench") %>%
  html_attr("q-data") %>%
  jsonlite::fromJSON()

Here we look for the div node with class “l-content pre-quench”. That node has an attribute named “q-data”, and it is this attribute’s data that we want to retrieve. fromJSON() converts the attribute’s JSON into a list containing many nested lists and data frames, which hold all of the information associated with the match.
You’ll need to work out the structure yourself to find the information you want.
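
To work out that structure, str() and names() are usually enough. The sketch below runs offline against a small made-up page, so the JSON keys in it (match, homeTeam, awayTeam, tries) are assumptions for illustration only and are not taken from the real nrl.com data. The same calls should apply to the list returned for the real page.

```r
library(rvest)
library(jsonlite)

# Hypothetical stand-in for the real page: a div whose q-data attribute
# holds JSON. The key names here are invented for illustration.
json <- '{"match":{"homeTeam":{"nickName":"Storm","tries":4},"awayTeam":{"nickName":"Broncos","tries":2}}}'
html <- paste0("<div class=\"l-content pre-quench\" q-data='", json, "'></div>")

page <- read_html(html)

# Same steps as in the answer: select the div, read the attribute, parse the JSON.
match_data <- page %>%
  html_nodes("div.l-content.pre-quench") %>%
  html_attr("q-data") %>%
  fromJSON()

# Inspect the top levels first, then drill down one name at a time.
str(match_data, max.level = 2)
names(match_data$match)

# Once the path is known, pull out the value with $.
home_tries <- match_data$match$homeTeam$tries
away_tries <- match_data$match$awayTeam$tries
```

On the real page the list is much larger, so keeping max.level small in str() and descending step by step is easier than printing everything at once.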
