So I am not sure how to proceed here. I have an example of the page that I am trying to scrape.
Now I have the XPath selecting the `article` div class and then the subsequent `<p>`s. I can always eliminate the first one, because it is the same stock news text (city, yonhapnews, reporter, etc.). I am evaluating word densities, so this could be a problem for me 🙁
The issue comes in towards the end of the article: there is a reporter email address and a date and time of publishing.
The problem is that on different pages of this site there are different numbers of `<p>` tags towards the end, so I cannot just eliminate the last two; it still messes with my results sometimes.
How would you go about eliminating those particular `<p>` elements towards the end? Do I just have to scrub my data afterwards?
Here is the code snippet that selects the path and eliminates the first `<p>` and the last two. How should I change it?
```python
# Gets all the text from the listed div, then applies the regex to find
# all word objects in the Hangul syllables range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(ur'[\uac00-\ud7af]+')
# For yonhapnews the first and the last two <p>'s are useless,
# everything else should be good
hangul_syllables = hangul_syllables[1:-2]
```
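One thing worth noting about the snippet above: Scrapy's `.re()` returns one flat list of matched words across all the selected `<p>`s, so `[1:-2]` removes the first word and the last two words, not the first and last two paragraphs. The difference can be sketched in plain Python with made-up paragraph texts (the real page's contents are not shown here, so these strings are purely illustrative):

```python
import re

# Hypothetical paragraph texts standing in for the <p> nodes on the page
paragraphs = [
    u'서울=연합뉴스 기자',   # boilerplate byline (the first <p> to drop)
    u'기사 본문 첫 단락',
    u'기사 본문 둘째 단락',
    u'기자 이메일 주소',     # trailing <p> to drop
    u'송고 시간',            # trailing <p> to drop
]

hangul_re = re.compile(u'[\uac00-\ud7af]+')

# Word-level slice (what .re(...)[1:-2] does): drops words, not paragraphs,
# so words from the unwanted trailing paragraphs survive.
words = [w for p in paragraphs for w in hangul_re.findall(p)]
word_sliced = words[1:-2]

# Paragraph-level slice: drop the first and last two <p>'s first,
# then extract words from what remains.
para_sliced = [w for p in paragraphs[1:-2] for w in hangul_re.findall(p)]
```

With this sample data, `word_sliced` still contains words such as `이메일` from the trailing paragraphs, while `para_sliced` contains only the article-body words.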
You can tweak your XPath expression not to include the `p` tag having `class="adrs"` (the date of publishing):
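A sketch of what that filter might look like, run against a made-up fragment with the stdlib `xml.etree.ElementTree` (its XPath subset has no `not()`, so the class check is done in Python; the markup below is illustrative, only the `adrs` class name comes from the page being discussed):

```python
import re
import xml.etree.ElementTree as ET

# Made-up fragment in the shape described in the question
html = u"""
<div class="article">
  <p>서울=연합뉴스 기자</p>
  <p>기사 본문 첫 단락</p>
  <p class="adrs">2013/01/01 10:00 송고</p>
</div>
"""

root = ET.fromstring(html)
# In Scrapy the same filter is a single XPath predicate:
#   response.xpath('//*[@class="article"]/p[not(@class="adrs")]//text()')
# ElementTree cannot express not(), so filter the attribute in Python instead.
paras = [p for p in root.findall('.//p') if p.get('class') != 'adrs']
texts = [t for p in paras for t in p.itertext()]
hangul_syllables = [w for t in texts
                    for w in re.findall(u'[\uac00-\ud7af]+', t)]
```

The `adrs` paragraph is excluded before the regex ever runs, so the date/time text never reaches the word list; the byline `<p>` can still be dropped with a slice on `paras` rather than on the word list.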