I am not sure how to proceed here. Here is an example of the page that I am trying to scrape:
http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722
Right now my XPath selects the div with class "article" and then its <p> elements. I can always eliminate the first one because it is the same stock news text on every page (city, Yonhap News, reporter, etc.). I am evaluating word densities, so leftover boilerplate like this could be a problem for me 🙁
The issue comes towards the end of the article. If you look there, you will see a reporter email address and the date and time of publishing…
The problem is that the number of <p> tags at the end varies between pages of this site, so I cannot just eliminate the last two; on some pages that still messes with my results.
How would you go about eliminating those <p> elements towards the end? Do I just have to try and scrub my data afterwards?
Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?
# Get all the text from the listed div, then apply the regex to find all words in the Hangul range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(u'[\uac00-\ud7af]+')
# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]
Answer
You can tweak your XPath expression so that it does not include the p tag that has class="adrs" (the date of publishing):
//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()