How to eliminate certain elements when scraping?

I am not sure how to proceed here. Here is an example of a page that I am trying to scrape:

http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722

Right now my XPath selects the div with class 'article' and then the <p> elements under it. I can always eliminate the first one because it is the same stock news text (city, Yonhap News, reporter, etc.). I am evaluating word densities, so leftover text like this could be a problem for me.

The issue comes towards the end of the article. If you look there, you will find a reporter email address and the date and time of publishing.

The problem is that different pages of this site have different numbers of <p> tags towards the end, so I cannot just eliminate the last two; sometimes that still messes with my results.

How would you go about eliminating those <p> elements towards the end? Do I just have to scrub my data afterwards?

Here is the code snippet that selects the path and eliminates the first <p> and the last two. How should I change it?

# Gets all the text from the listed div, then applies the regex to find all words in the Hangul syllable range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(u'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

Answer

You can tweak your XPath expression so that it excludes the p tag with class="adrs" (the date of publishing):

//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()
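In Scrapy, that expression would simply replace the XPath string passed to response.xpath(...) in the question's snippet. To see what the filter does without Scrapy, here is a minimal standard-library sketch that mimics it in plain Python. The SAMPLE markup is a made-up, simplified stand-in for the article page (not the real HTML), and hangul_words is a hypothetical helper name. ElementTree does not support not(contains(...)), so that part of the filter is written as a Python check instead.

```python
# -*- coding: utf-8 -*-
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the article markup (assumption, not the real page).
SAMPLE = (
    u'<html><body>'
    u'<div class="article">'
    u'<p>첫번째 문단 <b>강조</b></p>'
    u'<p>두번째 문단</p>'
    u'<p class="adrs">2015/06/05 송고</p>'
    u'</div>'
    u'</body></html>'
)

# Same character range as the regex in the question.
HANGUL = re.compile(u'[\uac00-\ud7af]+')


def hangul_words(markup):
    """Collect Hangul words from article <p> elements, skipping class="adrs"."""
    root = ET.fromstring(markup)
    words = []
    # Equivalent of //*[@class="article"]/p
    for p in root.findall(".//*[@class='article']/p"):
        # Equivalent of the [not(contains(@class, "adrs"))] predicate
        if "adrs" in p.get("class", ""):
            continue
        # Equivalent of //text(): all text nodes under this <p>
        text = u"".join(p.itertext())
        words.extend(HANGUL.findall(text))
    return words


words = hangul_words(SAMPLE)
```

Because the filter works on the class attribute rather than on position, it does not matter how many <p> tags a given page has at the end, as long as the publishing-date paragraph carries that class.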
