I would like to get an HTML string without certain elements. However, up front I only know which elements to keep, not which ones to drop.
Let’s say I just want to keep all p and a tags inside the div with class "A". Given this input:

    <div class="A">
        <p>Text1</p>
        <img src="A.jpg">
        <div class="sub1">
            <p>Subtext1</p>
        </div>
        <p>Text2</p>
        <a href="url">link text</a>
    </div>
    <div class="B"> ContentDiv2 </div>

I would like to get this output:

    <div class="A">
        <p>Text1</p>
        <p>Text2</p>
        <a href="url">link text</a>
    </div>
If I knew the selectors of all the other elements, I could just use drop_tree(). But the problem is that I don’t know ['img', 'div.sub1', 'div.B'] up front.
    import lxml.cssselect
    import lxml.html

    tree = lxml.html.fromstring(html_str)
    elements_drop = ['img', 'div.sub1', 'div.B']
    for j in elements_drop:
        selector = lxml.cssselect.CSSSelector(j)
        for e in selector(tree):
            e.drop_tree()
    output = lxml.html.tostring(tree)
I’m still not entirely sure I understand correctly, but it seems like you may be looking for something resembling this:
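    # `tree` is the parsed document from the question
    target = tree.xpath('//div[@class="A"]')[0]  # xpath() returns a list; take the first match
    to_keep = target.xpath('.//p | .//a')  # the leading dot limits the search to descendants of target
    for t in target.xpath('.//*'):
        if t not in to_keep:
            t.getparent().remove(t)  # I believe this method is better here than drop_tree()
    print(lxml.html.tostring(target).decode())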
The output I get is your expected output.