BeautifulSoup4: exclude a div that is nested inside a wrapper

I tried to get all the text out of the following HTML structure:

<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>

My approach was something like this:

targets = soup.find_all("div", class_=["header", "container"])

for html_row in targets:
    for row in html_row.strings:
        print(row)

Output:

Header
Sub Header
Target_2
Target_3
Target_4
Sub Header

My problem is that “Sub Header” is printed twice: once as part of the container div, and once more because the nested div also has the header class. How can I exclude the header div that sits inside the container div? I have to select the elements by these classes.

Answer

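You can pass recursive=False to find_all(). It then only looks at the direct children of the object it is called on (here soup, so only the top-level divs), and the header div nested inside the container is no longer matched separately: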

from bs4 import BeautifulSoup


html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
targets = soup.find_all("div", class_=["header", "container"], recursive=False)

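for tag in targets:
    # stripped_strings yields each text fragment with surrounding whitespace removed
    for text in tag.stripped_strings:
        print(text)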

Output:

Header
Sub Header
Target_2
Target_3
Target_4
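
Note that recursive=False relies on the divs being direct children of soup. That holds for this fragment parsed with "html.parser", but it would not hold if the divs sit deeper in a full page, or with a parser such as lxml that wraps fragments in html and body tags. As a rough sketch of an alternative for those cases (the names wanted, matches, top_level and lines are just illustrative, not from the question), you could run the normal recursive search and then drop every match that already has a matching ancestor, using find_parent():

```python
from bs4 import BeautifulSoup

html = """
<div class="header"><h1>Header</h1></div>
<div class="container">
    <div class="header"><h1>Sub Header</h1></div>
    <p>Target_2</p>
    <p>Target_3</p>
    <p>Target_4</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
wanted = ["header", "container"]

# Normal (recursive) search: this also returns the nested header div.
matches = soup.find_all("div", class_=wanted)

# Keep only matches that are not inside another matching div.
top_level = [
    tag for tag in matches
    if tag.find_parent("div", class_=wanted) is None
]

# Collect the text fragments of the remaining tags, whitespace stripped.
lines = [text for tag in top_level for text in tag.stripped_strings]
print("\n".join(lines))
```

This should print the same five lines as above. It does a little more work than recursive=False, but it does not depend on where the divs sit in the tree.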
