Extract html with no span class attribute and same div class attributes

I have found similar questions but none that directly address my issue. I have worked on this for about a week now with no luck.

I am trying to scrape data from this link: https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070

The issue is, the value I am looking for has no span-class attribute but when using the div class attributes, it shares the same name as other values on the page. I want my code to return $22,807 but anything I try either returns $25,195 or []. See the following HTML:

<div class="text-right col-3 col-sm-4 col-md-6">
    <div class="label-block label-block-1 label-block-sm-2 text-muted" data-qa="vehicle-header-msrp" 
    data-test="vehicleHeaderMsrp">
        <div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
        <div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
        <div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
            <span class="pricing-block-amount-strikethrough">$25,195</span>
          </div>
     </div>
</div>


<div class="text-right col-3 col-sm-4 col-md-6">
    <div class="label-block label-block-1 label-block-sm-2" data-qa="vehicle-header-average-market-price" 
    data-test="vehicleHeaderAverageMarketPrice">
        <div class="label-block-title" data-qa="LabelBlock-title" data-test="labelBlockTitle"></div>
        <div class="label-block-subtitle" data-qa="LabelBlock-subTitle" data-test="labelBlockSubTitle"></div>
        <div data-qa="LabelBlock-text" class="label-block-text" data-test="labelBlockText">
            <span class="">$22,807</span>
          </div>
     </div>
</div>

I can easily get the $25,195 returned with the following code:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

headers = {
   "User-Agent":
   "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:70.0) Gecko/20190101 Firefox/70.0"
}

url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"
print(url)
page = requests.get(url, headers=headers)
    
soup = BeautifulSoup(page.content, 'html.parser')

test = soup.find('span', {'class': 'pricing-block-amount-strikethrough'})
print(test.get_text())

But no combination of calls that I try will return the $22,807 that I need.

What’s interesting is that I can get the $25 value if I use

test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2 text-muted'})

So I assumed that I could simply delete the “text-muted” part like:

test = soup.find('div', {'class': 'label-block label-block-1 label-block-sm-2'})

to get the $22 number but it just returns [ ].

Disclaimer: the dollar amount that I need changes frequently so if you help with this and end up getting a number slightly different than $22,807 it might still be correct. If you click on the link, the number I am looking for is the “Market Average” not the “MSRP.”

Thank you!

Answer

If you open the page in a browser, the second value you are looking for takes a moment to appear. The requests module fetches the initial HTML right away and does not wait for the page to finish loading, so that value is not in the response yet. This is where Selenium comes in alongside bs4: let the browser load the page, wait, and then pass the page source to BeautifulSoup.

You can download the geckodriver from link

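import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

url = "https://www.truecar.com/prices-new/chevrolet/malibu-pricing/?zipcode=44070"

# Selenium 4 takes the driver path through a Service object;
# older releases used webdriver.Firefox(executable_path=...) instead.
driver = webdriver.Firefox(service=Service(executable_path=r'geckodriver.exe'))
try:
    driver.get(url)
    time.sleep(7)  # crude fixed wait so the page's scripts can fill in the prices
    soup = BeautifulSoup(driver.page_source, 'html.parser')
finally:
    driver.quit()  # always close the browser, even if loading fails

# Each price sits in a div.label-block-text; print the text of every span found
for block in soup.find_all('div', {'class': 'label-block-text'}):
    span = block.find('span')
    if span is not None:  # skip blocks that have no span
        print(span.get_text())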
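
The loop above prints every price on the page, MSRP included. If you only want the Market Average, one option is to select on the data-test attribute shown in the question's HTML instead of the shared class names. The following is a minimal sketch, assuming the rendered page still uses data-test="vehicleHeaderAverageMarketPrice"; it runs against a trimmed copy of the markup from the question rather than the live site, and market_average is just a helper name made up for illustration. With Selenium you would pass driver.page_source to it instead of the sample string.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted in the question (not a live fetch).
html = """
<div class="label-block label-block-1 label-block-sm-2 text-muted" data-test="vehicleHeaderMsrp">
    <div class="label-block-text" data-test="labelBlockText">
        <span class="pricing-block-amount-strikethrough">$25,195</span>
    </div>
</div>
<div class="label-block label-block-1 label-block-sm-2" data-test="vehicleHeaderAverageMarketPrice">
    <div class="label-block-text" data-test="labelBlockText">
        <span class="">$22,807</span>
    </div>
</div>
"""


def market_average(markup):
    """Return the market-average price text, or None if the block is absent.

    Hypothetical helper: it assumes the data-test value from the question's HTML.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # The data-test attribute is unique to the market-average block,
    # unlike the class names it shares with the MSRP block.
    span = soup.select_one(
        'div[data-test="vehicleHeaderAverageMarketPrice"] div.label-block-text span'
    )
    return span.get_text(strip=True) if span else None


print(market_average(html))
```

Returning None when the block is missing makes it easy to tell "the page has not finished loading" apart from a real price, which is the situation you run into with plain requests.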
