So, I was starting a project to scrape ‘https://www.gumtree.com/cars/uk’, extract the prices of all used cars, and experiment with machine learning algorithms on that data. However, when I use the requests library alongside Beautiful Soup to fetch the HTML, the description text on the page doesn’t come through.
Here’s the Beautiful Soup result: as you can see, instead of the description of the car, I got something like ‘amp;lhblk;▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄&’.
Am I doing anything wrong?
Here’s my code till now:
```python
from bs4 import BeautifulSoup as bs
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json

cars = requests.get('https://www.gumtree.com/cars/uk')
soup = bs(cars.content, 'lxml')  # the 'lxml' parser belongs here, not in requests.get
# Get the div with class 'srp-results', which holds the search results
match = soup.find('div', class_='srp-results')
print(match)
```
A few things: the site you mention is dynamic, so I guess part of the content is modified by script (I didn’t open it to check). Other times you may be blocked. Here is some example code. From the content I saw, I would select ‘p.listing-description’ instead; I also excluded text strings that have 10 or fewer distinct characters.
```python
descrs = []
for p in soup.find_all('p', class_='listing-description'):
    if len(set(p.text)) > 10:  # skip obfuscated runs of repeated characters
        descrs.append(p.text)
        print('-' * 80)
        print(p.text)
```
This prints what it finds and collects the text of each matching node into a list.
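To make the filter above concrete, here is a minimal, self-contained sketch that runs against an inline HTML snippet instead of the live site (the class name mirrors the answer; the sample listing text and the obfuscated block are invented stand-ins for what the real page returns):

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched page: one real description and one
# obfuscated block like the '▄▄▄…' run the question shows.
html = """
<div class="srp-results">
  <p class="listing-description">2015 Ford Fiesta, one owner, full service history.</p>
  <p class="listing-description">&#9604;&#9604;&#9604;&#9604;&#9604;</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Keep only descriptions with more than 10 distinct characters;
# obfuscated runs repeat one or two characters, so they are dropped.
descrs = [
    p.text for p in soup.find_all('p', class_='listing-description')
    if len(set(p.text)) > 10
]
print(descrs)
```

On the real page the HTML would come from `requests.get(...).content` instead of the inline string, but the selection and filtering logic is the same.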