I am trying to extract the href from the lists contained within a list. This is the code I have used to create the list:
    import requests as re
    from bs4 import BeautifulSoup as bs
    import lxml

    category_page = re.get("https://sv.epaenlinea.com/automotriz.html")
    category_soup = bs(category_page.text, 'lxml')
    url = [i.find_all('a')
           for i in category_soup.find_all('ol', {'class': 'items'})[-1].find_all('li', {'class': 'item'})]
    print(url)
This is the output generated:
[[<a href="https://sv.epaenlinea.com/automotriz/accesorios-exterior.html">Accesorios exterior</a>], [<a href="https://sv.epaenlinea.com/automotriz/seguridad-automotriz.html">Seguridad automotriz</a>], [<a href="https://sv.epaenlinea.com/automotriz/accesorios-interior.html">Accesorios interior</a>], [<a href="https://sv.epaenlinea.com/automotriz/limpieza-y-cuidado.html">Limpieza y cuidado</a>], [<a href="https://sv.epaenlinea.com/automotriz/lubricantes-y-aditivos.html">Lubricantes y aditivos</a>], [<a href="https://sv.epaenlinea.com/automotriz/llantas.html">Llantas</a>], [<a href="https://sv.epaenlinea.com/automotriz/baterias-y-accesorios.html">Baterías y accesorios</a>]]
This is the output I would like:
["https://sv.epaenlinea.com/automotriz/accesorios-exterior.html", "https://sv.epaenlinea.com/automotriz/seguridad-automotriz.html", . . . "https://sv.epaenlinea.com/automotriz/baterias-y-accesorios.html"]
Any ideas on how to do it?
Answer
You can use the attrs property of a bs4 tag, which exposes the tag's attributes as a dict. Take the first <a> in each list item and read its 'href' key.
This should work:
    import requests as re
    from bs4 import BeautifulSoup as bs
    import lxml

    category_page = re.get("https://sv.epaenlinea.com/automotriz.html")
    category_soup = bs(category_page.text, 'lxml')
    url = [i.find_all('a')[0].attrs['href']
           for i in category_soup.find_all('ol', {'class': 'items'})[-1].find_all('li', {'class': 'item'})]
    print(url)
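If you want to try the idea without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML string and the extract_hrefs helper are made up for illustration and only mimic the structure shown in the question. It also assumes that skipping list items without a link is acceptable, whereas indexing with [0] would raise an IndexError on such an item. It uses the built-in 'html.parser' so lxml is not needed, and it collects every link in each item rather than only the first.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup shaped like the page in the question.
SAMPLE_HTML = """
<ol class="items">
  <li class="item"><a href="https://sv.epaenlinea.com/automotriz/accesorios-exterior.html">Accesorios exterior</a></li>
  <li class="item"><a href="https://sv.epaenlinea.com/automotriz/llantas.html">Llantas</a></li>
  <li class="item">No link here</li>
</ol>
"""


def extract_hrefs(html):
    """Return the href of every <a> inside the last <ol class="items">."""
    soup = BeautifulSoup(html, "html.parser")
    lists = soup.find_all("ol", {"class": "items"})
    if not lists:
        # No matching list on the page, so there is nothing to extract.
        return []
    hrefs = []
    for item in lists[-1].find_all("li", {"class": "item"}):
        # href=True keeps only <a> tags that actually carry an href attribute.
        for link in item.find_all("a", href=True):
            hrefs.append(link["href"])
    return hrefs


urls = extract_hrefs(SAMPLE_HTML)
print(urls)
```

link["href"] is shorthand for link.attrs["href"], so both spellings read the same attribute dict.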