How can I scrape a list from Wikipedia and convert it to a dataframe?

I want to get the "List of Helsinki neighbourhoods" from the Wikipedia page (https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki) and convert it into a dataframe. Ideally, the main neighborhood (with a two-digit code) would be in one column and its subdivisions (with three-digit codes) in another.

I used the following code:

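import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki'
data = requests.get(url)
soup = BeautifulSoup(data.content, "html.parser")
# Slice of all <li> elements on the page that covers the neighbourhood list
helsinki_neiborhood_raw = soup.find_all('li')[7:171]
helsinki_neiborhood_raw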

This returned the following (only part of helsinki_neiborhood_raw is shown, since the full output is too long):

[<li>01 <a href="/wiki/Kruununhaka" title="Kruununhaka">Kruununhaka</a> <i>(Kronohagen)</i></li>,
 <li>02 <a href="/wiki/Kluuvi" title="Kluuvi">Kluuvi</a> <i>(Gloet)</i></li>,
 <li>03 <a href="/wiki/Kaartinkaupunki" title="Kaartinkaupunki">Kaartinkaupunki</a> <i>(Gardestaden)</i></li>,
 <li>04 <a href="/wiki/Kamppi" title="Kamppi">Kamppi</a> <i>(Kampen)</i></li>,
 <li>05 <a href="/wiki/Punavuori" title="Punavuori">Punavuori</a> <i>(Rödbergen)</i></li>,
 <li>06 <a href="/wiki/Eira" title="Eira">Eira</a></li>,
 <li>07 <a href="/wiki/Ullanlinna" title="Ullanlinna">Ullanlinna</a> <i>(Ulrikasborg)</i></li>,
 <li>08 <a href="/wiki/Katajanokka" title="Katajanokka">Katajanokka</a> <i>(Skatudden)</i></li>,
 <li>09 <a href="/wiki/Kaivopuisto" title="Kaivopuisto">Kaivopuisto</a> <i>(Brunnsparken)</i></li>,
 <li>10 <a href="/wiki/S%C3%B6rn%C3%A4inen" title="Sörnäinen">Sörnäinen</a> <i>(Sörnäs)</i>
 <ul><li>102 <a href="/wiki/Kalasatama" title="Kalasatama">Kalasatama</a> <i>(Fiskehamnen)</i></li></ul></li>,
 <li>102 <a href="/wiki/Kalasatama" title="Kalasatama">Kalasatama</a> <i>(Fiskehamnen)</i></li>,

**How can I extract only the code and the name of each neighborhood from the above output and turn them into a dataframe with columns ("Code", "Main_neighborhood", "Sub_neighborhood")?**

Answer

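You can do something like this, which builds each row from the list item's leading code, the title of its link, and the italic text in parentheses: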

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Subdivisions_of_Helsinki'
data = requests.get(url)
soup= BeautifulSoup(data.content, "html.parser")
helsinki_neiborhood_raw = soup.find_all("div", {"class": "div-col"})[0].find_all("li")

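# find() searches only inside each <li>; find_next() would also match tags in later items.
rows = []
for item in helsinki_neiborhood_raw:
    link, italic = item.find("a"), item.find("i")
    rows.append([item.get_text().split(" ")[0],
                 link.get("title") if link else None,
                 # italic text is the parenthesised alternative name, e.g. "(Kronohagen)"
                 italic.get_text()[1:-1] if italic else None])
df = pd.DataFrame(rows, columns=("Code", "Main_neighborhood", "Sub_neighborhood"))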
print(df.head())
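
Note that the code above fills the third column with the italic name in parentheses, not with the three-digit subdivision the question asks for. The following is a minimal sketch of one way to pair each subdivision with its main neighborhood, assuming the nesting shown in the question's output (a subdivision `<li>` sits inside a `<ul>` within its parent `<li>`). The names `parse_neighbourhoods` and `SAMPLE_HTML` are hypothetical, and the sample markup is abridged from the question rather than fetched from the live page, whose structure may differ.

```python
from bs4 import BeautifulSoup

# Abridged markup copied from the output shown in the question (an assumption
# about the page structure, not a live fetch).
SAMPLE_HTML = """
<ul>
<li>09 <a href="/wiki/Kaivopuisto" title="Kaivopuisto">Kaivopuisto</a> <i>(Brunnsparken)</i></li>
<li>10 <a href="/wiki/S%C3%B6rn%C3%A4inen" title="Sörnäinen">Sörnäinen</a> <i>(Sörnäs)</i>
<ul><li>102 <a href="/wiki/Kalasatama" title="Kalasatama">Kalasatama</a> <i>(Fiskehamnen)</i></li></ul></li>
</ul>
"""


def parse_neighbourhoods(html):
    """Return (code, main_neighborhood, sub_neighborhood) tuples."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.find_all("li"):
        link = li.find("a")  # first link inside this item
        if link is None:
            continue  # skip list items that are not neighbourhood entries
        code = li.get_text().split()[0]  # leading two- or three-digit code
        name = link.get_text()
        parent_li = li.find_parent("li")  # enclosing <li>, if this item is nested
        if parent_li is None:
            # Top-level item: a main neighborhood with no subdivision on this row
            rows.append((code, name, None))
        else:
            # Nested item: a subdivision; take the main name from the parent
            rows.append((code, parent_li.find("a").get_text(), name))
    return rows


for row in parse_neighbourhoods(SAMPLE_HTML):
    print(row)
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=("Code", "Main_neighborhood", "Sub_neighborhood"))`. Because each `<li>` is visited exactly once, this also avoids treating the nested item as a separate main neighborhood.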
