Weird characters when scraping with Requests and Beautiful Soup

I have the following code, but it returns rows with garbled characters, such as Luka DonÄić instead of Luka Dončić.

import pandas as pd
from requests import get
from bs4 import BeautifulSoup

scrapTable = get('https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats')
scrapTable.encoding = 'utf-8-sig'
soup_a = BeautifulSoup(scrapTable.content, 'html.parser')
table = soup_a.find('table')
df_nba_PerGame = pd.read_html(str(table), encoding='utf8')[0]

Any idea what's wrong?

Answer

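The document appears to contain UTF-8 text whose individual bytes are encoded as numeric HTML character references (for example, č would arrive as &#196;&#141;), so the parser turns each byte into a separate character. To decode the document, convert the references back to raw bytes and then decode the result as UTF-8: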

import re
import html
import pandas as pd
from requests import get
from bs4 import BeautifulSoup


scrapTable = get(
    "https://widgets.sports-reference.com/wg.fcgi?css=1&site=bbr&url=%2Fleagues%2FNBA_2021_per_game.html&div=div_per_game_stats"
)


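# Replace each numeric reference (e.g. &#196;) with the raw byte it stands for.
s = re.sub(
    rb"&#(\d+);",
    lambda g: b"%c" % int(g.group(1)),
    scrapTable.content,
)

# Round-trip through latin1 so the bytes survive while html.unescape resolves
# any named entities, then decode the bytes as UTF-8.
s = (
    html.unescape(s.decode("latin1"))
    .encode("latin1", "ignore")
    .decode("utf-8", "ignore")
)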

soup = BeautifulSoup(s, "html.parser")
table = soup.find("table")
df_nba_PerGame = pd.read_html(str(table), encoding="utf8")[0]
print(df_nba_PerGame)

Prints:

...

176  129          Donte DiVincenzo     SG   24  MIL  66  66  27.5   3.8   9.1   .420  2.0   5.2   .379   1.8   3.9   .475   .528  0.8   1.1   .718  1.2   4.5   5.8   3.1  1.1  0.2  1.4  1.7  10.4
177  130               Luka Dončić     PG   21  DAL  66  66  34.3   9.8  20.5   .479  2.9   8.3   .350   6.9  12.2   .567   .550  5.2   7.1   .730  0.8   7.2   8.0   8.6  1.0  0.5  4.3  2.3  27.7
178  131             Luguentz Dort     SG   21  OKC  52  52  29.7   4.8  12.3   .387  2.2   6.3   .343   2.6   6.0   .432   .475  2.3   3.2   .744  0.7   2.9   3.6   1.7  0.9  0.4  1.5  2.6  14.0

...
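
To see the decoding step in isolation, without hitting the network, here is a minimal sketch that applies the same idea to a hand-made sample. The helper name decode_byte_entities and the sample bytes are made up for illustration; the sample assumes the page sends č and ć as the references for their UTF-8 bytes (196, 141 and 196, 135).

```python
import html
import re


def decode_byte_entities(raw: bytes) -> str:
    """Hypothetical helper: turn references that encode UTF-8 bytes into text."""

    def to_byte(match: re.Match) -> bytes:
        value = int(match.group(1))
        # Only 0-255 fits in a single byte; larger references are left as-is.
        return bytes([value]) if value < 256 else match.group(0)

    # Step 1: replace each &#NNN; with the raw byte it stands for.
    raw = re.sub(rb"&#(\d+);", to_byte, raw)
    # Step 2: latin1 maps bytes to characters one-to-one, so this round trip
    # keeps the bytes intact while html.unescape resolves named entities.
    text = html.unescape(raw.decode("latin1"))
    # Step 3: back to bytes, then decode those bytes as UTF-8.
    return text.encode("latin1", "ignore").decode("utf-8", "ignore")


# Assumed sample: each byte of the UTF-8 encoding sent as its own reference.
sample = b"<td>Luka Don&#196;&#141;i&#196;&#135;</td>"
print(decode_byte_entities(sample))
```

One limitation of this sketch: a reference above 255 (a real code point rather than a byte) is unescaped in step 2 and then dropped by the "ignore" in step 3, so it only suits pages that use the byte-per-reference scheme throughout.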