How to extract multiple tables from HTML in Python

I want to extract the data from every security bulletin table in the HTML at https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html. With my code I can only extract the data one table at a time; it cannot extract the data from all of the tables at once.

This is my code

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html"
html_content = requests.get(url).text

soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())
gdp = soup.find_all("table")

table = gdp[0]  # only the first table is processed
body = table.find_all("tr")
head = body[0]
body_rows = body[1:]

headings = []
for item in head.find_all("td"):
    headings.append(item.text.rstrip("\n"))

all_rows = []  # will be a list of lists, one list per row
for body_row in body_rows:  # a row at a time
    row = []  # this will hold the entries for one row
    for row_item in body_row.find_all("td"):
        aa = re.sub(r"(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)

df = pd.DataFrame(data=all_rows, columns=headings)
df.to_csv('C:/Users//AdobeAir-APSB16-23 Security Update Available for Adobe AIR.csv')
df.head()

The output of the code is

Bulletin ID Date Published  Priority
0   APSB21-13   February 09 2021    3

For this code, I imported libraries such as BeautifulSoup, requests, pandas, and re. I hope someone can help me extract all of the table data at once and convert it to CSV format. Thank you.

Answer

You can make pandas do the heavy lifting for you with read_html:

url = 'https://helpx.adobe.com/security/products/dreamweaver/apsb21-13.html'
dfs = pd.read_html(url, header=0)
dfs[1]

Output:

             Product  Affected Versions           Platform
0  Adobe Dreamweaver               20.2  Windows and macOS
1  Adobe Dreamweaver               21.0  Windows and macOS

P.S. It outputs a list of all tables found in the HTML. For example, dfs[0] is the first table:

  Bulletin ID     Date Published  Priority
0   APSB21-13  February 09, 2021         3
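Since read_html returns one DataFrame per table, you can combine it with the CSV export from the question in a single loop. A minimal sketch, using an inline HTML string as a stand-in for the live URL (which may require a requests fetch if the server rejects pandas' default user agent), with hypothetical file names:

```python
from io import StringIO

import pandas as pd

# Stand-in for the page's HTML; read_html also accepts a URL or a file path.
html = """
<table>
  <tr><td>Bulletin ID</td><td>Date Published</td><td>Priority</td></tr>
  <tr><td>APSB21-13</td><td>February 09, 2021</td><td>3</td></tr>
</table>
<table>
  <tr><td>Product</td><td>Affected Versions</td><td>Platform</td></tr>
  <tr><td>Adobe Dreamweaver</td><td>20.2</td><td>Windows and macOS</td></tr>
  <tr><td>Adobe Dreamweaver</td><td>21.0</td><td>Windows and macOS</td></tr>
</table>
"""

# One DataFrame per <table>, first row used as the header.
dfs = pd.read_html(StringIO(html), header=0)

# Write each table to its own CSV file (table_0.csv, table_1.csv, ...).
for i, df in enumerate(dfs):
    df.to_csv(f"table_{i}.csv", index=False)
```

read_html needs an HTML parser installed (lxml or html5lib); wrapping the literal string in StringIO avoids the deprecation warning newer pandas versions emit for raw strings.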
