Parsing JavaScript Output Using Selenium and BeautifulSoup

I have extracted the page's JavaScript data using Selenium and can see that the data I need ("meeting_summary_reference") is contained in a dictionary-like structure. I could not get Python's json module to parse this type of data, and test2_text and test33 both come back blank, so the tag is not being converted to text. BeautifulSoup's string methods do not work for me either, and I am not proficient with complex regex. I am at a loss as to what to try next.

from urllib.parse import urlparse
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path='C:/A38/chromedriver_win32/chromedriver.exe')

# Navigate to the application home page

innerHTML = driver.execute_script("return document.body.innerHTML")
print("\nJS PAGE SOURCE:", "\n", driver.page_source)

j_str = driver.page_source
html = j_str
bsObj = BeautifulSoup(html, "html.parser")
print("\nBSOBJ:", "\n", bsObj.prettify())

test2 = bsObj.find('script', attrs={'id': '__NEXT_DATA__'})
print("\nTEST2: \n", test2)
print("\nTYPE TEST2: \n", type(test2))
print("\nLENGTH TEST2: \n", len(test2))
test2_text = bsObj.find('script', attrs={'id': '__NEXT_DATA__'}).getText()
print("\nTEST2_TEXT: \n", test2_text)

test33 = test2.find(text = "meeting_summary_reference")
print("\nTEST33: \n", test33)


I am not sure you can extract that information with BeautifulSoup, but you can use this regex:

import re
test2_text = bsObj.find('script', attrs={'id': '__NEXT_DATA__'})  # changed too: keep the tag itself, no .getText() call
pattern = r'"meeting_summary_reference":{(.*?)({.*})?},'
test33 = re.findall(pattern, str(test2_text), re.M)

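This finds all nested objects named meeting_summary_reference. Because the pattern has two capture groups, re.findall returns each match as a tuple of the two captured parts, which you can then convert to a Python dictionary to extract the desired information.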
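
As an alternative to the regex, the content of a __NEXT_DATA__ script tag is normally plain JSON, so it may be simpler to load it with json and walk the result. Below is a minimal sketch of that idea, not a tested solution for your page. It assumes the tag's text is available as a string (for example via test2.string instead of .getText(), which may come back empty for script tags in some BeautifulSoup versions). The sample data in raw and the helper name find_key are hypothetical and only there for illustration.

```python
import json

# Hypothetical stand-in for the text of the __NEXT_DATA__ script tag.
# In the question's code this would come from something like test2.string
# (an assumption; the real page structure is not shown in the question).
raw = '''
{"props": {"pageProps": {"meetings": [
  {"id": 1, "meeting_summary_reference": {"ref": "A-100", "extra": {"n": 1}}},
  {"id": 2, "meeting_summary_reference": {"ref": "B-200"}}
]}}}
'''


def find_key(node, key):
    """Yield every value stored under `key`, at any nesting depth."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending so nested occurrences are found as well.
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


data = json.loads(raw)  # the script text becomes ordinary dicts and lists
matches = list(find_key(data, "meeting_summary_reference"))

for match in matches:
    print(match)
```

Each entry in matches is already a Python dictionary, so no conversion step is needed afterwards. If json.loads raises an error on the real tag text, the text is probably not pure JSON and the regex approach above may be the better fit.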