Update dataframe with new data

I’m scraping data and need to save it after each request so that I don’t lose what I have already collected. My code is similar to this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
from random import randrange

def crawl(df):
    chrome_options = webdriver.ChromeOptions()

    my_list1=[]
    my_list2=[]
    
    # Server info
    
    query=df['Source'].unique().tolist() 
    driver=webdriver.Chrome('path',chrome_options=chrome_options) 
    driver.maximize_window()

    for x in query:
            
        response=driver.get('link_to_scrape/'+x)
        try:
        
            wait = WebDriverWait(driver, 30)
            time.sleep(randrange(5))
            driver.execute_script("window.scrollTo(0, 1000)")
            
            # Get data to append in my_list1
            my1 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Trustscore']/../following-sibling::div/descendant::div[@class='icon']"))).text
            my_list1.append(my1)


            # Get data to append in my_list2
            try:
                my2 = wait.until(EC.visibility_of_element_located((By.XPATH, "//div[text()='Company data']/../following-sibling::div/descendant::b[text()='Alexa rank']/../following-sibling::div"))).text
                my_list2.append(my2)
            except: 
                my_list2.append("Data not available")
                 
        except: 
            print("\n!!! ERROR !!!")
            break
                            
    # Create dataframe
    dict = {'Source': query, 'List 1': my_list1, 'List 2': my_list2} 
    df=pd.DataFrame.from_dict(dict)

    driver.quit()


    return df

The code currently has a weakness that I’d like to fix by saving the data for each element in the query before closing the session. Let’s say that I have 5 elements in df['Source']: x1, x2, x3, x4, x5.

When I run my code, the data for x1 is collected, but when the code runs on x2 I get the error ValueError: arrays must all be the same length, and the process stops. I’d like to fix this as follows:

  • for each unique element in df['Source'], open chrome, extract data, save data into a df, then close the chrome window;
  • wait for 15 seconds before submitting a new request;
  • submit a new request: open chrome for the second element in df['Source'], extract data, save data in the same df used previously (for element x1), close chrome.
  • and so on, until all the elements are in the new df.

To keep the extracted data, I need the df to be updated at each step rather than at the end, i.e., after crawl has extracted data for every item in the list. My code does not do that: it creates the df at the end, so every time I get an error I lose my work. In the end, I should have a data frame with 5 rows (excluding the header), containing the extracted data (or an error message, if the exception is raised). Can you help me understand the right way to open/close Chrome and save/update the data frame with new data at each iteration? If you need more info, let me know.

Answer

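Create a dictionary of lists before the for loop, append one value to every list on each iteration, and build the data frame from it. The ValueError comes from passing the full query as 'Source' while my_list1 and my_list2 only hold the values collected before the break, so the columns end up with different lengths.

frame_dict = {'Source': [], 'List 1': [], 'List 2': []}
for x in query:
    driver.get('link_to_scrape/' + x)
    ...
    # scraping code here: sets my1 and my2 for this x,
    # using "Data not available" when an element is missing
    ...
    # One append per column keeps all lists the same length
    frame_dict['Source'].append(x)
    frame_dict['List 1'].append(my1)
    frame_dict['List 2'].append(my2)

Then convert frame_dict to a data frame:

df = pd.DataFrame.from_dict(frame_dict)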
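
The answer above builds the data frame once the loop has finished. To keep progress after every request, as the question asks, one option is to append each row to a CSV file as soon as it is scraped. The following is only a minimal sketch under stated assumptions: `crawl_incrementally`, `scrape_one`, `csv_path` and the file name `progress.csv` are hypothetical names, and `scrape_one` stands in for the Selenium part (open Chrome, read the two elements, quit the driver). A fake scraper is used here so the example runs without a browser.

```python
import os
import tempfile
import time

import pandas as pd


def crawl_incrementally(sources, scrape_one, csv_path, delay_seconds=15):
    """Scrape each source and append its row to csv_path immediately.

    scrape_one is a hypothetical callable that takes a source string and
    returns a (value_1, value_2) tuple; it may raise on failure.
    """
    rows = []
    for i, source in enumerate(sources):
        try:
            value_1, value_2 = scrape_one(source)
        except Exception as exc:
            # Record the failure instead of breaking, so every source gets a row.
            value_1 = value_2 = f"ERROR: {exc}"

        row = {"Source": source, "List 1": value_1, "List 2": value_2}
        rows.append(row)

        # Append this single row to disk; write the header only for a new file.
        pd.DataFrame([row]).to_csv(
            csv_path, mode="a", index=False, header=not os.path.exists(csv_path)
        )

        # Pause between requests, but not after the last one.
        if i < len(sources) - 1:
            time.sleep(delay_seconds)

    return pd.DataFrame(rows, columns=["Source", "List 1", "List 2"])


def fake_scrape(source):
    """Stand-in for the Selenium code; fails on purpose for 'x2'."""
    if source == "x2":
        raise RuntimeError("element not found")
    return f"score-{source}", f"rank-{source}"


csv_path = os.path.join(tempfile.mkdtemp(), "progress.csv")
df = crawl_incrementally(["x1", "x2", "x3"], fake_scrape, csv_path, delay_seconds=0)
print(df)
```

Because each row is written as soon as it is available, a crash on a later element leaves the earlier rows in the CSV file, and the returned data frame has one row per source, including the ones that failed.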