I am using chrome-driver to scrape data from some pages and then run some additional tasks with that information (for example, typing comments on some pages).
My program has a button. Every time it's pressed it calls `thread_(self)` (below), starting a new thread. The target function `self.main` has the code to run all the selenium work on a browser:

```python
def thread_(self):
    th = threading.Thread(target=self.main)
    th.start()
```
My problem is that after the user presses the button the first time, the `th` thread opens browser A and starts doing its work. While browser A is busy, the user presses the button again, opening browser B, which runs the same `self.main`. I want each opened browser to run simultaneously, but when I start the second thread, the first browser stops and only the second browser runs.
I know my code can create threads infinitely, and I know that this will affect PC performance, but I am OK with that. I want to speed up the work done by selenium.

**Threads for selenium speed up**
Consider the following functions to exemplify how threads with selenium give some speed-up compared to a single-driver approach. The code below scrapes the HTML title from a page opened by selenium, using BeautifulSoup. The list of pages is:
```python
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import threading

def create_driver():
    """returns a new chrome webdriver"""
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")  # make it not visible, just comment if you like seeing opened browsers
    return webdriver.Chrome(options=chromeOptions)

def get_title(url, webdriver=None):
    """get the url html title using BeautifulSoup
    if driver is None uses a new chrome-driver and quit() after
    otherwise uses the driver provided and don't quit() after
    """
    def print_title(driver):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        item = soup.find('title')
        print(item.string.strip())

    if webdriver:
        print_title(webdriver)
    else:
        webdriver = create_driver()
        print_title(webdriver)
        webdriver.quit()

links = ["https://www.amazon.com", "https://www.google.com",
         "https://www.youtube.com/", "https://www.facebook.com/",
         "https://www.wikipedia.org/", "https://us.yahoo.com/?p=us",
         "https://www.instagram.com/", "https://www.globo.com/",
         "https://outlook.live.com/owa/"]
```
**Single driver approach**

Calling `get_title` on each link with a single chrome driver, passing all links sequentially. Takes 22.3 s on my machine (note: Windows).
```python
start_time = time.time()
driver = create_driver()

for link in links:  # could be 'like' clicks
    get_title(link, driver)

driver.quit()
print("sequential took ", (time.time() - start_time), " seconds")
```
**Multiple threads approach**

Using a thread for each link. Results in 10.5 s, more than 2x faster.
```python
start_time = time.time()
threads = []

for link in links:  # each thread could be like a new 'click'
    th = threading.Thread(target=get_title, args=(link,))
    th.start()  # could `time.sleep` between 'clicks' to see what's up without the headless option
    threads.append(th)

for th in threads:
    th.join()  # Main thread waits for the threads to finish

print("multiple threads took ", (time.time() - start_time), " seconds")
```
Here and here are some other working examples. The second uses a fixed number of threads on a ThreadPool, and suggests that keeping the chrome-driver instance initialized on each thread is faster than creating and starting it every time.
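That per-thread-driver idea can be sketched with `concurrent.futures.ThreadPoolExecutor` plus `threading.local`. Here `FakeDriver` is a hypothetical stand-in for `create_driver()` so the pattern runs without a browser; in the real code you would store the result of `create_driver()` instead:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeDriver:
    """Hypothetical stand-in for a chrome-driver (no browser needed)."""
    def title_of(self, url):
        return "title of " + url

tls = threading.local()  # holds one driver per worker thread

def get_title(url):
    # Create the (expensive) driver only on this thread's first task,
    # then reuse it for every later link handled by the same thread.
    if not hasattr(tls, "driver"):
        tls.driver = FakeDriver()  # real code: tls.driver = create_driver()
    return tls.driver.title_of(url)

links = ["https://www.google.com", "https://www.wikipedia.org/",
         "https://www.youtube.com/", "https://www.facebook.com/"]

with ThreadPoolExecutor(max_workers=2) as pool:  # fixed number of threads
    titles = list(pool.map(get_title, links))

print(titles)
```

With `max_workers=2`, at most two drivers are ever created, no matter how many links are processed.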
Still, I'm not sure this is the optimal approach for selenium to get considerable speed-ups, since threading on non-IO-bound code ends up executing sequentially (one thread after another). Due to the Python GIL (Global Interpreter Lock), a Python process cannot run threads in parallel (utilize multiple CPU cores).
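The IO-bound part is what saves threading here: while a thread waits on the network, the GIL is released and the other threads run. A minimal sketch, using `time.sleep` as a stand-in for waiting on a page load:

```python
import time
import threading

def io_task():
    time.sleep(0.2)  # stand-in for a network wait; sleeping releases the GIL

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()

elapsed = time.time() - start
print("4 overlapping waits took %.2f s" % elapsed)  # ~0.2 s, not 4 * 0.2 = 0.8 s
```

A CPU-bound body (say, a tight arithmetic loop) in the same four threads would show no such overlap, because only one thread can hold the GIL at a time.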
**Processes for selenium speed up**
To try to overcome the Python GIL limitation, I wrote the following code using the `multiprocessing` package's `Process` class and ran multiple tests. I even added random page hyperlink clicks to the `get_title` function above. Additional code is here.
```python
import multiprocessing

start_time = time.time()
processes = []

for link in links:  # each process a new 'click'
    ps = multiprocessing.Process(target=get_title, args=(link,))
    ps.start()  # could sleep 1 between 'clicks' with `time.sleep(1)`
    processes.append(ps)

for ps in processes:
    ps.join()  # Main waits for the processes to finish

return (time.time() - start_time)  # this snippet runs inside a timing function
```
Contrary to what I expected, Python `multiprocessing.Process`-based parallelism for selenium was on average around 8% slower than `threading.Thread`. But obviously both were on average more than twice as fast as the sequential approach. Apparently selenium chrome-driver calls do release the Python GIL, indeed making the threads run in parallel.
**Threading a good start for selenium speed up**
This is not a definitive answer, as my tests were only a tiny example. Also, I am using Windows, where `multiprocessing` has many limitations: each new `Process` is not a fork like on Linux, meaning, among other downsides, that a lot of memory is wasted. Taking all that into account, it seems that depending on the use case, threads may be as good as or better than the heavier approach of processes (especially for Windows users).