Why does my multiprocessing job in Python take longer than a single process?

I need to run a piece of code 100 times, and each run takes about 3 hours, for a total of 300 hours of computation. The code is expensive: it computes Laplacians, extracts contour points, and writes out plots. I also have access to a computing cluster that gives me 100 cores at once. So I wondered whether I could run my 100 jobs on 100 individual cores simultaneously, with no communication between cores, and bring the total wall-clock time down to about 3 hours.

My online reading took me to multiprocessing.Pool, where I ended up using Pool.apply_async(). Quickly, I realized that the average completion time per task was way over 3 hours, the completion time of one task prior to parallelizing. Further research revealed it might be an overhead issue one has to deal with when parallelizing, but I don’t think that can account for an 8-fold increase in computation time.

I was able to reproduce the issue with a simple, shareable piece of code:

import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool
import time

def f(x):
   '''Junk to just waste computing power!'''
   arr = np.ones(shape=(10000,10000))*x
   val1 = arr*arr
   val2 = np.sqrt(val1)
   val3 = np.roll(arr,-1) - np.roll(arr,1)
   plt.subplot(121)
   plt.imshow(arr)
   plt.subplot(122)
   plt.imshow(val3)
   plt.savefig('f.png')
   plt.close()
   print('Done')

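if __name__ == '__main__':
   # The guard keeps worker processes from re-running this block
   # on platforms that spawn new interpreters instead of forking.
   N = 1  # number of worker processes, and also the number of tasks
   pool = Pool(processes=N)
   t0 = time.time()
   for i in range(N):
      pool.apply_async(f, [i])
   pool.close()
   pool.join()

   print('Time taken = ', time.time()-t0)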

The machine I’m on has 8 cores, and if I’ve understood my online reading correctly, setting N=1 should force Python to run the job exclusively on one core. Looking at the completion times, I get:

N=1 || Time taken = 7.964005947113037

N=7 || Time taken = 40.3266499042511

Put simply, I don’t understand why the times aren’t the same. As for my original code, the time difference between single-task and parallelized-task is about 8-fold.

[Question 1] Is this all a case of overhead (whatever it entails!)?

[Question 2] Is it at all possible to achieve a case where I can get the same computation time as that of a single-task for an N-task job running independently on N cores?

For instance, in SLURM you could do sbatch --array=0-N myCode.slurm, where myCode.slurm would be a single-task Python code, and this would do exactly what I want. But, I need to achieve this same outcome from within Python itself.

Thank you in advance for your help and input!

Answer

I suspect that (perhaps due to memory overhead) your program is simply not CPU-bound when running multiple copies, so process parallelism does not speed it up as much as you might expect.
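To make the memory suspicion concrete, here is a rough back-of-the-envelope sketch of how much memory the arrays in f(x) need. The helper names array_bytes and estimate_peak_bytes are made up for illustration. The count of four live arrays (arr, val1, val2, val3) is an assumption that ignores the temporaries created by np.roll and by matplotlib, so the real peak is likely higher.

```python
import math

FLOAT64_BYTES = 8  # np.ones() defaults to float64, i.e. 8 bytes per element


def array_bytes(shape, itemsize=FLOAT64_BYTES):
    """Bytes needed for one dense array of the given shape."""
    return math.prod(shape) * itemsize


def estimate_peak_bytes(shape, live_arrays=4, n_procs=1):
    """Rough lower bound on memory held by n_procs copies of f(x).

    live_arrays=4 is an assumption (arr, val1, val2, val3); temporaries
    from np.roll and the plotting calls are not counted.
    """
    return array_bytes(shape) * live_arrays * n_procs


if __name__ == "__main__":
    shape = (10000, 10000)
    print(f"one array: {array_bytes(shape) / 1e9:.1f} GB")
    for n in (1, 7, 8):
        total = estimate_peak_bytes(shape, n_procs=n)
        print(f"{n} process(es): at least {total / 1e9:.1f} GB")
```

Under these assumptions, 7 or 8 workers need more than 20 GB for the arrays alone, so memory capacity or bandwidth, rather than CPU, plausibly becomes the bottleneck.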

An easy way to test this is just to have something else do the parallelism and see if anything is improved.

With N = 8 it took 31.2 seconds on my machine, and with N = 1 it took 7.2 seconds. I then fired off 8 copies of the N = 1 version at once from the shell:

$ for i in $(seq 8) ; do python paralleltest &  done
...
Time taken =  32.07925891876221
Done
Time taken =  33.45247411727905
Done
Done
Done
Done
Done
Done
Time taken =  34.14084982872009
Time taken =  34.21410894393921
Time taken =  34.44455814361572
Time taken =  34.49029612541199
Time taken =  34.502259969711304
Time taken =  34.516881227493286

I also tried this with all of the multiprocessing code removed, leaving just a single call to f(0), and the results were quite similar. So Python's multiprocessing is doing what it can, but this job has constraints beyond just CPU.
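One way to see where the time goes from inside Python is to time each task inside the worker and compare that with the total wall-clock time. This is a minimal sketch: timed_call, timed_busy_work, and busy_work are hypothetical helpers, and busy_work is a small stand-in for f, not the original workload.

```python
import time
from multiprocessing import Pool


def busy_work(x):
    """Stand-in for f(x): a small CPU-bound loop (hypothetical workload)."""
    return sum(i * i for i in range(200_000)) + x


def timed_call(func, *args):
    """Run func(*args) and return (result, seconds elapsed)."""
    t0 = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - t0


def timed_busy_work(x):
    """Module-level wrapper so the pool can pickle it by name."""
    return timed_call(busy_work, x)


if __name__ == "__main__":
    n = 4  # number of workers and tasks; adjust to your machine
    t0 = time.perf_counter()
    with Pool(processes=n) as pool:
        results = pool.map(timed_busy_work, range(n))
    wall = time.perf_counter() - t0
    for x, (_, seconds) in zip(range(n), results):
        print(f"task {x}: {seconds:.3f} s inside the worker")
    print(f"wall-clock total: {wall:.3f} s")
```

If each task's own time grows as more workers are added, the tasks are slowing each other down (for example by competing for memory), which is a different problem from pool startup overhead.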

When I replaced the code with something less memory-intensive (math.factorial(1400000)), the parallelism looked a lot better: 14.0 seconds for 1 copy and 16.44 seconds for 8 copies.
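A CPU-bound comparison of this kind can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact script behind the numbers above: cpu_task and time_pool are illustrative names, and the factorial argument is deliberately smaller so the sketch runs quickly.

```python
import math
import time
from multiprocessing import Pool


def cpu_task(n):
    """CPU-bound work with a tiny result: the bit length of n!.

    Returning the bit length avoids sending a huge integer back
    to the parent process.
    """
    return math.factorial(n).bit_length()


def time_pool(n_workers, n):
    """Run one cpu_task per worker and return the wall-clock seconds."""
    t0 = time.perf_counter()
    with Pool(processes=n_workers) as pool:
        pool.map(cpu_task, [n] * n_workers)
    return time.perf_counter() - t0


if __name__ == "__main__":
    n = 100_000  # assumed size for a quick run; raise it for longer tasks
    for workers in (1, 8):  # 8 matches the core count in the question
        print(f"{workers} worker(s): {time_pool(workers, n):.2f} s")
```

When the work is CPU-bound and uses little memory, the two timings should stay close to each other as long as there are at least as many free cores as workers.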

SLURM can also make use of multiple hosts (depending on your configuration). The multiprocessing pool cannot, so this method is limited to the resources of a single host.