Multiprocessing vs Threading in Python

I am learning Multiprocessing and Threading in python to process and create large amount of files, the diagram is shown here diagram

Each of output file depends on the analysis of all input files.

Single processing of the program takes quite a long time, so I tried the following codes:

(a) multiprocessing

start = time.time()
process_count = cpu_count()
p = Pool(process_count)
for i in range(process_count):
    p.apply_async(my_read_process_and_write_func, args=(i,w))

p.close()
p.join()
end = time.time()

(b) threading

start = time.time()
thread_count = cpu_count()
thread_list = [] 

for i in range(0, thread_count):
    t = threading.Thread(target=my_read_process_and_write_func, args=(i,))
    thread_list.append(t)

for t in thread_list:
    t.start()

for t in thread_list:
    t.join()

end = time.time()

I am runing these codes using Python 3.6 on a Windows PC with 8 cores. However Multiprocessing method takes about the same time as the single-processing method, and Threading method takes about 75% of the single-processing method.

My questions are:

Are my codes correct?

Is there any better way/codes to improve the efficiency? Thanks!

Answer

Your processing is I/O bound, not CPU bound. As a result, the fact that you have multiple processes helps little. Each Python process in multiprocessing is stuck waiting for input or output while the CPU does nothing. Increasing the Pool size in multiprocessing should improve performance.