I’m trying to optimize the performance of a Python script that checks whether strings (class names) are contained in a file. The file contains about 100,000 lines, one class name per line.
```python
def readFile(fileName):
    fileObj = open(fileName, "r")  # opens the file in read mode
    words = []
    lines = fileObj.read().splitlines()
    # Using a for loop to check if a word should qualify as a class
    # (not included here for brevity)
    for line in lines:
        line = line.strip()
        words.extend(line.split())
    fileObj.close()
    return words

# Actual array contains about 1000 classes to check
words = ["class1", "class2", "class3"]

# Contains about 100,000 lines
classes = readFile('classes.txt')

for word in words:
    if word in classes:
        pass  # do something
```
When running the above code, it takes about 60 seconds to complete (including some string concatenation, which I don’t think is the bottleneck). Is there a faster way to search for strings in a file, or is this a no-go and only a database can do better?
Simply do this:
```python
classes = set(readFile('classes.txt'))
```
(Or better yet, just return a `set` from your function.)
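For instance, a minimal sketch of a reader that returns a set directly (the name `read_classes` is hypothetical; this assumes the same one-name-per-line file format as above):

```python
def read_classes(file_name):
    # Build a set so that later membership tests are O(1) on average.
    with open(file_name, "r") as file_obj:
        return {word for line in file_obj for word in line.split()}
```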
It should then no longer take 60 seconds; for 1000 words it should be practically instantaneous.
The problem is that you are checking:

```python
word in classes
```

Because `classes` is a list, each membership test is a linear-time operation, so the whole loop is quadratic (roughly 1000 × 100,000 comparisons in the worst case). When `classes` is a set, each lookup takes constant time on average.
Here is an example, using random strings:
```
In [1]: import string

In [2]: import random

In [3]: words = [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(1000)]

In [4]: classes = [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(100_000)]

In [5]: %%timeit
   ...: for word in words:
   ...:     word in classes
   ...:
1.22 s ± 6.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
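For comparison, you can convert `classes` to a set once and rerun the same loop (timing output not reproduced here); since each of the 1000 lookups becomes a constant-time hash probe, the total should drop from seconds to roughly microseconds:

```
In [6]: classes_set = set(classes)

In [7]: %%timeit
   ...: for word in words:
   ...:     word in classes_set
   ...:
```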
You also vaguely mentioned something about concatenating strings. You may be introducing quadratic-time behavior there as well if you do it the wrong way, but you haven’t shown that code.
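For reference, since your concatenation code isn’t shown, here is a hypothetical sketch of the pattern that typically causes the slowdown and the usual linear-time alternative:

```python
# Quadratic in the worst case: each += can copy everything built so far.
result = ""
for word in words:
    result += word + "\n"

# Linear: collect the pieces and join them in a single pass.
result = "".join(word + "\n" for word in words)
```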