Python improve performance when searching for a string in a file

I’m trying to optimize the performance of a Python script which searches if a string (a class name) is contained in a file. The file contains about 100.000 lines, each line with a class name.

def readFile(fileName):
    fileObj = open(fileName, "r")  # opens the file in read mode
    words = []
    lines = fileObj.read().splitlines()
    # Using a for loop to check if word should be qualified as a class
    # (Not included here for brevity)
    for line in lines:        
        line = line.strip()
        words.extend(line.split())

    fileObj.close()
    return words

#Actual array contains about 1000 classes to check
words[] = ["class1","class2","class3"]

# Contains about 100.000 lines
classes = readFile('classes.txt')
for word in words:
   if (word in classes):
   # do something

When running the above code, it takes about 60 seconds to complete (including some string concatenation which should not be the bottleneck, I think). Is there a faster way to search for a string in a file or simply it’s a no-go and only a DB can do better?

Answer

Simply do this:

classes = set(readFile('classes.txt'))

(Or better yet, just return a setfrom your function).

It should then no longer take 60 seconds, it should be practically instantaneous for 1000 words.

The problem is that you are checking:

word in classes

But since classes is a list, that is a linear time operation, so you ultimately have polynomial time. When classes is a set, that will be constant time.

Here is an example, using random strings:

In [1]: import string

In [2]: import random

In [3]: words = [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(1000)]

In [4]: classes = [''.join(random.sample(string.ascii_lowercase, 10)) for _ in range(100_000)]

In [5]: %%timeit
   ...: for word in words:
   ...:     word in classes
   ...:
1.22 s ± 6.48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You also vaguely mentioned something about concatenating strings. You may also be introducing polynomial time behavior if you do that wrong. But you haven’t shown that.