RAM overload – itertools

Here’s a quick example of I’m trying to do and the error I’m getting:

    for symbol in itertools.product(list_a, repeat=8):
        list_b.append(symbol)

I’m also afterwards excluding combinations from that list like so:

    for combination in list_b:
        valid_b = True
        for symbols in range(len(list_exclude)):
            if list_exclude[symbols] in combination:
                valid_b = False
            else:
                pass
        if valid_b:
            new_list.append(combination)

I’ve heard somehow chunking the process might help, not sure how that could be done here though.

I’m using multiprocessing for this as well.

When I run it I get “MemoryError”

How would you go about it?

Answer

Don’t pre-compute anything, especially not the first full list:

def symbols(lst, exclude):
    for symbol in map(''.join, itertools.product(lst, repeat=8)):
        if any(map(symbol.__contains__, exclude)):
            continue
        yield symbol

Now use the generator as you need to lazily evaluate the elements. Keep in mind that since it’s pre-filtering the data, even list(symbols(list_a, list_exclude)) will he much cheaper than what you originally wrote.

Here is a breakdown of what happens:

  • itertools.product is a generator. That means that it produces an output without retaining a reference to any previous items. Each element it returns is a tuple containing some combination of the input elements.

  • Since you want to compare strings, you need to convert the tuples. Hence, ''.join. Mapping it onto each of the tuples that itertools.product produces converts those elements into strings. For example:

    >>> ''.join(('$', '$', '&', '&', '♀', '@', '%', '$'))
    '$$&&♀@%$'
    
  • Filtering each symbol thus created can be done by checking if any of the items in excludes are contained in it. You can do this with something like

    [ex in symbol for ex in exclude]
    

    The operation ... in symbol is implemented via the magic method symbol.__contains__. You can therefore map that method to every element of exclude.

  • Since the first element of exclude that is contained in symbol invalidates it, you don’t need to check the remainder. This is called short-circuiting, and is implemented in the any function. Notice that because map is a generator, the remaining elements will actually not be computed once a match is found. This is different from using a list comprehension, which pre-computed all the elements.

  • Putting yield into your function turns it into a generator function. That means that when you call symbols(...), it returns a generator object that you can iterate over. This object does not pre-compute anything until you call next on it. So if you write the data to a file (for example), only the current result will be in memory at once. It may take a long time to write out a large number of results but your memory usage should not spike at all from it.