I am writing a function that takes a large list of strings and classifies them.
Each string classified is deleted from the list one after another.
Sometimes the function needs to delete a string in an arbitrary index and sometimes it needs to delete a string from the leftmost position.
The ratio between these two cases is about 2/3 of the times it deletes from the leftmost position and 1/3 of the time from an arbitrary place in the middle.
I know using list as the data structure to hold the strings is O(n) deleting the leftmost cell and O(n-i) deleting the i’th cell, since it moves all the cells to the right one step left for each deletion.
I then tried using
collections.deque instead of list, but it actually took more time (about 4/3 more time) than the original
list data structure, possibly because of the arbitrary place deletions.
What data structure will perform best under these assumptions? Is there any better than just using a
The deletions with the
list data structure from leftmost cell with is using
list.pop(0) and removing from arbitrary index is by value using
OrderedDict with your strings as keys.
Benchmark with just 30,000 strings, 20,000 deletions from left and 10,000 “arbitrary” removals:
1476 ms 1490 ms 1491 ms using_list 10 ms 11 ms 11 ms using_OrderedDict
This assumes you have no duplicate strings, which seems likely. If you do, then use the dictionary values as the frequencies.
Benchmark code (Try it online!):
from timeit import timeit from collections import OrderedDict from random import randrange def using_list(strings, removes): for string in removes: strings.pop(0) strings.pop(0) strings.remove(string) def using_OrderedDict(strings, removes): strings = OrderedDict.fromkeys(strings) for string in removes: strings.popitem(False) strings.popitem(False) strings.pop(string) # Build testcase with 30,000 elements strings = list(map(str, range(10_000, 40_000))) copy = strings.copy() removes =  for _ in range(10_000): copy.pop(0) copy.pop(0) i = randrange(len(copy)) removes.append(copy.pop(i)) # Benchmark for _ in range(3): for func in using_list, using_OrderedDict: times =  for _ in range(3): copy = strings.copy() t = timeit(lambda: func(copy, removes), number=1) times.append(t) times.sort() print(*('%4d ms ' % (t * 1e3) for t in times), func.__name__) print()