Most suitable data structure to support delete from left and delete from arbitrary index where delete from left is twice as much frequent

I am writing a function that takes a large list of strings and classifies them.

Each string classified is deleted from the list one after another.

Sometimes the function needs to delete a string in an arbitrary index and sometimes it needs to delete a string from the leftmost position.

The ratio between these two cases is about 2/3 of the times it deletes from the leftmost position and 1/3 of the time from an arbitrary place in the middle.

I know using list as the data structure to hold the strings is O(n) deleting the leftmost cell and O(n-i) deleting the i’th cell, since it moves all the cells to the right one step left for each deletion.

I then tried using collections.deque instead of list, but it actually took more time (about 4/3 more time) than the original list data structure, possibly because of the arbitrary place deletions.

What data structure will perform best under these assumptions? Is there any better than just using a list?

The deletions with the list data structure from leftmost cell with is using list.pop(0) and removing from arbitrary index is by value using list.remove(value).

Answer

Try an OrderedDict with your strings as keys.

Benchmark with just 30,000 strings, 20,000 deletions from left and 10,000 “arbitrary” removals:

1476 ms  1490 ms  1491 ms  using_list
  10 ms    11 ms    11 ms  using_OrderedDict

Instead of

strings_list.pop(0)
strings_list.remove(value)

use

strings_dict.popitem(False)
strings_dict.pop(value)

This assumes you have no duplicate strings, which seems likely. If you do, then use the dictionary values as the frequencies.

Benchmark code (Try it online!):

from timeit import timeit
from collections import OrderedDict
from random import randrange

def using_list(strings, removes):
    for string in removes:
        strings.pop(0)
        strings.pop(0)
        strings.remove(string)

def using_OrderedDict(strings, removes):
    strings = OrderedDict.fromkeys(strings)
    for string in removes:
        strings.popitem(False)
        strings.popitem(False)
        strings.pop(string)

# Build testcase with 30,000 elements
strings = list(map(str, range(10_000, 40_000)))
copy = strings.copy()
removes = []
for _ in range(10_000):
    copy.pop(0)
    copy.pop(0)
    i = randrange(len(copy))
    removes.append(copy.pop(i))

# Benchmark
for _ in range(3):
    for func in using_list, using_OrderedDict:
        times = []
        for _ in range(3):
            copy = strings.copy()
            t = timeit(lambda: func(copy, removes), number=1)
            times.append(t)
        times.sort()
        print(*('%4d ms ' % (t * 1e3) for t in times), func.__name__)
    print()