For loop comparing items in a list with all items in previous rows in pandas

I have a pandas DataFrame, sorted by date, where each row holds a list of strings. For each list, I want to compare each of its strings with all strings in the lists from previous rows. If a string is found in a previous row and that row satisfies the condition df['label'] == 1, I add 1 to the count and move on to the next string.

Right now this takes three nested for loops, which is too slow for a DataFrame with 18k rows. Can someone help me speed up this function?

import numpy as np
import pandas as pd


# count how many ngrams in a row were present in previous rows where the condition is met
def count_previous(df, ngram_col):
    out = np.empty(len(df[ngram_col]))
    # loop through every row
    for i in range(len(df[ngram_col])):
        count = 0
        # loop through every ngram in the list of strings in the current row
        current_ng_list = df[ngram_col][i]
        for ng in current_ng_list:
            # loop through all previous rows
            for j in range(i):
                # if the ngram appears in an earlier row with label 1,
                # count it once and move on to the next ngram
                if ng in df[ngram_col][j] and df['label'][j] == 1:
                    count += 1
                    break
        out[i] = count
    return out


data1 = {
    'Date': ['2019-07-01', '2019-07-01', '2019-07-03', '2019-09-03', '2019-08-02',
             '2019-08-02', '2019-09-17', '2019-08-02', '2019-10-01'],
    'ngram_list': [['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
                   ['birds are awesome'], ['birds are awesome'], ['birds are awesome'],
                   ['dog cat', 'birds are awesome', 'this is a test'], ['ena dio'],
                   ['ena dio', 'this is a test']],
    'label': [1, 1, 0, 1, 1, 0, 1, 1, 0],
}
df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
df1['counts'] = count_previous(df1, 'ngram_list')


Expected output: 

         Date                                          ngram_list  label  counts
0  2019-07-01                       ['ena dio', 'this is a test']      1     0.0
1  2019-07-01                                    ['this is test']      1     0.0
2  2019-07-03                                         ['dog cat']      0     0.0
3  2019-08-02                               ['birds are awesome']      1     0.0
4  2019-08-02                               ['birds are awesome']      0     1.0
5  2019-08-02                                         ['ena dio']      1     1.0
6  2019-09-03                               ['birds are awesome']      1     1.0
7  2019-09-17  ['dog cat', 'birds are awesome', 'this is a test']      1     2.0
8  2019-10-01                       ['ena dio', 'this is a test']      0     2.0

Answer

I managed to write it (almost) without any for loops. It will take some more memory, though, since you’d need to create additional columns.

The idea is to create a column that will keep all the ngrams that we’ve already seen and that had the label of 1. We will keep those in a set, so that we are sure that we do not waste any memory/time on duplicates.

def func(x):
    ngrams = x['ngram_list']
    already_seen = x['already_seen']
    seen_sum = sum([ngram in already_seen for ngram in ngrams])
    return seen_sum


df1 = pd.DataFrame(data1)
df1 = df1.sort_values('Date', ascending=True).reset_index(drop=True)
# if the label is 0, we don't really care about these ngrams, so we can drop them and fill with previously-seen ones,
# so that we have the continuity of lists in the column. It will come in handy later.
df1['addable'] = (
    df1[['ngram_list']]
        .where(df1['label'] == 1)
        .ffill()
)

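# next, we want to get the info about all the previously-seen ngrams. To do so, we can just use `cumsum`
# (since adding lists concatenates them) and turn them into sets.
df1['already_seen'] = (
    df1['addable']
        .shift()
        .dropna()
        .cumsum()
        .apply(lambda v: set(v))
)
# rows with nothing before them (at least the first row) have no set yet. Give them an empty set
# instead of dropping them, so that every row still gets a count.
df1['already_seen'] = df1['already_seen'].apply(lambda v: v if isinstance(v, set) else set())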

# only thing left to do is to sum all the previously-seen ngrams for every row.
df1['counts'] = df1.apply(func, axis=1)

Edit: if this is still too slow, you can probably get it down to a single .apply, which should help.
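
For comparison, here is a minimal sketch of a single-pass variant of the same idea: keep one running set of the ngrams seen so far in label-1 rows, and check each row against it before updating it. This is not the code above, just an alternative under the same assumptions (rows already sorted by date, counts taken against strictly earlier rows). The function name `count_previous_single_pass` is made up for illustration, and it takes plain iterables so it does not need the helper columns.

```python
def count_previous_single_pass(ngram_lists, labels):
    """For each row, count its ngrams that occurred in an earlier row with label 1."""
    seen = set()   # ngrams from earlier rows whose label was 1
    counts = []
    for ngrams, label in zip(ngram_lists, labels):
        # look up first, update afterwards, so a row never matches itself
        counts.append(sum(ng in seen for ng in ngrams))
        if label == 1:
            seen.update(ngrams)
    return counts


# the question's rows, written out in date order
ngram_lists = [
    ['ena dio', 'this is a test'], ['this is test'], ['dog cat'],
    ['birds are awesome'], ['birds are awesome'], ['ena dio'],
    ['birds are awesome'],
    ['dog cat', 'birds are awesome', 'this is a test'],
    ['ena dio', 'this is a test'],
]
labels = [1, 1, 0, 1, 0, 1, 1, 1, 0]

counts = count_previous_single_pass(ngram_lists, labels)
print(counts)  # [0, 0, 0, 0, 1, 1, 1, 2, 2]
```

With the sorted frame from the question, this would be called as `df1['counts'] = count_previous_single_pass(df1['ngram_list'], df1['label'])`. It returns integers rather than floats, and it only ever holds one set in memory instead of one set per row.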