Pandas: transform two columns of lists into a dictionary column with repeated keys

I have a pandas dataframe called self.data. It has two columns, name and value, each holding a list, and I want to generate a new column containing a dictionary that maps each name to the list of all its values. For example:

Name          Value         New Dict Column
[a, b, c, a]  [1, 2, 3, 4]  {a: [1, 4], b: [2], c: [3]}
[b, b, a]     [1, 2, 3]     {b: [1, 2], a: [3]}

At the moment I have the following code:

self.data['dict'] = self.data[['name', 'value']].apply(lambda x: dict(zip(*x)), axis=1)

The problem with this attempt is that when a name is repeated, its earlier value is overwritten. In the first example row, I can't keep both values for a (1 and 4); the final dictionary only stores the last one.
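
A minimal sketch of the overwrite, using plain lists taken from the first example row (the variable names are only for illustration):

```python
names = ["a", "b", "c", "a"]
values = [1, 2, 3, 4]

# dict() keeps only the last value seen for a repeated key,
# so the pair ("a", 1) is replaced by ("a", 4)
overwritten = dict(zip(names, values))
print(overwritten)  # {'a': 4, 'b': 2, 'c': 3}
```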

Thank you in advance!

Answer

Use a custom function with defaultdict if performance is important:

from collections import defaultdict

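def f(x):
    # x is one row: a pair (list of names, list of values)
    d = defaultdict(list)
    for y, z in zip(*x):
        # append instead of overwrite, so repeated keys keep every value
        d[y].append(z)
    return d

# iterate over the rows as a NumPy array, which avoids the overhead of DataFrame.apply
df['New Dict Column'] = [f(x) for x in df[['column1','column2']].to_numpy()]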
print(df)
        column1       column2                    New Dict Column
0  [a, b, c, a]  [1, 2, 3, 4]  {'a': [1, 4], 'b': [2], 'c': [3]}
1     [b, b, a]     [1, 2, 3]            {'b': [1, 2], 'a': [3]}
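
For reference, a self-contained sketch that rebuilds the two example rows and applies the same idea. The column names column1 and column2 follow the snippet above; the helper name group_values and the conversion to a plain dict at the end are assumptions added here, not part of the original code:

```python
from collections import defaultdict

import pandas as pd

# The two example rows from the question
df = pd.DataFrame({
    "column1": [["a", "b", "c", "a"], ["b", "b", "a"]],
    "column2": [[1, 2, 3, 4], [1, 2, 3]],
})

def group_values(names, values):
    """Map each name to the list of all values paired with it (hypothetical helper)."""
    grouped = defaultdict(list)
    for name, value in zip(names, values):
        grouped[name].append(value)
    # Return a plain dict so a missing key raises KeyError later on
    return dict(grouped)

# Walk both columns in parallel, one row at a time
df["New Dict Column"] = [
    group_values(names, values)
    for names, values in zip(df["column1"], df["column2"])
]
```

Returning a plain dict is optional; it only prevents later lookups of a missing key from silently creating an empty list.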

Performance is good: in a test with 20k rows it is about 10 times faster than a dict comprehension with apply(axis=1):

# 20k rows for the test
df = pd.concat([df] * 10000, ignore_index=True)


In [211]: %timeit df.apply(lambda data: {k: [y for x, y in zip(data[0], data[1]) if x == k] for k in data[0]}, axis=1)
532 ms ± 2.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [212]: %timeit  [ f(x) for x in df[['column1','column2']].to_numpy()]
53.8 ms ± 596 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
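
One likely reason for the gap, besides the per-row overhead of apply: the comprehension rescans every (name, value) pair once per entry in the names list, while the defaultdict version makes a single pass over each row. A small sketch, using plain lists from the first example row, that checks both approaches produce the same mapping:

```python
from collections import defaultdict

names = ["a", "b", "c", "a"]
values = [1, 2, 3, 4]

# Comprehension from the slower timing: for each name (duplicates included),
# it loops over all pairs again, so the work grows quadratically with row length
by_comprehension = {
    k: [v for n, v in zip(names, values) if n == k]
    for k in names
}

# Single pass: each pair is visited exactly once
by_defaultdict = defaultdict(list)
for n, v in zip(names, values):
    by_defaultdict[n].append(v)

print(by_comprehension == dict(by_defaultdict))  # True
```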