Let’s assume we have the following dataframe that includes customer orders (order_id) and the products that the individual order contained (product_id):
import pandas as pd df = pd.DataFrame({'order_id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], 'product_id' : [365, 48750, 3333, 9877, 48750, 32001, 3333, 3333, 365, 11202, 365]}) print(df) order_id product_id 0 1 365 1 1 48750 2 1 3333 3 1 9877 4 2 48750 5 2 32001 6 2 3333 7 3 3333 8 3 365 9 3 11202 10 3 365
It would be interesting to know how often product pairs appeared together in the same basket.
How does one create a co-occurence matrix in python that looks like this:
365 48750 3333 9877 32001 11202 365 1 1 2 1 0 1 48750 1 0 2 1 1 0 3333 2 2 0 1 1 1 9877 1 1 1 0 0 0 32001 0 1 1 0 0 0 11202 1 0 1 0 0 0
I would be very thankful for your help!
Answer
We start by grouping the df by order_id, and within each group calculate all possible pairs. Note we sort first by product_id so the same pairs in different groups are always in the same order
import itertools all_pairs = [] for _, group in df.sort_values('product_id').groupby('order_id'): all_pairs += list(itertools.combinations(group['product_id'],2)) all_pairs
we get a list of all pairs from all orders
[('3333', '365'), ('3333', '48750'), ('3333', '9877'), ('365', '48750'), ('365', '9877'), ('48750', '9877'), ('32001', '3333'), ('32001', '48750'), ('3333', '48750'), ('11202', '3333'), ('11202', '365'), ('11202', '365'), ('3333', '365'), ('3333', '365'), ('365', '365')]
Now we count duplicates
from collections import Counter count_dict = dict(Counter(all_pairs)) count_dict
so we get the count of each pair, basically what you are after
{('3333', '365'): 3, ('3333', '48750'): 2, ('3333', '9877'): 1, ('365', '48750'): 1, ('365', '9877'): 1, ('48750', '9877'): 1, ('32001', '3333'): 1, ('32001', '48750'): 1, ('11202', '3333'): 1, ('11202', '365'): 2, ('365', '365'): 1}
Putting this back into a cross-product table is a bit of work, the key bit is spliitng the tuples into columns by calling .apply(pd.Series)
and eventually moving one of the columns to the column names via unstack
:
(pd.DataFrame.from_dict(count_dict, orient='index') .reset_index(0) .set_index(0)['index'] .apply(pd.Series) .rename(columns = {0:'pid1',1:'pid2'}) .reset_index() .rename(columns = {0:'count'}) .set_index(['pid1', 'pid2'] ) .unstack() .fillna(0))
this produces a ‘compact’ form of the table you are after that only includes products that appeared in at least one pair
count pid2 3333 365 48750 9877 pid1 11202 1.0 2.0 0.0 0.0 32001 1.0 0.0 1.0 0.0 3333 0.0 3.0 2.0 1.0 365 0.0 1.0 1.0 1.0 48750 0.0 0.0 0.0 1.0
UPDATE Here is a rather simplified version of the above, following various discussions in the comments
import numpy as np import pandas as pd from collections import Counter # we start as in the original solution but use permutations not combinations all_pairs = [] for _, group in df.sort_values('product_id').groupby('order_id'): all_pairs += list(itertools.permutations(group['product_id'],2)) count_dict = dict(Counter(all_pairs)) # We create permutations for _all_ product_ids ... note we use unique() but also product(..) to allow for (365,265) combinations total_pairs = list(itertools.product(df['product_id'].unique(),repeat = 2)) # pull out first and second elements separately pid1 = [p[0] for p in total_pairs] pid2 = [p[1] for p in total_pairs] # and get the count for those permutations that exist from count_dict. Use 0 # for those that do not count = [count_dict.get(p,0) for p in total_pairs] # Now a bit of dataFrame magic df_cross = pd.DataFrame({'pid1':pid1, 'pid2':pid2, 'count':count}) df_cross.set_index(['pid1','pid2']).unstack()
and we are done. df_cross
below
count pid2 11202 32001 3333 365 48750 9877 pid1 11202 0 0 1 2 0 0 32001 0 0 1 0 1 0 3333 1 1 0 3 2 1 365 2 0 3 2 1 1 48750 0 1 2 1 0 1 9877 0 0 1 1 1 0