Suppose I have a Pandas DataFrame like this:
    item  event  date
    A     1      2020-03-09
    B     1      2020-03-09
    A     2      2020-05-01
    B     2      2020-05-01
    C     2      2020-05-01
    A     3      2020-06-25
    C     3      2020-06-25
    B     4      2020-07-18
    C     4      2020-07-18
This dataframe contains a unique date per ‘event’ per ‘item’. So this means that an item has several events with distinct dates.
Now I would like to calculate, per item, the average number of days between consecutive dates. The value will differ for each item, so I need to average the time between successive event dates within each item.
So the expected output would look like:
    item  average_interval_in_days
    A     54
    B     65.5
    C     39.5
Does anyone have an idea how to do this?
Very similar to @BradSolomon’s answer, with two small differences:
    df.sort_values(['item', 'date']).groupby('item')['date'].agg(
        lambda g: g.diff().mean() / pd.Timedelta(days=1))

    # gives:
    item
    A    54.0
    B    65.5
    C    39.0
- ensure that the dates are sorted within each group, otherwise the mean will depend on the row order; in your example the dates happen to be sorted already, so if you can guarantee that, you may skip the sort;
- use ... / pd.Timedelta(days=1) to produce the mean difference directly in units of days.
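For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question and assumes the date column has to be parsed with pd.to_datetime first (an assumption on my part, since the question does not show the dtypes); the name avg_days is only illustrative.

```python
import pandas as pd

# Sample data reproduced from the question. The dates are parsed explicitly
# because diff() needs datetime64 values rather than strings (assumed here).
df = pd.DataFrame({
    "item":  ["A", "B", "A", "B", "C", "A", "C", "B", "C"],
    "event": [1, 1, 2, 2, 2, 3, 3, 4, 4],
    "date":  pd.to_datetime([
        "2020-03-09", "2020-03-09", "2020-05-01", "2020-05-01", "2020-05-01",
        "2020-06-25", "2020-06-25", "2020-07-18", "2020-07-18",
    ]),
})

# Sort so that diff() always compares chronologically adjacent dates,
# then average the gaps per item and express the result in days.
avg_days = (
    df.sort_values(["item", "date"])
      .groupby("item")["date"]
      .agg(lambda g: g.diff().mean() / pd.Timedelta(days=1))
)
print(avg_days)
```

For item C the two gaps in the sample data are 55 and 23 days, which is why the computed mean is 39.0.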
Alternative for speed (no sort, no lambda, but a bit more opaque):
    gb = df.groupby('item')['date']
    (gb.max() - gb.min()) / (gb.count() - 1) / pd.Timedelta(days=1)

    # gives:
    item
    A    54.0
    B    65.5
    C    39.0
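The reason the shortcut needs no sort is that consecutive gaps telescope: their sum is just the last date minus the first, and n dates have n - 1 gaps. Below is a small sketch that checks this on item A's dates from the question; the variable names are only illustrative.

```python
import pandas as pd

# Item A's dates from the question, already in chronological order.
dates = pd.Series(pd.to_datetime(["2020-03-09", "2020-05-01", "2020-06-25"]))

# Explicit gaps between consecutive dates (the first diff is NaT and is dropped).
gaps = dates.diff().dropna()

# The gaps sum to the overall span, so their mean equals span / (n - 1).
span = dates.max() - dates.min()
span_over_n = span / (len(dates) - 1)

assert gaps.sum() == span
assert gaps.mean() == span_over_n

print(span_over_n / pd.Timedelta(days=1))  # 54.0
```

Note that an item with a single date has no gaps at all, so the denominator count - 1 is zero for it; you may want to filter such items out beforehand.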