Get dates where category change happened for each user in dataframe

I am using pandas==1.2.1

MRE:

import pandas as pd

x = pd.DataFrame({"date": ["20201211", "20201211", "20201212", "20201222", "20201222", "20201223",
                           "20201211", "20201211", "20201212", "20201222", "20201222"],
                  "userid": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
                  "category": [1, 1, 2, 2, 2, 1, 33, 33, 33, 34, 34]})

which looks like this:

        date userid  category
0   20201211      A         1
1   20201211      A         1
2   20201212      A         2
3   20201222      A         2
4   20201222      A         2
5   20201223      A         1
6   20201211      B        33
7   20201211      B        33
8   20201212      B        33
9   20201222      B        34
10  20201222      B        34

What I want is, for each user, the dates on which their category changed, together with the category it changed to.

The desired dataframe should look like this:

user         cat_changed             changed_cat
  A      [20201212, 20201223]         [2, 1]
  B          [20201222]                [34]

I've tried grouping by userid, category, and date, but I'm stuck from there.

Answer

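You could first get the transition points using diff() on the category within each user, then select those rows and aggregate them as lists. The first row of each user has no previous value, so diff() returns NaN there; fillna(0) makes sure it is not counted as a change: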

>>> transitions = x.groupby('userid').category.diff().fillna(0).ne(0)
>>> x[transitions].groupby('userid').agg(list)

                          date  category
userid
     A    [20201212, 20201223]    [2, 1]
     B              [20201222]      [34]
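
Since diff() relies on subtraction, it assumes the categories are numeric. If they might be strings, one possible variation is to compare each row with the previous one via shift() instead. The sketch below also renames the columns to match the layout asked for in the question. The names prev, changed and result are only illustrative, and it assumes the category column itself contains no missing values:

```python
import pandas as pd

x = pd.DataFrame({
    "date": ["20201211", "20201211", "20201212", "20201222", "20201222", "20201223",
             "20201211", "20201211", "20201212", "20201222", "20201222"],
    "userid": ["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "category": [1, 1, 2, 2, 2, 1, 33, 33, 33, 34, 34],
})

# Previous category within each user; NaN on each user's first row.
prev = x.groupby("userid")["category"].shift()

# A change is a row whose category differs from the previous one.
# Each user's first row is excluded because it has no previous value
# (assumption: "category" has no missing values of its own).
changed = prev.notna() & x["category"].ne(prev)

# Collect the change dates and new categories per user, using the
# column names from the desired output in the question.
result = (
    x[changed]
    .groupby("userid")
    .agg(cat_changed=("date", list), changed_cat=("category", list))
    .reset_index()
    .rename(columns={"userid": "user"})
)
print(result)
```

Like the diff() version, this assumes the rows are already in chronological order within each user; if not, sort by userid and date first.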
