pandas dataframe groupby conditional count on multi-level column

Let’s say we have dataframe like this

np.random.seed(123)
df = pd.DataFrame(np.random.randint(100,size=(4, 4)),columns = pd.MultiIndex.from_product([['exp0','exp1'],['rnd0','rnd1']],names=['experiments','rnd_runs']))

df['grp1','cat'] = ['A','A','B','B']
df['grp2','cat2'] = ['C','C','C','B']

experiments exp0            exp1            grp1 grp2
rnd_runs    rnd0    rnd1    rnd0    rnd1    cat cat2
0             66    92      98      17      A   C
1             83    57      86      97      A   C
2             96    47      73      32      B   C
3             46    96      25      83      B   B

I want to count values in ('exp0', 'rdn0') column with groupby ('grp1','cat')

So I tried;

df['exp0_cnt','rdn0'] = df.groupby([('grp1','cat')])[('exp0', 'rdn')].apply(sum(x > 50 for x in df[(('exp0', 'rdn'))]))

but getting the error

TypeError: other must be a MultiIndex or a list of tuples

Here are the similar posts and I think I’m doing multi-level column callings with tuples.

conditional on multi header pandas dataframe

pandas dataframe groupby on multiindex

Better way for creating columns in a multi level columns pandas dataframe

Answer

The only ways to select MultiIndex columns from a groupby is with a list of tuples or a MultiIndex (as indicated by the Error Message):

So, instead of [('exp0', 'rdn')] it needs to be [[('exp0', 'rdn')]], then it just needs to be a valid column name like ('exp0', 'rnd0'), for example.

df['exp0_cnt', 'rdn0'] = (
    df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]]
        #                                   ^ need to use valid column name
        #                         ^ needs to be a list of tuples
        .transform(lambda x: x.gt(50).sum())  # Some function that works
)

*I’ve also changed the apply function, because it seems to be missing the lambda so I made a guess as to an equivalent:

.apply(sum(x > 50 for x in df[(('exp0', 'rdn'))])

To transform since it’s being assigned back to the DataFrame:

.transform(lambda x: x.gt(50).sum())

df:

experiments exp0      exp1      grp1 grp2 exp0_cnt
rnd_runs    rnd0 rnd1 rnd0 rnd1  cat cat2     rdn0
0             66   92   98   17    A    C        2
1             83   57   86   97    A    C        2  # 2 values over 50 (in group)
2             96   47   73   32    B    C        1
3             46   96   25   83    B    B        1  # 1 values over 50 (in group)

Please Note: This means that a SeriesGroupBy cannot be created by selecting MultiIndex columns, only DataFrameGroupBy operations.

type(df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]])
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

This will exclude a few operations like SeriesGroupBy.unique

df.groupby([('grp1', 'cat')])[[('exp0', 'rnd0')]].unique()
AttributeError: 'DataFrameGroupBy' object has no attribute 'unique'

However, we can force a SeriesGroupBy by Selecting the Series from the DataFrame and grouping by the Series values directly:

df[('exp0', 'rnd0')].groupby(df[('grp1', 'cat')]).unique()
#  ^ select specific column  ^ pass the Series to groupby directly
(grp1, cat)
A    [66, 83]
B    [96, 46]
Name: (exp0, rnd0), dtype: object