The apply function is executed twice for the first 2 groups in a grouped pandas DataFrame

For each kind of animal, I want to get the size of the animal with the maximum weight. Here is the test code:

import numpy as np
import pandas as pd

print('numpy version =', np.__version__)
print('pandas version =', pd.__version__)
print()


def get_size_with_max_weight(subf):
    print(subf)
    return subf['size'][subf['weight'].idxmax()]

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

print(df)
print()


gf = df.groupby('animal').apply(get_size_with_max_weight)
print()
print(gf)

When I run apply on the grouped DataFrame, I expect the function to be executed once per group. However, when the result of idxmax() is used to index another column, the function is executed twice for the first 2 groups. Here is the output:

numpy version = 1.18.5
pandas version = 1.0.5

  animal size  weight  adult
0    cat    S       8  False
1    dog    S      10  False
2    cat    M      11  False
3   fish    M       1  False
4    dog    M      20  False
5    cat    L      12   True
6    cat    L      12   True

  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True
  animal size  weight  adult
1    dog    S      10  False
4    dog    M      20  False
  animal size  weight  adult
0    cat    S       8  False
2    cat    M      11  False
5    cat    L      12   True
6    cat    L      12   True
  animal size  weight  adult
1    dog    S      10  False
4    dog    M      20  False
  animal size  weight  adult
3   fish    M       1  False

animal
cat     L
dog     M
fish    M
dtype: object

As you can see, the cat and dog groups were printed twice. If I don’t use idxmax(), this doesn’t happen. What is the problem?

Answer

This is not a bug; it is by design.

The apply function needs to know the shape of the groups. Because the first two groups have different shapes, it evaluates them twice: the first time to determine the shape and the second time to run the code on them. Any side effect in the function, such as the print call in the question, therefore shows up twice for those groups.
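To check how many times the function is actually called per group on whatever pandas version is installed, a call counter can stand in for the print statement. The sketch below is only an illustration: the `calls` counter and the choice of keying it on the group's index labels are my own assumptions, not part of the question, and the counts it reports depend on the pandas version.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Hypothetical helper: counts how often each group is passed to the function.
# Groups are identified by their row labels, e.g. (0, 2, 5, 6) for cat.
calls = Counter()


def get_size_with_max_weight(subf):
    calls[tuple(subf.index)] += 1
    # Same lookup as in the question, written as a single .loc access.
    return subf.loc[subf['weight'].idxmax(), 'size']


result = df.groupby('animal').apply(get_size_with_max_weight)

print(result)
# A count above 1 means that group was evaluated more than once.
for rows, n in calls.items():
    print(rows, 'called', n, 'time(s)')
```

The returned values are the same either way; only the number of calls differs, which matters when the function has side effects.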

In pandas version 1.1.0 this behavior was changed, as mentioned on the “What’s New” page of the [documentation]:

apply and applymap on DataFrame evaluates first row/column only once

Previous behavior:

df.apply(func, axis=1)
a    1
b    3
Name: 0, dtype: int64
a    1
b    3
Name: 0, dtype: int64
a    2
b    6
Name: 1, dtype: int64
Out[4]:
   a  b
0  1  3
1  2  6

New behavior:

df.apply(func, axis=1)
a    1
b    3
Name: 0, Length: 2, dtype: int64
a    2
b    6
Name: 1, Length: 2, dtype: int64
Out[79]: 
   a  b
0  1  3
1  2  6

[2 rows x 2 columns]

Also mentioned here on GitHub.
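If the goal is only the size of the heaviest animal of each kind, one way to sidestep the question of how often apply calls the function is to skip apply altogether. The following is a minimal sketch, assuming the same df as in the question; the variable names `idx` and `result` are my own.

```python
import pandas as pd

df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
                   'size': list('SSMMMLL'),
                   'weight': [8, 10, 11, 1, 20, 12, 12],
                   'adult': [False] * 5 + [True] * 2})

# Row label of the heaviest animal in each group (first occurrence on ties).
idx = df.groupby('animal')['weight'].idxmax()

# Select those rows and index the sizes by animal.
result = df.loc[idx, ['animal', 'size']].set_index('animal')['size']

print(result)
```

Because no user function is passed to pandas here, there is nothing that could be evaluated twice.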