How to undersample category 0 to 25% of its input size without changing category 1 in Python?

I have an imbalanced dataset in Python: 95% of the values are 0 and 5% are 1.

How can I undersample so that only 25% of the zeros in the input dataset remain?

I ask because the undersampling code I find online always balances the dataset to 50% zeros and 50% ones, and I do not want that. I only want to reduce the number of zeros to 25% of what is in the dataset.

How can I do that in Python? Do you have some example code?

Answer

To apply different rules to different values, you can use groupby. As you didn’t give an example dataset, I’m just using a 20-row dataframe with a column col, which has 19 zeros and 1 one:

>>> df.shape
(20, 2)
>>> df['col'].value_counts() / len(df)
0      0.95
1      0.05
Name: col, dtype: float64
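
For anyone who wants to follow along, a frame like this can be built as in the sketch below. The contents of the second column foo (the letters a to t) are an assumption inferred from the printed output further down, not something stated in the answer.

```python
import pandas as pd

# Assumed stand-in for the example frame: 19 zeros and a single one in 'col',
# plus a filler column 'foo' holding the letters a..t (inferred, not given).
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list("abcdefghijklmnopqrst"),
})

print(df.shape)
print(df["col"].value_counts(normalize=True))
```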

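Now groupby.sample() applies the same number or fraction to every group, so it can’t treat the groups differently. Instead we can use groupby.apply(), which can call sample() on each group’s dataframe with its own fraction. Running it twice shows two different random draws: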

>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   6     0   g
    16    0   q
    3     0   d
    14    0   o
    15    0   p
1   19    1   t
>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   16    0   q
    5     0   f
    13    0   n
    2     0   c
    9     0   j
1   19    1   t

Note that this relies on apply making each group’s key available to the function as the .name attribute of the group dataframe, which is how the lambda tells the zeros from the ones.

You can add .droplevel('col') at the end to remove the first index level.
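
If you would rather not rely on .name inside apply, the same result can be sketched with a boolean mask and pd.concat. The helper name undersample_class and its parameters are hypothetical, chosen for illustration; random_state is passed only to make the draw reproducible.

```python
import pandas as pd


def undersample_class(df, column, value, frac, random_state=None):
    """Keep a random `frac` of the rows where df[column] == value.

    All other rows are kept unchanged. (Hypothetical helper, not a pandas API.)
    """
    mask = df[column] == value
    kept = df[mask].sample(frac=frac, random_state=random_state)
    # Recombine with the untouched rows and restore the original row order.
    return pd.concat([kept, df[~mask]]).sort_index()


# Same assumed example frame as above: 19 zeros, 1 one.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

reduced = undersample_class(df, "col", 0, 0.25, random_state=42)
print(reduced["col"].value_counts())
```

With 19 zeros, frac=0.25 is rounded to 5 rows, which matches the five zeros in the output above, and the single row with col equal to 1 is always kept.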