I have an imbalanced dataset in Python: 95% of the values are 0 and 5% are 1.
How can I undersample it so that the number of zeros is reduced to only 25% of the input dataset?
I ask because the undersampling code I find online always balances the dataset to 50% of 0 and 50% of 1, which I do not want. I only want to reduce the number of zeros to the level of 25% of the dataset.
How can I do that in Python? Do you have some example code?
To apply different rules to different values, you can use groupby. As you didn’t give an example dataset, I’m just using a dataframe with a column col, which has 19 zeros and 1 one:
    >>> df.shape
    (20, 2)
    >>> df['col'].value_counts() / len(df)
    0    0.95
    1    0.05
    Name: col, dtype: float64
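The construction of df is not shown, so here is a minimal sketch of one dataframe that would match the shape and proportions above. The contents of the second column are an assumption: the sampled rows in this answer pair index 6 with g, 16 with q and 19 with t, which suggests foo simply holds the letters a to t.

```python
import string

import pandas as pd

# Assumed reconstruction of the example frame: 19 zeros followed by a single one.
# The 'foo' values (letters a..t) are inferred from the sampled rows, not given.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

print(df.shape)
print(df["col"].value_counts() / len(df))
```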
groupby.sample() doesn’t allow setting different numbers or fractions per group, so we can simply use groupby.apply(), which itself can call sample() on the dataframes:
    >>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
            col foo
    col
    0   6     0   g
        16    0   q
        3     0   d
        14    0   o
        15    0   p
    1   19    1   t
    >>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
            col foo
    col
    0   16    0   q
        5     0   f
        13    0   n
        2     0   c
        9     0   j
    1   19    1   t
Note that this relies on the fact that groupby sets a .name property on each group's dataframe before passing it to apply; .name holds the value that defines the group.
You can add .droplevel('col') at the end to remove the first index level.
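If you would rather not depend on groupby.apply and .name, the same result can be sketched with a boolean mask and pd.concat. This is a different technique from the one above, not a drop-in equivalent. The helper name undersample_value and its parameters are hypothetical, and the example frame is the assumed 19-zeros-and-1-one dataframe.

```python
import pandas as pd


def undersample_value(df, column, value, frac, random_state=None):
    """Keep a random `frac` of the rows where df[column] == value.

    All other rows are kept unchanged. Hypothetical helper for illustration.
    """
    mask = df[column] == value
    # Sample only the rows holding the over-represented value.
    sampled = df[mask].sample(frac=frac, random_state=random_state)
    # Put the untouched rows back and restore the original row order.
    return pd.concat([sampled, df[~mask]]).sort_index()


# Assumed example data: 19 zeros and 1 one, as in the answer.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

reduced = undersample_value(df, "col", value=0, frac=0.25, random_state=0)
print(reduced["col"].value_counts())
```

Because the result keeps the original flat index, no droplevel call is needed here. Passing random_state makes the sample reproducible; leave it as None to get a different sample on each call.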