Imbalanced multiclass classification dataset: undersample or oversample?

The dataset has around 150k records with four labels, ['A', 'B', 'C', 'D'], distributed as follows:
A: 60000
B: 50000
C: 36000
D: 4000

When I use scikit-learn's classification_report to get the precision, recall, and F1-score, the F1-score triggers an UndefinedMetricWarning because class D is never predicted, presumably due to its low number of records.

I know that I need to oversample or undersample to address the class imbalance.

Question: Would it be a good idea to fix the imbalance by randomly sampling 4000 records from each class so that the dataset is balanced?


I think you want to oversample your class D rather than undersample the other classes. One technique for this is the Synthetic Minority Oversampling Technique, or SMOTE.

One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.

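An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class, which is what SMOTE does: each new example is created by interpolating between an existing minority example and one of its nearest minority-class neighbors. This is a type of data augmentation for tabular data and can be very effective.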
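The idea of synthesizing examples can be sketched as follows. This is a simplified, SMOTE-style interpolation written in plain Python, not the reference implementation; the function name smote_like, its parameters, and the four class D points are all made up for illustration. For real use, a maintained implementation such as the SMOTE class in the imbalanced-learn package is probably a better choice.

```python
import math
import random

def smote_like(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the line segment
    between a minority point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_new):
        # Pick a random minority example as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]

        # Find its k nearest minority neighbours (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])

        # Move a random fraction of the way from base towards the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical class D examples with two numeric features.
class_d = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_like(class_d, n_new=6, k=2)
for point in new_points:
    print(point)
```

Because the new points are interpolated rather than copied, the model sees slightly different class D examples instead of exact repeats. The sketch assumes all features are numeric; categorical features would need different handling.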