How to sample a different number of rows from each group in a DataFrame

I have a DataFrame with a category column. The DataFrame has a different number of rows for each category:

category number_of_rows
cat1     19189
cat2     13193
cat3     4500
cat4     1914
cat5     568
cat6     473
cat7     216
cat8     206
cat9     197
cat10    147
cat11    130
cat12    49
cat13    38
cat14    35
cat15    35
cat16    30
cat17    29
cat18    9
cat19    4
cat20    4
cat21    1
cat22    1
cat23    1

I want to select a different number of rows from each category (instead of a fixed number n of rows from every category).

Example inputs:

size_1 : {"cat1": 40, "cat2": 20, "cat3": 15, "cat4": 11, ...}
size_2 : {"cat1": 51, "cat2": 42, "cat3": 18, "cat4": 21, ...}

What I want to do is essentially stratified sampling with a given number of instances for each category.

Also, the rows should be selected randomly. For example, I don't want the top 40 rows for size_1["cat1"], I want 40 random rows.

Thanks for the help.

Answer

Artificial data generation


DataFrame

I've borrowed the DataFrame generation from the previous answer so the examples stay comparable.

import pandas as pd

# Define a DataFrame containing employee data
df = pd.DataFrame({'Category': ['Jai', 'Jai', 'Jai', 'Princi', 'Princi'],
                   'Age': [27, 24, 22, 32, 15],
                   'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj', 'Noida'],
                   'Qualification': ['Msc', 'MA', 'MCA', 'Phd', '10th']})

Sampling rule

# Number of rows that we want sampled from each category
samples_per_group_dict = {'Jai': 1,
                          'Princi': 2}
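
Both solutions below assume that every category present in the frame has an entry in this dictionary. A quick optional sanity check (my addition, not part of the original answer):

    # Verify that every category in df has a requested sample size
    # (assumption: a missing key should be treated as an error).
    missing = set(df['Category'].unique()) - set(samples_per_group_dict)
    assert not missing, f"No sample size given for: {missing}"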


Problem solving


I can propose two solutions:

  1. Apply on groupby (one-liner)

    output = df.groupby('Category').apply(lambda group: group.sample(samples_per_group_dict[group.name])).reset_index(drop=True)
    
  2. Looping groups (more verbose)

    list_of_sampled_groups = []
    
    for name, group in df.groupby('Category'):    
        n_rows_to_sample = samples_per_group_dict[name]
        sampled_group = group.sample(n_rows_to_sample)
        list_of_sampled_groups.append(sampled_group)
    
    output = pd.concat(list_of_sampled_groups).reset_index(drop=True)
    

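A caveat for both approaches: DataFrame.sample raises a ValueError when the requested size exceeds the group size (as it would for tiny categories like cat21–cat23 in the question, unless you pass replace=True). A hedged variant, assuming it is acceptable to return at most the whole group, clips the size; passing random_state also makes the draw reproducible:

    # Clip the requested size to the group size to avoid ValueError on small groups
    # (assumption: returning fewer rows than requested is acceptable there).
    output = (df.groupby('Category')
                .apply(lambda group: group.sample(min(samples_per_group_dict[group.name],
                                                      len(group)),
                                                  random_state=42))
                .reset_index(drop=True))
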
Performance should be roughly the same for both approaches. If performance matters, you can vectorize the calculation, but the exact optimization depends on the number of groups and the number of rows sampled from each group.
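
For example, one vectorized sketch (my own suggestion, not part of the original answer) shuffles the whole frame once and then keeps the first k rows of each group via a cumulative count, avoiding per-group Python calls:

    # Shuffle the whole frame once, then keep the first k rows of each category,
    # where k is looked up in samples_per_group_dict.
    shuffled = df.sample(frac=1, random_state=42)
    position_within_group = shuffled.groupby('Category').cumcount()
    requested_size = shuffled['Category'].map(samples_per_group_dict)
    output = shuffled[position_within_group < requested_size].reset_index(drop=True)

Because the initial shuffle is uniform, taking the first k rows of each group is still a uniform random sample of size k per group.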