How to categorize values in different rows in pandas dataframe as low, medium and high based on different conditions?

I have a pandas dataframe df which looks as follows:

country                   India
year                       2015         2014         2013         2012         2011         2010
GDP per capita      1605.605445  1573.885642  1449.610451  1443.882435  1458.104066  1357.563727
CO2 per capita         1.641198     1.649328     1.535560     1.507821     1.408316     1.349214
Electricity Access    88.000000    83.585213    80.738045    79.900000    67.600000    76.300000

It has three rows, one each for GDP per capita, CO2 per capita and Electricity Access, for India from 2010 to 2015. I want to categorize the values in each row of the dataframe based on a condition specific to that row.

I am familiar with how to categorize all rows based on a uniform condition, e.g.

df.apply(lambda x: pd.cut(x, bins = 3, labels = ["low", "medium", "high"]), axis = 1) returns something like the following:

country            India
year                2015    2014    2013    2012    2011    2010
GDP per capita      high    high  medium  medium  medium     low
CO2 per capita      high    high  medium  medium     low     low
Electricity Access  high    high  medium  medium     low  medium

Above, the values at the higher end of each row are categorized as high, those at the lower end as low, and those in the middle range as medium. This is correct for the first row, GDP per capita.
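To see why each row splits this way: with an integer `bins`, `pd.cut` divides the row's own min-to-max range into equal-width intervals. A minimal sketch for the GDP row, using `retbins=True` to expose the computed edges (the variable names `gdp`, `categories` and `edges` are illustrative only):

```python
import pandas as pd

gdp = pd.Series(
    [1605.605445, 1573.885642, 1449.610451, 1443.882435, 1458.104066, 1357.563727],
    index=[2015, 2014, 2013, 2012, 2011, 2010],
)

# retbins=True returns the computed bin edges alongside the categories
categories, edges = pd.cut(
    gdp, bins=3, labels=["low", "medium", "high"], retbins=True
)

print(edges)       # four edges delimiting three equal-width bins
print(categories)  # label assigned to each year
```

The two inner edges fall near 1440.2 and 1522.9, which is why 2012 (1443.88) lands just inside medium rather than low.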

However, for the second row, I’d like to categorize the values with lower CO2 per capita as high and vice versa. For the third row, I’d like to categorize the Electricity Access below 80% as low, between 80 and 85% as medium, and above 85% as high.

I’d like to have a function, since I have many indicators like these. What would a function look like that categorizes the given pandas DataFrame df with an individual condition for each row, as explained above? Or is it possible to categorize the values in each row one by one? What would be a good approach?

Note: df.to_dict() looks as follows:

{('India', '2015'): {'GDP per capita': 1605.60544457045,
  'CO2 per capita': 1.64119839274392,
  'Electricity Access': 88.0},
 ('India', '2014'): {'GDP per capita': 1573.88564183014,
  'CO2 per capita': 1.6493275187685,
  'Electricity Access': 83.5852127075195},
 ('India', '2013'): {'GDP per capita': 1449.61045069632,
  'CO2 per capita': 1.5355600591395,
  'Electricity Access': 80.7380447387695},
 ('India', '2012'): {'GDP per capita': 1443.88243476181,
  'CO2 per capita': 1.50782097489256,
  'Electricity Access': 79.9},
 ('India', '2011'): {'GDP per capita': 1458.10406619626,
  'CO2 per capita': 1.40831559281322,
  'Electricity Access': 67.6},
 ('India', '2010'): {'GDP per capita': 1357.5637268318,
  'CO2 per capita': 1.34921446581292,
  'Electricity Access': 76.3}}

Answer

We can do something like:

import pandas as pd


def cut_dataframe(df_, rules):
    """
    Select rows by index label and build a new DataFrame from cut rules

    :param df_: DataFrame to process
    :param rules: Dictionary of rules. Keys are row index labels;
    values are dictionaries of kwargs for pd.cut
    :return: New DataFrame with the categorized values
    """
    new_df = pd.DataFrame(columns=df_.columns)
    for idx, kwargs in rules.items():
        # Cut each row with its own bins and labels
        new_df.loc[idx] = pd.cut(df_.loc[idx], **kwargs)
    return new_df

This processes each index label according to its own rule.

A sample usage for this example is:

df = pd.DataFrame({
    1: [1605.605445, 1.641198, 88.0],
    2: [1573.885642, 1.649328, 83.585213],
    3: [1449.610451, 1.53556, 80.738045],
    4: [1443.882435, 1.507821, 79.9],
    5: [1458.104066, 1.408316, 67.6],
    6: [1357.563727, 1.349214, 76.3]
}, index=['GDP per capita', 'CO2 per capita', 'Electricity Access'])
df.columns = pd.MultiIndex.from_product(
    [['India'], range(2015, 2009, -1)], names=['Country', 'year']
)


out_df = cut_dataframe(df, {
    'GDP per capita': dict(bins=3, labels=['low', 'medium', 'high']),
    'CO2 per capita': dict(bins=3, labels=['high', 'medium', 'low']),
    'Electricity Access': dict(bins=[0, 80, 85, 100],
                               labels=['low', 'medium', 'high'])
})

out_df:

Country            India                                      
year                2015    2014    2013    2012    2011  2010
GDP per capita      high    high  medium  medium  medium   low
CO2 per capita       low     low  medium  medium    high  high
Electricity Access  high  medium  medium     low     low   low
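One detail worth checking with explicit bin edges is which side of each interval is closed. By default `pd.cut` uses `right=True`, so `bins=[0, 80, 85, 100]` means (0, 80], (80, 85], (85, 100], and a value of exactly 80 is labelled low. The sketch below compares this with `right=False`; the sample values and variable names are made up for illustration, and the `float("inf")` upper edge is an assumption to keep 100 inside the last bin.

```python
import pandas as pd

# Illustrative values, including the two boundary cases 80.0 and 85.0
access = pd.Series([79.9, 80.0, 83.585213, 85.0, 88.0])
labels = ["low", "medium", "high"]

# Default right=True: intervals are (0, 80], (80, 85], (85, 100]
right_closed = pd.cut(access, bins=[0, 80, 85, 100], labels=labels)

# right=False: intervals are [0, 80), [80, 85), [85, inf)
left_closed = pd.cut(
    access, bins=[0, 80, 85, float("inf")], labels=labels, right=False
)

print(pd.DataFrame({
    "value": access,
    "right_closed": right_closed,
    "left_closed": left_closed,
}))
```

Which variant matches "below 80% is low" depends on how the boundary values should be treated. With the default, a value of exactly 0 would also fall outside the first bin unless `include_lowest=True` is passed.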

This could be further generalized to allow a mix of cut and qcut (and other pd functions alike):

def process_dataframe(df_, rules):
    """
    Select rows by index label and build a new DataFrame based on rules

    :param df_: DataFrame to process
    :param rules: Dictionary of rules. Keys are row index labels;
    values are dictionaries with the key `fn` (name of a pd function)
    and optionally `args` and `kwargs`, so that any pd function can be
    applied to each row
    :return: New DataFrame with the updated values
    """
    new_df = pd.DataFrame(columns=df_.columns)
    for idx, rule in rules.items():
        try:
            new_df.loc[idx] = (
                getattr(pd, rule['fn'])(
                    df_.loc[idx],
                    *rule.get('args', []),
                    **rule.get('kwargs', {})
                )
            )
        except AttributeError:
            # rule['fn'] is not a pandas function: skip this row
            pass
    return new_df

Sample usage (equivalent to the cut-only function above):

out_df = process_dataframe(df, {
    'GDP per capita': {
        'fn': 'cut',
        'kwargs': dict(bins=3, labels=['low', 'medium', 'high'])
    },
    'CO2 per capita': {
        'fn': 'cut',
        'kwargs': dict(bins=3, labels=['high', 'medium', 'low'])
    },
    'Electricity Access': {
        'fn': 'cut',
        'kwargs': dict(bins=[0, 80, 85, 100], labels=['low', 'medium', 'high'])
    }
})
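To illustrate the mix of cut and qcut mentioned above, here is a minimal self-contained sketch using the same `fn`/`args`/`kwargs` rule format. It is not the function above: for simplicity it uses plain year columns and collects the rows in a dictionary before building the result, and the names `rows` and `mixed_df` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        2015: [1605.605445, 1.641198],
        2014: [1573.885642, 1.649328],
        2013: [1449.610451, 1.53556],
        2012: [1443.882435, 1.507821],
        2011: [1458.104066, 1.408316],
        2010: [1357.563727, 1.349214],
    },
    index=["GDP per capita", "CO2 per capita"],
)

# Same rule format as above: 'fn' names a pandas function
rules = {
    # qcut: quantile-based bins, so each label gets an equal share of values
    "GDP per capita": {
        "fn": "qcut",
        "kwargs": dict(q=3, labels=["low", "medium", "high"]),
    },
    # cut: equal-width bins, with the labels reversed
    "CO2 per capita": {
        "fn": "cut",
        "kwargs": dict(bins=3, labels=["high", "medium", "low"]),
    },
}

rows = {}
for idx, rule in rules.items():
    fn = getattr(pd, rule["fn"])
    result = fn(df.loc[idx], *rule.get("args", []), **rule.get("kwargs", {}))
    # Store plain strings so rows with different categories combine cleanly
    rows[idx] = result.astype(object)

# One column per rule, transposed back to one row per indicator
mixed_df = pd.DataFrame(rows).T.reindex(columns=df.columns)
print(mixed_df)
```

With qcut the GDP row is split by rank rather than by value range, so 2012 becomes low here, whereas the equal-width cut above labelled it medium.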