Reorienting the pandas dataframe with pivot table and groupby function to make multiple row (string) values appear in a single column

Any suggestion what is happening here? Here is my dataframe with application number, applicant, gender, title of the application and multiple keywords per application.

import pandas as pd

df = pd.DataFrame({'Application number': [1,1,1,1,1,1,2,2,2,2,2,3,3,3],
        'Applicant': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson','John Johnson','Ryan Rosling','Ryan Rosling','Ryan Rosling','Ryan Rosling','Ryan Rosling','Gee Gettysbeurg','Gee Gettysbeurg','Gee Gettysbeurg'],
        'Gender': ['m','m','m','m','m','m','m','m','m','m','m','v','v','v'],
        'Title':['Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry','Symplectic symmetry', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'Exploring protein synthesis', 'auxin-response pathway','auxin-response pathway','auxin-response pathway' ],
        'Keyword':['Maths', 'trignometry', 'Analysis', 'Algebra', 'Hypothesis', 'DNA', 'mitochondria', 'RNA', 'Brain', 'Cellstructure', 'Life', 'Blood', 'analysis', 'blood circulation']  })
df.head()

I used to reorient the table to make keywords appear in a single cell per applicant by following code:

df_pv2=pd.pivot_table(df,index=['Application number', 'Applicant', 'Gender', 'Title'],columns=df.groupby(['Application number', 'Applicant', 'Gender', 'Title']).cumcount().add(1),values=['Keyword'], aggfunc='sum')
df_pv2.columns=df_pv2.columns.map('{0[0]}{0[1]}'.format) 

But now it does not reorient the keyword columns. For the sake of completeness, hereunder the code to clean the table: dropping the not required columns

#joining all keywords together, seperated by comma
df_pv2['Keywords'] = df_pv2.loc[:, 'Keyword1':'Keyword6'].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
df_pv2=df_pv2.drop(df_pv2.loc[:, 'Keyword1':'Keyword6'].columns, axis = 1) 
#resetting index
df_pv2=df_pv2.reset_index()
df_pv2.head() 

Answer

aggfunc='sum' will only operate on numeric values, for this reason, this code:

df_pv2 = pd.pivot_table(df,
                        index=['Application number', 'Applicant', 'Gender',
                               'Title'],
                        columns=df.groupby(
                            ['Application number', 'Applicant', 'Gender',
                             'Title']
                        ).cumcount().add(1),
                        values=['Keyword'],
                        aggfunc='sum')

Produces:

Empty DataFrame
Columns: []
Index: [(1, John Johnson, m, Symplectic symmetry), (2, Ryan Rosling, m, Exploring protein synthesis), (3, Gee Gettysbeurg, v, auxin-response pathway)]

Which is why the resulting DataFrame has no keywords.


One fix would be to use aggfunc='first':

df_pv2 = pd.pivot_table(df,
                        index=['Application number', 'Applicant', 'Gender',
                               'Title'],
                        columns=df.groupby(
                            ['Application number', 'Applicant', 'Gender',
                             'Title']
                        ).cumcount().add(1),
                        values=['Keyword'],
                        aggfunc='first')

df_pv2.columns = df_pv2.columns.map('{0[0]}{0[1]}'.format)

# joining all keywords together, seperated by comma
df_pv2['Keywords'] = df_pv2.loc[:, 'Keyword1':'Keyword6'].apply(
    lambda x: ', '.join(x[x.notnull()]), axis=1)
df_pv2 = df_pv2.drop(df_pv2.loc[:, 'Keyword1':'Keyword6'].columns, axis=1)
# resetting index
df_pv2 = df_pv2.reset_index()

Which results in:

   Application number        Applicant Gender                        Title                                                Keywords
0                   1     John Johnson      m          Symplectic symmetry  Maths, trignometry, Analysis, Algebra, Hypothesis, DNA
1                   2     Ryan Rosling      m  Exploring protein synthesis           mitochondria, RNA, Brain, Cellstructure, Life
2                   3  Gee Gettysbeurg      v       auxin-response pathway                      Blood, analysis, blood circulation

However, a simple groupby agg can be used to get directly to this point from df without a pivot_table:

df_pv2 = df.groupby(
    ['Application number', 'Applicant', 'Gender', 'Title'],
)['Keyword'].agg(', '.join).reset_index(name='Keywords')

df_pv2:

   Application number        Applicant Gender                        Title                                                Keywords
0                   1     John Johnson      m          Symplectic symmetry  Maths, trignometry, Analysis, Algebra, Hypothesis, DNA
1                   2     Ryan Rosling      m  Exploring protein synthesis           mitochondria, RNA, Brain, Cellstructure, Life
2                   3  Gee Gettysbeurg      v       auxin-response pathway                      Blood, analysis, blood circulation