Python Pandas Dataframe: add new column based on existing column, which contains lists of lists

I am trying to add a column to the dataframe below that tells me whether a person belongs to the category green or not. It would just show Y or N, depending on whether the category column contains that value for that person. The problem is that the category column holds just a string in some rows, a list of strings in others, and even a list of lists in others.

import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

How can I check, for each row, whether the column contains the specific string 'green'?

Thank you.

Answer

I would not bother flattening the list, just use basic string matching:

df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

The word boundary check \b on each side of the pattern ensures we don’t accidentally match words like “greenery” or “greenwich”, which have “green” as part of a larger word.
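To see what the word boundary changes, here is a small sketch on made-up values (the series `s` is hypothetical and not part of the question's data). It also shows `case=False`, which may be useful if the column can contain 'Green' with a capital letter; that is an assumption about the data, not something the question confirms.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the matching rules
s = pd.Series([['green'], 'greenery', [['greenwich'], 'red'], 'Green'])
text = s.astype(str)  # every row becomes its string representation

plain = text.str.contains(r'green')                       # substring match
bounded = text.str.contains(r'\bgreen\b')                 # whole-word match
bounded_ci = text.str.contains(r'\bgreen\b', case=False)  # whole word, any case

print(plain.tolist())       # [True, True, True, False]
print(bounded.tolist())     # [True, False, False, False]
print(bounded_ci.tolist())  # [True, False, False, True]
```

Without \b, 'greenery' and 'greenwich' both count as matches; with it, only the standalone word does.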


To get the Y/N column, map the boolean result to the labels:

df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                          category has_green
0      Bob                  [[green], [red]]         Y
1     Jane                              blue         N
2  Theresa                           [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y
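If you would rather compare exact values than match against the string representation, one alternative is to flatten each cell first. This is only a sketch, not the approach used above: `flatten` is a hypothetical helper written for this example, and it assumes the nesting consists solely of lists or tuples of strings.

```python
import pandas as pd

def flatten(value):
    """Yield every leaf of an arbitrarily nested list or tuple (hypothetical helper)."""
    if isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)
    else:
        yield value  # a plain string (or any other non-list value)

df = pd.DataFrame({'user': ['Bob', 'Jane', 'Theresa', 'Alice'],
                   'category': [[['green'], ['red']], 'blue', ['green'],
                                [['yellow', 'purple'], 'green', 'brown']]})

# Exact comparison against each flattened element, mapped straight to Y/N
df['has_green'] = df['category'].map(
    lambda v: 'Y' if 'green' in flatten(v) else 'N')

print(df['has_green'].tolist())  # ['Y', 'N', 'Y', 'Y']
```

This is slower than the vectorised string match, but it never matches partial words and does not depend on how lists are printed.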
