I am trying to add a column to the dataframe below that tells me whether a person belongs to the category Green or not. It would just show Y or N, depending on whether the column category contains it for that person. The problem is that the column category contains just a string in some rows, a list of strings in others, and even a list of lists in others.
import pandas as pd

df = pd.DataFrame({
    'user': ['Bob', 'Jane', 'Theresa', 'Alice'],
    'category': [[['green'], ['red']], 'blue', ['green'], [['yellow', 'purple'], 'green', 'brown']]
})
How can I check, for each row, whether the column contains the specific ‘Green’ string?
Thank you.
Answer
I would not bother flattening the list, just use basic string matching:
df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

Add the word boundary check \b so we don’t accidentally match words like “greenery” or “greenwich”, which have “green” as part of a larger word.
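To see what the word boundary buys you, here is a minimal sketch comparing the pattern with and without \b. The sample strings are made up purely for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the boundary check.
s = pd.Series(["['green']", "['greenery']", "greenwich", "[['red'], 'green']"])

# Without \b, any string containing the letters "green" matches.
without_boundary = s.str.contains(r'green')

# With \b, "green" must stand alone as a whole word.
with_boundary = s.str.contains(r'\bgreen\b')

print(pd.DataFrame({'value': s,
                    'without_boundary': without_boundary,
                    'with_boundary': with_boundary}))
```

Only the first and last values should still match once the boundaries are in place, because quotes and brackets count as non-word characters.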
df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                         category has_green
0      Bob                 [[green], [red]]         Y
1     Jane                             blue         N
2  Theresa                          [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y
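If the real data might spell the category with a capital letter (the question writes ‘Green’ while the sample uses lowercase), one possible variation is to pass case=False to str.contains. This is a sketch under that assumption; the two-row frame and the name has_green below are hypothetical and exist only to show the effect.

```python
import pandas as pd

# Hypothetical two-row frame with a capitalised 'Green' to show the effect.
df_mixed = pd.DataFrame({
    'user': ['Bob', 'Jane'],
    'category': [[['Green'], ['red']], 'blue'],
})

# case=False makes the regex match ignore letter case.
has_green = (df_mixed['category'].astype(str)
                                 .str.contains(r'\bgreen\b', case=False)
                                 .map({True: 'Y', False: 'N'}))

print(df_mixed.assign(has_green=has_green))
```

Whether case-insensitive matching is appropriate depends on how the categories were entered, so treat this as optional.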