How to only extract the full words of a string in Python?

I want to extract only the full words of a string.

I have this df:

                     Students  Age
0           Boston Terry Emma   23
1      Tommy Julien Cambridge   20
2                      London   21
3                New York Liu   30
4  Anna-Madrid+       Pauline   26
5         Mozart    Cambridge   27
6             Gigi Tokyo Lily   18
7      Paris Diane Marie Dive   22

And I want to extract the FULL words from the string, NOT parts of them (e.g. Liu should be extracted only if Liu itself is in liked_names; iu should not be extracted from Liu just because iu is in the list, since Liu is not iu).

cities = ['Boston', 'Cambridge', 'Bruxelles', 'New York', 'London', 'Amsterdam', 'Madrid', 'Tokyo', 'Paris']
liked_names = ['Emma', 'Pauline', 'Tommy Julien', 'iu']

Desired df:

                     Students  Age     Cities   Liked Names
0           Boston Terry Emma   23     Boston          Emma
1      Tommy Julien Cambridge   20  Cambridge  Tommy Julien
2                      London   21     London           NaN
3                New York Liu   30   New York           NaN
4  Anna-Madrid+       Pauline   26     Madrid       Pauline
5         Mozart    Cambridge   27  Cambridge           NaN
6             Gigi Tokyo Lily   18      Tokyo           NaN
7      Paris Diane Marie Dive   22      Paris           NaN

I tried this code:

pat = f'({"|".join(cities)})'
df['Cities'] = df['Students'].str.extract(pat, expand=False)
pat = f'({"|".join(liked_names)})'
df['Liked Names'] = df['Students'].str.extract(pat, expand=False)

My code for cities works, I just need to repair the issue for the ‘Liked Names’.

How to make this work? Thanks a lot!!!


I think what you are looking for are word boundaries. In a regular expression they are written as \b. An ugly (albeit working) solution is to modify the liked_names list to include word boundaries and then run the code:

import pandas as pd

l = [
    ["Boston Terry Emma", 23],
    ["Tommy Julien Cambridge", 20],
    ["London", 21],
    ["New York Liu", 30],
    ["Anna-Madrid+       Pauline", 26],
    ["Mozart    Cambridge", 27],
    ["Gigi Tokyo Lily", 18],
    ["Paris Diane Marie Dive", 22],
]

cities = [
    "Boston",
    "Cambridge",
    "Bruxelles",
    "New York",
    "London",
    "Amsterdam",
    "Madrid",
    "Tokyo",
    "Paris",
]
liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]
# here we modify the liked_names to include word boundaries.
liked_names = [r"\b" + n + r"\b" for n in liked_names]
df = pd.DataFrame(l, columns=["Students", "Age"])

pat = f'({"|".join(cities)})'
df["Cities"] = df["Students"].str.extract(pat, expand=False)
# "iu" no longer matches inside "Liu", so row 3 gets NaN.
pat = f'({"|".join(liked_names)})'
df["Liked Names"] = df["Students"].str.extract(pat, expand=False)


A nicer solution would be to include the word boundaries in the creation of the regular expression.
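
One way this could look is sketched below, using only the standard re module so the pattern can be checked without a DataFrame. The helper names whole_word_pattern and first_match are made up for this illustration, and the re.escape call is an extra precaution that is not part of the code above.

```python
import re

liked_names = ["Emma", "Pauline", "Tommy Julien", "iu"]


def whole_word_pattern(words):
    """Build one regex that matches any of `words` only as a whole word."""
    # re.escape guards against regex metacharacters inside a name.
    alternatives = "|".join(re.escape(w) for w in words)
    # A single \b on each side of the group covers every alternative.
    return rf"\b({alternatives})\b"


pat = whole_word_pattern(liked_names)


def first_match(text):
    """Return the first whole-word match in `text`, or None."""
    m = re.search(pat, text)
    return m.group(1) if m else None


students = [
    "Boston Terry Emma",
    "Tommy Julien Cambridge",
    "New York Liu",
    "Anna-Madrid+       Pauline",
]
matches = [first_match(s) for s in students]
print(matches)
```

The same pat string should be usable with df["Students"].str.extract(pat, expand=False), since it contains exactly one capture group; the original liked_names list then stays untouched.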

I first tried using \s, i.e. whitespace, but that did not work for a name at the end of a string, so \b was the solution. The documentation of Python's re module has more details on \b.
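
The difference between the two can be seen in a small sketch with the re module alone. The variable names are only for illustration.

```python
import re

text = "Boston Terry Emma"

# \s has to consume a real whitespace character, so it cannot match
# after "Emma" when the name is the last thing in the string.
with_whitespace = re.search(r"\sEmma\s", text)

# \b is zero-width and also matches at the start or end of the string.
with_boundary = re.search(r"\bEmma\b", text)

# \b still refuses to match in the middle of a longer word such as "Liu".
inside_word = re.search(r"\biu\b", "New York Liu")

print(with_whitespace, with_boundary, inside_word)
```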