Split Column on regex

I really struggle with regex, and I’m hoping for some help.

I have columns that look like this

import pandas as pd

data = {'Location': ['Building A, 100 First St City, State', 'Fire Station # 100, 2 Apple Row, City, State Zip', 'Church , 134 Baker Rd City, State']}

df = pd.DataFrame(data)

                                          Location
0              Building A, 100 First St City, State
1  Fire Station # 100, 2 Apple Row, City, State Zip
2                 Church , 134 Baker Rd City, State

I would like to get it to look like the output below by splitting wherever a comma is followed by a space and then a number. However, I'm running into an issue where the split also removes the number.

        Location Name                       Address
0          Building A      100 First St City, State
1  Fire Station # 100  2 Apple Row, City, State Zip
2              Church      134 Baker Rd City, State

This is the code I’ve been using

df['Location Name'] = df['Location'].str.split(r'.,\s\d', expand=True)[0]
df['Address'] = df['Location'].str.split(r'.,\s\d', expand=True)[1]

Answer

You can use Series.str.extract:

df[['Location Name','Address']] = df['Location'].str.extract(r'^(.*?),\s(\d.*)', expand=True)

The ^(.*?),\s(\d.*) regex matches

  • ^ – start of string
  • (.*?) – Group 1 (‘Location Name’): zero or more characters other than line break characters, as few as possible
  • ,\s – a comma followed by a whitespace character
  • (\d.*) – Group 2 (‘Address’): a digit and the rest of the line.

See the regex demo.
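
For reference, here is a minimal end-to-end sketch using the sample data from the question. The original split attempt loses the number because str.split discards whatever the pattern matches, and that pattern includes the digit. The .str.strip() call and the lookahead-based split at the end are additions for illustration, not part of the answer above; the names parts and alt are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    "Location": [
        "Building A, 100 First St City, State",
        "Fire Station # 100, 2 Apple Row, City, State Zip",
        "Church , 134 Baker Rd City, State",
    ]
})

# Group 1: everything before the first ", <digit>"; Group 2: that digit onward.
parts = df["Location"].str.extract(r"^(.*?),\s(\d.*)", expand=True)

# strip() is an addition: it drops the trailing space left in "Church ".
df["Location Name"] = parts[0].str.strip()
df["Address"] = parts[1]

# Alternative sketch: split once on ", " only where a digit follows.
# The lookahead (?=\d) checks for the digit without consuming it.
# A multi-character pattern is treated as a regex by default.
alt = df["Location"].str.split(r",\s(?=\d)", n=1, expand=True)

print(df[["Location Name", "Address"]])
```

Both approaches split at the first comma that is followed by whitespace and a digit, so a row with no such comma would come back as NaN from extract and as a single unsplit piece from split.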