Pandas Fuzzy Matching

I want to check the accuracy of a column of addresses in my dataframe against a column of addresses in another dataframe, to see if they match and how well they match. However, it seems that it takes a long time to go through the addresses and perform the calculations. There are 15000+ addresses in my main dataframe and around 50 addresses in my reference dataframe. It ran for 5 minutes and still hadn’t finished.

My code is:

import pandas as pd
from fuzzywuzzy import fuzz, process

### Main dataframe
df = pd.read_csv("adressess.csv", encoding="cp1252")

#### Reference dataframe
ref_df = pd.read_csv("ref_addresses.csv", encoding="cp1252")

### Variable for accuracy scoring
accuracy = 0

for index, value in df["address"].iteritems():

    ### This gathers the index from the correct address column in the reference df
    ref_index = ref_df["correct_address"][
        ref_df["correct_address"]
        == process.extractOne(value, ref_df["correct_address"])[0]
    ].index.toList()[0]

    ### if each row can score a max total of 1, the ratio must be divided by 100
    accuracy += (
        fuzz.ratio(df["address"][index], ref_df["correct_address"][ref_index]) / 100
    )

Is this the best way to loop through a column in a dataframe and fuzzy match it against another? I want the score to be a ratio because later I will then output an excel file with the correct values and a background colour to indicate what values were wrong and changed.

I don’t believe fuzzywuzzy has a method that allows you to pull the index, value and ration into one tuple – just value and ratio of match.

Answer

Hopefully the below code (with links to dummy data) helps show what is possible. I tried to use street addresses to mock up a similar situation so it is easier to compare with your dataset; obviously it is no where near as big.

You can pull the csv text from the links in the comments and run it and see what could work on your larger sample.

For five addresses in the reference frame and 100 contacts in the other its execution timings are:

CPU times: user 107 ms, sys: 21 ms, total: 128 ms
Wall time: 137 ms

The below code should be quicker than .iteritems() etc.

Code:

# %%time

import pandas as pd
from fuzzywuzzy import fuzz, process
import difflib

# create 100-contacts.csv from data at: https://pastebin.pl/view/3a216455
df = pd.read_csv('100-contacts.csv')

#  create ref_addresses.csv from data at: https://pastebin.pl/view/6e992fe8
ref_df = pd.read_csv('ref_addresses.csv')

# function used for fuzzywuzzy matching
def match_addresses(add, list_add, min_score=0):
    max_score = -1
    max_add = ''
    for x in list_add:
        score = fuzz.ratio(add, x)
        if (score > min_score) & (score > max_score):
            max_add = x
            max_score = score
    return (max_add, max_score)

# given current row of ref_df (via Apply) and series (df['address'])
# return the fuzzywuzzy score
def scoringMatches(x, s):
    o = process.extractOne(x, s, score_cutoff = 60)
    if o != None:
        return o[1]
    
# creating two lists from address column of both dataframes
contacts_addresses = list(df.address.unique())
ref_addresses = list(ref_df.correct_address.unique())

# via fuzzywuzzy matching and using scoringMatches() above
# return a dictionary of addresses where there is a match
# the keys are the address from ref_df and the associated value is from df (i.e., 'huge' frame)
# example:
# {'86 Nw 66th Street #8673': '86 Nw 66th St #8673', '1 Central Avenue': '1 Central Ave'}

names = []
for x in ref_addresses:
    match = match_addresses(x, contacts_addresses, 75)
    if match[1] >= 75:
        name = (str(x), str(match[0]))
        names.append(name)
name_dict = dict(names)

# create new frame from fuzzywuzzy address matches dictionary
match_df = pd.DataFrame(name_dict.items(), columns=['ref_address', 'matched_address'])

# add fuzzywuzzy scoring to original ref_df
ref_df['fuzzywuzzy_score'] = ref_df.apply(lambda x: scoringMatches(x['correct_address'], df['address']), axis=1)

# merge the fuzzywuzzy address matches frame with the reference frame
compare_df = pd.concat([match_df, ref_df], axis=1)
compare_df = compare_df[['ref_address', 'matched_address', 'correct_address', 'fuzzywuzzy_score']].copy()

# add difflib scoring for a bit of interest.  
# a random thought passed through my head maybe this is interesting?
compare_df['difflib_score'] = compare_df.apply(lambda x : difflib.SequenceMatcher
                                               (None, x['ref_address'], x['matched_address']).ratio(),axis=1)

# clean up column ordering ('correct_address' and 'ref_address' are basically 
# copies of each other, but shown for completeness)
compare_df = compare_df[['correct_address', 'ref_address', 'matched_address',
                         'fuzzywuzzy_score', 'difflib_score']]

# see what we've got
print(compare_df)

# remember: correct_address and ref_address are copies 
# so just pick one to compare to matched_address

           correct_address              ref_address         matched_address  
0  86 Nw 66th Street #8673  86 Nw 66th Street #8673     86 Nw 66th St #8673   
1   2737 Pistorio Rd #9230   2737 Pistorio Rd #9230  2737 Pistorio Rd #9230   
2       6649 N Blue Gum St       6649 N Blue Gum St      6649 N Blue Gum St   
3       59 n Groesbeck Hwy       59 n Groesbeck Hwy      59 N Groesbeck Hwy   
4         1 Central Avenue         1 Central Avenue           1 Central Ave   

   fuzzywuzzy_score  difflib_score  
0                90       0.904762  
1               100       1.000000  
2               100       1.000000  
3               100       0.944444  
4                90       0.896552