set difference for pandas

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
Out[7]: 
   col1  col2
0     1     2
1     2     3
2     3     4

In [8]: df2
Out[8]: 
   col1  col2
0     4     6
1     2     3
2     5     5

so maybe something like df2.set_diff(df1) will produce this:

   col1  col2
0     4     6
2     5     5

However, I don’t want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

Answer

from pandas import  DataFrame

df1 = DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
df2 = DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})


print(df2[~df2.isin(df1).all(1)])
print(df2[(df2!=df1)].dropna(how='all'))
print(df2[~(df2==df1)].dropna(how='all'))

Leave a Reply

Your email address will not be published. Required fields are marked *