Finding duplicates with a tolerance and assigning them to sets in pandas

Input

  Name         A    B     C
0   aa  0.002667  2.5  13.5
1   bb  0.003400  2.5  13.7
2   cc  0.003600  1.0  13.6
3   dd  0.003667  1.0  13.6
4   aa  0.003667  1.0  13.6
5   bb  0.007600  1.0  13.6
6   cc  0.007000  1.0  13.6
7   dd  0.007000  1.0  13.6

Allowed Tolerance:

        A    B   C
0   0.003   0.2 0.2

I need to find the duplicates using the tolerance table above and map each group of duplicates to a set, as shown below.

Expected Output:

   Name     A        B   C     Set
0   aa  0.002667    2.5 13.5    1
1   bb  0.003400    2.5 13.7    1
2   cc  0.003600    1.0 13.6    2
3   dd  0.003667    1.0 13.6    2
4   aa  0.003667    1.0 13.6    2
5   bb  0.007600    1.0 13.6    3
6   cc  0.007000    1.0 13.6    3
7   dd  0.007000    1.0 13.6    3

Answer

Here is a way that is relatively fast and can be adapted to other proximity query types (for example, finding sets of points that are within a given Euclidean distance of each other). It treats proximity transitively: if a is within tolerance of b, and b is within tolerance of c, then all of a, b, c are assigned to the same set_id, regardless of whether a is within tolerance of c. This is equivalent to single-linkage clustering, but done without having to compute an O[n^2] distance matrix.

It uses two important tools:

  1. scipy.spatial.KDTree to speed up finding neighbors within a given distance.

  2. networkx to find isolated subgraphs among the neighbors.

Note about the p-norm: our understanding of the question is that we should flag pairs of points that are close to each other in all of the dimensions. If instead the goal is to find neighbors that are within the tolerance in any of the dimensions, use p=1. To group points that are within the ellipsoid with axes tol of each other (i.e., spheres in the scaled problem), use p=2.
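As a quick sanity check on those choices (not part of the original answer), the query radius r = np.linalg.norm(np.ones(len(tol)), p) that the solution below derives from the tolerance vector evaluates as follows in three dimensions:

```python
import numpy as np

# radius r = np.linalg.norm(np.ones(len(tol)), p) for a 3-dimensional tol
for p in (1, 2, np.inf):
    print(p, np.linalg.norm(np.ones(3), p))
# p=1 -> 3.0, p=2 -> sqrt(3) ~ 1.732, p=inf -> 1.0 (the hypercube case)
```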

Note on speed: this is efficient if the total number of neighbors (number of pairs of point within tolerance of each other) is small. In the extreme case where all points are close to each other, then the method presented here is O[n^2], and other methods (e.g. boxing) will be faster.

Solution

import numpy as np
import pandas as pd

import networkx as nx
from scipy.spatial import KDTree


def group_neighbors(df, tol, p=np.inf, show=False):
    # radius of the scaled query: 1 for p=inf, sqrt(n) for p=2, n for p=1
    r = np.linalg.norm(np.ones(len(tol)), p)
    # scale each dimension so that the tolerance becomes 1, then index the points
    kd = KDTree(df[tol.index] / tol)
    # one edge per (point, neighbor) pair; self-loops keep singletons in the graph
    g = nx.Graph([
        (i, j)
        for i, neighbors in enumerate(kd.query_ball_tree(kd, r=r, p=p))
        for j in neighbors
    ])
    if show:
        nx.draw_networkx(g)
    # (node, component_id) pairs from the connected components
    ix, id_ = np.array([
        (j, i)
        for i, s in enumerate(nx.connected_components(g))
        for j in s
    ]).T
    # reorder so that id_[k] is the set id of row k
    id_[ix] = id_.copy()
    return df.assign(set_id=id_)

Example 1 (OP’s description)

df = pd.DataFrame({
    'Name': ['aa', 'bb', 'cc', 'dd', 'aa', 'bb', 'cc', 'dd'],
    'A': [0.002667, 0.0034, 0.0036, 0.003667, 0.003667, 0.0076, 0.007, 0.007],
    'B': [2.5, 2.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    'C': [13.5, 13.7, 13.6, 13.6, 13.6, 13.6, 13.6, 13.6]},
)
tol = pd.Series([0.003, 0.2, 0.2], index=list('ABC'))

>>> group_neighbors(df, tol)
  Name         A    B     C  set_id
0   aa  0.002667  2.5  13.5       0
1   bb  0.003400  2.5  13.7       0
2   cc  0.003600  1.0  13.6       1
3   dd  0.003667  1.0  13.6       1
4   aa  0.003667  1.0  13.6       1
5   bb  0.007600  1.0  13.6       2
6   cc  0.007000  1.0  13.6       2
7   dd  0.007000  1.0  13.6       2

Bonus: show the neighbors graph:

_ = group_neighbors(df, tol, show=True)

Example 2: long list of neighbors

In this example, we replace A with the monotonic sequence [0, 0.1, 0.2, ...], so that each pair of consecutive points is at distance 0.1, and we set the tolerance on A to 0.12:

>>> group_neighbors(
...     df.assign(A=np.arange(0, df.shape[0]) * 0.1),
...     tol=pd.Series([0.12], index=['A']), show=True)

  Name    A    B     C  set_id
0   aa  0.0  2.5  13.5       0
1   bb  0.1  2.5  13.7       0
2   cc  0.2  1.0  13.6       0
3   dd  0.3  1.0  13.6       0
4   aa  0.4  1.0  13.6       0
5   bb  0.5  1.0  13.6       0
6   cc  0.6  1.0  13.6       0
7   dd  0.7  1.0  13.6       0

Example 3: more neighbors, still a single connected component

>>> group_neighbors(
...     df.assign(A=np.arange(0, df.shape[0]) * 0.1),
...     tol=pd.Series([0.21], index=['A']), show=True)

  Name    A    B     C  set_id
0   aa  0.0  2.5  13.5       0
1   bb  0.1  2.5  13.7       0
2   cc  0.2  1.0  13.6       0
3   dd  0.3  1.0  13.6       0
4   aa  0.4  1.0  13.6       0
5   bb  0.5  1.0  13.6       0
6   cc  0.6  1.0  13.6       0
7   dd  0.7  1.0  13.6       0

Explanation

Here are the various steps that the algorithm takes:

  1. scale all coordinates so that the tolerance becomes 1 in each dimension;
  2. make a KDTree of these transformed points;
  3. in one shot, query all pairs of points within distance r=1 of each other; note: we use the infinity norm (p=np.inf), so the query regions are hypercubes; this corresponds to finding all points within the tol bounding box of each other;
  4. make a graph where all edges denote points that are neighbors;
  5. find all isolated subgraphs: those are the sets we want to assign each member to;
  6. label the sets by a unique int (from enumerate()).

Worked example

Let’s examine step by step what happens on the OP’s example.

First, select the relevant dimensions and scale for unit tolerance:

>>> df[tol.index] / tol

          A     B     C
0  0.889000  12.5  67.5
1  1.133333  12.5  68.5
2  1.200000   5.0  68.0
3  1.222333   5.0  68.0
4  1.222333   5.0  68.0
5  2.533333   5.0  68.0
6  2.333333   5.0  68.0
7  2.333333   5.0  68.0

After this scaling, the task becomes finding any pair of points whose difference in every dimension is at most 1.
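To make that concrete, the per-dimension gaps can be checked directly with numpy (values copied from the scaled table above): rows 0 and 1 are within 1 of each other in every dimension, while rows 1 and 2 are not:

```python
import numpy as np

row0 = np.array([0.889000, 12.5, 67.5])
row1 = np.array([1.133333, 12.5, 68.5])
row2 = np.array([1.200000,  5.0, 68.0])

# Chebyshev (p=inf) distance: the largest per-dimension gap
print(np.max(np.abs(row0 - row1)))  # 1.0 -> neighbors (at most 1 apart)
print(np.max(np.abs(row1 - row2)))  # 7.5 -> not neighbors
```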

Use KDTree for a fast way of finding neighbors. Note: we use kd.query_ball_tree instead of kd.query_pairs because we want to keep singletons as well (i.e., points that are neighbors only with themselves), so that they can get their own set_id in the final output:

kd = KDTree(df[tol.index] / tol)
>>> kd.query_ball_tree(kd, r=1, p=np.inf)

[[0, 1],
 [0, 1],
 [2, 3, 4],
 [2, 3, 4],
 [2, 3, 4],
 [5, 6, 7],
 [5, 6, 7],
 [5, 6, 7]]

The graph is then constructed from all these pairs.
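For illustration, here is that construction in isolation, with the neighbor lists copied from the output above (the self-loops such as (0, 0) are what keep singleton points in the graph):

```python
import networkx as nx

# neighbor lists as returned by kd.query_ball_tree above
neighbors = [[0, 1], [0, 1], [2, 3, 4], [2, 3, 4], [2, 3, 4],
             [5, 6, 7], [5, 6, 7], [5, 6, 7]]

# one edge per (point, neighbor) pair; duplicates are collapsed by nx.Graph
g = nx.Graph([(i, j) for i, nbrs in enumerate(neighbors) for j in nbrs])
print(sorted(g.edges()))
```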

We use connected_components to obtain all subgraphs of g that are isolated from each other:

>>> list(nx.connected_components(g))

[{0, 1}, {2, 3, 4}, {5, 6, 7}]

So, we have three sets (isolated subgraphs). We can then assign an ID to each, and return the result.
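As an aside (not part of the original answer): if adding networkx as a dependency is undesirable, scipy alone can label the sets, via the KDTree's sparse distance matrix and scipy.sparse.csgraph.connected_components. A sketch on the scaled coordinates of the worked example:

```python
import numpy as np
from scipy.spatial import KDTree
from scipy.sparse.csgraph import connected_components

# scaled coordinates from the worked example (df[tol.index] / tol)
scaled = np.array([
    [0.889000, 12.5, 67.5],
    [1.133333, 12.5, 68.5],
    [1.200000,  5.0, 68.0],
    [1.222333,  5.0, 68.0],
    [1.222333,  5.0, 68.0],
    [2.533333,  5.0, 68.0],
    [2.333333,  5.0, 68.0],
    [2.333333,  5.0, 68.0],
])
kd = KDTree(scaled)
# sparse matrix of pairwise distances, keeping only pairs within r=1;
# caveat: exact duplicates have distance 0, which the sparse format does
# not store -- here they are still linked through a shared neighbor, but
# the query_ball_tree construction above is more robust in general
adj = kd.sparse_distance_matrix(kd, max_distance=1, p=np.inf)
n_sets, set_id = connected_components(adj, directed=False)
print(n_sets, set_id)
```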