# finding duplicates with tolerance and assign to a set in pandas

Input

```  Name         A    B     C
0   aa  0.002667  2.5  13.5
1   bb  0.003400  2.5  13.7
2   cc  0.003600  1.0  13.6
3   dd  0.003667  1.0  13.6
4   aa  0.003667  1.0  13.6
5   bb  0.007600  1.0  13.6
6   cc  0.007000  1.0  13.6
7   dd  0.007000  1.0  13.6
```

Allowed Tolerance:

```        A    B   C
0   0.003   0.2 0.2
```

I have to find the duplicates with above tolerance table and need to map the duplicates as sets below

Expected Output:

```   Name     A        B   C     Set
0   aa  0.002667    2.5 13.5    1
1   bb  0.003400    2.5 13.7    1
2   cc  0.003600    1.0 13.6    2
3   dd  0.003667    1.0 13.6    2
4   aa  0.003667    1.0 13.6    2
5   bb  0.007600    1.0 13.6    3
6   cc  0.007000    1.0 13.6    3
7   dd  0.007000    1.0 13.6    3
```

Here is a way that is relatively fast, and can be adapted for other proximity query types (for example, finding sets of points that are within Euclidean distance of each other). It treats proximity in a transitive way: if `a` is within tolerance of `b`, and `b` is within tolerance of `c`, then all of `a`, `b`, `c` are assigned to the same `set_id`, regardless of whether `a` is within tolerance of `c`. This is equivalent to single-linkage clustering, but done without having to compute an `O[n^2]` distance matrix.

It uses two important tools:

1. `scipy.spatial.KDTree` to speed up finding neighbors within a given distance.

2. `networkx` to find isolated subgraphs among the neighbors.

Note about the p-norm: our understanding of the question is to flag pairs of points that are close to each other in all of the dimensions. If instead, the goal is to find neighbors that are within the tolerance in any of the dimensions, then use `p=1` instead. For points that are within the ellipsoid with axes `tol` of each other (i.e. spheres in the scaled problem), then use `p=2`.

Note on speed: this is efficient if the total number of neighbors (number of pairs of point within tolerance of each other) is small. In the extreme case where all points are close to each other, then the method presented here is `O[n^2]`, and other methods (e.g. boxing) will be faster.

## Solution

```import networkx as nx
from scipy.spatial import KDTree

def group_neighbors(df, tol, p=np.inf, show=False):
r = np.linalg.norm(np.ones(len(tol)), p)
kd = KDTree(df[tol.index] / tol)
g = nx.Graph([
(i, j)
for i, neighbors in enumerate(kd.query_ball_tree(kd, r=r, p=p))
for j in neighbors
])
if show:
nx.draw_networkx(g)
ix, id_ = np.array([
(j, i)
for i, s in enumerate(nx.connected_components(g))
for j in s
]).T
id_[ix] = id_.copy()
return df.assign(set_id=id_)
```

## Example 1 (OP’s description)

```df = pd.DataFrame({
'Name': ['aa', 'bb', 'cc', 'dd', 'aa', 'bb', 'cc', 'dd'],
'A': [0.002667, 0.0034, 0.0036, 0.003667, 0.003667, 0.0076, 0.007, 0.007],
'B': [2.5, 2.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
'C': [13.5, 13.7, 13.6, 13.6, 13.6, 13.6, 13.6, 13.6]},
)
tol = pd.Series([0.003, 0.2, 0.2], index=list('ABC'))

>>> group_neighbors(df, tol)
Name         A    B     C  set_id
0   aa  0.002667  2.5  13.5       0
1   bb  0.003400  2.5  13.7       0
2   cc  0.003600  1.0  13.6       1
3   dd  0.003667  1.0  13.6       1
4   aa  0.003667  1.0  13.6       1
5   bb  0.007600  1.0  13.6       2
6   cc  0.007000  1.0  13.6       2
7   dd  0.007000  1.0  13.6       2
```

Bonus: show the neighbors graph:

```_ = group_neighbors(df, tol, show=True)
``` ## Example 2: long list of neighbors

In this example, we replace `A` by a monotonic sequence `[0, 0.1, 0.2, ...]` such that each pair of consecutive points has distance `0.1`. We also give a tolerance of `A=0.12`:

```>>> group_neighbors(
...     df.assign(A=np.arange(0, df.shape) * 0.1),
...     tol=pd.Series([0.12], index=['A']), show=True)

Name    A    B     C  set_id
0   aa  0.0  2.5  13.5       0
1   bb  0.1  2.5  13.7       0
2   cc  0.2  1.0  13.6       0
3   dd  0.3  1.0  13.6       0
4   aa  0.4  1.0  13.6       0
5   bb  0.5  1.0  13.6       0
6   cc  0.6  1.0  13.6       0
7   dd  0.7  1.0  13.6       0
``` ## Example 3: more neighbors, no isolated subgraph

```>>> group_neighbors(
...     df.assign(A=np.arange(0, df.shape) * 0.1),
...     tol=pd.Series([0.21], index=['A']), show=True)

Name    A    B     C  set_id
0   aa  0.0  2.5  13.5       0
1   bb  0.1  2.5  13.7       0
2   cc  0.2  1.0  13.6       0
3   dd  0.3  1.0  13.6       0
4   aa  0.4  1.0  13.6       0
5   bb  0.5  1.0  13.6       0
6   cc  0.6  1.0  13.6       0
7   dd  0.7  1.0  13.6       0
``` ## Explanation

Here are the various steps that the algorithm takes:

1. scale all coordinates so that the tolerance becomes 1 in each dimension;
2. make a KDTree of these transformed points;
3. in one shot, query all pairs of points within distance `r=1` of each other; note: we use p-norm Infinite so the regions are hypercubes; this corresponds to finding all points within the `tol` bounding box of each other;
4. make a graph where all edges denote points that are neighbors;
5. find all isolated subgraphs: those are the sets we want to assign each member to;
6. label the sets by a unique `int` (from `enumerate()`).

## Worked example

Let’s examine step by step what happens on the OP’s example.

First, select the relevant dimensions and scale for unit tolerance:

```>>> df[tol.index] / tol

A     B     C
0  0.889000  12.5  67.5
1  1.133333  12.5  68.5
2  1.200000   5.0  68.0
3  1.222333   5.0  68.0
4  1.222333   5.0  68.0
5  2.533333   5.0  68.0
6  2.333333   5.0  68.0
7  2.333333   5.0  68.0
```

After this scaling, the task becomes finding any pair of points whose difference in any dimension is at most 1.

Use `KDTree` to have a fast way of finding neighbors. Note: we use `kd.query_ball_tree` instead of `kd.query_pairs` because we want to keep singletons as well (e.g.: points that are only neighbors with themselves), so that they can get their own `set_id` in the final output:

```kd = KDTree(df[tol.index] / tol)
>>> kd.query_ball_tree(kd, r=1, p=np.inf)

[[0, 1],
[0, 1],
[2, 3, 4],
[2, 3, 4],
[2, 3, 4],
[5, 6, 7],
[5, 6, 7],
[5, 6, 7]]
```

The graph is then constructed from all these pairs.

We use `connected_components` to obtain all subgraphs of `g` that are isolated from each other:

```>>> list(nx.connected_components(g))

[{0, 1}, {2, 3, 4}, {5, 6, 7}]
```

So, we have three sets (isolated subgraphs). We can then assign an ID to each, and return the result.