# Count element-wise matches per row between two 2d numpy arrays of different lengths

I’ve got the following piece of code which is working fine:

```import numpy as np

arr1 = np.array([(1, 2, 3, 4), (2, 3, 4, 6), (4, 6, 7, 9), (4, 7, 8, 9), (5, 7, 8, 9)])
arr2 = np.array([(1, 3, 4, 5), (2, 3, 4, 5), (3, 4, 6, 7)])

m = []
for c in range(len(arr1)):
s = np.sum(np.any(arr1[c] == arr2[..., None], axis=1), axis=1)
m = np.append(m, np.amax(s, axis=0))
```

The above yields the desired result as shown below (that is, for each row in arr1, find the max number of matches to any of the rows in arr2, and add that number to the output array):

```print(m)
```

[3. 3. 3. 2. 1.]

The problem I’m facing is that when arr1 grows to become millions of rows, even though arr2 will only be a few thousands rows, the computation still becomes very slow.

I’ve tried to do the comparison using the following piece of code (and many other tries that yielded an even worse result), but this does not work:

```c = (arr1[:, None, ...] == arr2[None, ...])
```

The above code yields an outcome that is very close to what I’m expecting to see as the stage 1 output before doing the counting (as shown below), but the element-wise comparisons are not done correctly:

```print(c)
```

[[[ True False False False] [False False False False] [False False False False]]

[[False True True False] [ True True True False] [False False False False]]

[[False False False False] [False False False False] [False True True False]]

[[False False False False] [False False False False] [False False False False]]

[[False False False False] [False False False False] [False False False False]]]

Any suggestions on how to optimize the first snippet of code (such that it will not use a for loop) to work with arr1 having +/- 2 000 000 rows, and arr2 having +/- 7 000 rows?

I’m going to jump straight into timing here:

```np.random.seed(0xFFFF)
arr1 = np.random.randint(10, size=(20000, 5))
arr2 = np.random.randint(10, size=(70, 5))

def original(a1, a2):
m = []
for c in range(len(a1)):
s = np.sum(np.any(a1[c] == a2[..., None], axis=1), axis=1)
m = np.append(m, np.amax(s, axis=0))
return m
```

I truncated the input sizes by a factor of 100 from your target to be able to get reasonable numbers on my weak laptop. You can extrapolate as necessary.

```%timeit original(arr1, arr2)
1.11 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Compare this to @mozway’s solution using sets:

```def using_sets(a1, a2):
l1 = [frozenset(i) for i in a1]
l2 = [frozenset(i) for i in a2]
return np.array([max(len(s1.intersection(s2)) for s2 in l2) for s1 in l1])
```
```534 ms ± 6.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

First thing’s first: numpy array allocations are not amortized, unlike say `list`. That means that you shouldn’t use `np.append` in a loop (and starting with a list doesn’t help):

```def list_append(a1, a2):
m = []
for c in range(len(a1)):
s = np.sum(np.any(a1[c] == a2[..., None], axis=1), axis=1)
m.append(np.amax(s, axis=0))
return np.array(m)
```

This is a huge change for even a small number of elements. In practice, the benefit is reducing O(n^2) to O(n) as the output size increases:

```%timeit list_append(arr1, arr2)
856 ms ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

But wait! You know the size of the output to begin with. It’s much better to pre-allocate the whole thing, and just store the result to the correct index:

```def preallocate(a1, a2):
m = np.empty(a1.shape, int)
for c in range(len(a1)):
s = np.sum(np.any(a1[c] == a2[..., None], axis=1), axis=1)
m[c] = np.amax(s, axis=0)
return m
```

The gains are much less significant this time, because the arrays in the example are not huge, and `list` already amortized append a lot to begin with:

```%timeit preallocate(arr1, arr2)
842 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Now to address the meat of the problem. For one thing, you can reduce the memory footprint and speed by using `np.isin` instead of computing the huge boolean mask:

```def using_isin(a1, a2):
m = np.empty(a1.shape, int)
for c in range(len(a1)):
s = np.sum(np.isin(a2, a1[c]), axis=1)
m[c] = np.amax(s, axis=0)
return m
```

This is starting to approach the set-based approach:

```%timeit using_isin(arr1, arr2)
582 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

If the arrays have sorted rows, you can avoid the linear search entirely:

```def using_sorted(a1, a2):
a1 = np.sort(a1, axis=1)
m = np.empty(a1.shape, int)
for c in range(len(a1)):
m[c] = (a1[c, np.clip(np.searchsorted(a1[c], a2), 0, a2.shape - 1)] == a2).sum(1).max()
return m
```

The overhead of a binary search actually makes this worse compared to the linear search implemented in `np.isin`:

```%timeit using_sorted(arr1, arr2)
695 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Of course the fastest solution is going to be to use numba:

```@njit
def using_numba(a1, a2):
m = np.empty(a1.shape, np.int32)
s = np.empty(a2.shape, a1.dtype)
for i in range(a1.shape):
for j in range(a2.shape):
s[j] = 0
for k in range(a1.shape):
s[j] += a1[i, k] in a2[j]
m[i] = s.max()
return m
```

Not surprisingly, this is ~7x faster than the set-based approach:

```%timeit using_numba(arr1, arr2)
73 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

As an aside, let’s take a look at what you’re trying to do with a boolean mask. For each element of arr1, you want to check if it equals any element of arr2 (shape: `(2M, 5, 7k, 5)`). If the element from `arr1` appears somewhere in a row of `arr2`, you count that, so you apply `any` to the dimension corresponding to the rows of `arr2` (shape: `(2M, 5, 7k)`). Then you want to know the total number of elements matching in a given row of `arr1`, so you sum across that dimension (shape: `(2M, 7k)`). Finally, you want the maximum of that sum across the rows of `arr2` (shape: `(2M)`).

```(arr1[..., None, None] == arr2).any(axis=3).sum(axis=1).max(axis=1)
```

Notice the sizes of the temporary arrays you need to make along the way. At least two of them must exist at a time, so this method is unlikely to work for the desired inputs.