# Difference of items from an array with the same array in numpy

I have an array `a` of numbers.

```python
rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)
# array([-11.36680112, -15.83791699, -39.86827287, -33.81273354,
#        -19.55547753])
```

I want to find the difference between each element of the array and every element in the same array. The expected output would be:

```python
[array([  0.        ,   4.47111586,  28.50147174,  22.44593241,   8.18867641]),
 array([ -4.47111586,   0.        ,  24.03035588,  17.97481655,   3.71756054]),
 array([-28.50147174, -24.03035588,   0.        ,  -6.05553933, -20.31279534]),
 array([-22.44593241, -17.97481655,   6.05553933,   0.        , -14.25725601]),
 array([ -8.18867641,  -3.71756054,  20.31279534,  14.25725601,   0.        ])]
```

My first approach was a list comprehension, `[i - a for i in a]`. However, my original array `a` is quite large, and I have thousands of such arrays that need the same operation, so the overall process becomes slow and memory-hungry to the point where the Jupyter kernel dies.

Is there any way I could speed this up?

There are two ways you can do this, one using only NumPy vectorization:

1. Memory-inefficient, but in this case NumPy is fast. This should still be the first method you try if the array is small:

```python
a[:, None] - a
```
2. Using Numba + NumPy. Numba compiles through LLVM, so it can work magic on speed, and you can multiply the speed-up further with the `parallel=True` option. For very large arrays this (or C++) should be the go-to.

For an array of size 40,000, this finishes in about 3 s without parallelism and 0.6 s with parallelism on my 12-core machine.

```python
import numpy as np
import numba as nb

rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)

# return type: nb.float64[:, :]
# input argument type: nb.float64[:]
# Specifying these lets Numba compile eagerly instead of lazily.
# You can also add parallel=True, cache=True,
# and nogil=True if you are using Python threading.
# Numba emits SIMD-vectorized code, which means it should not lose
# to NumPy on performance grounds if coded properly.
@nb.njit(nb.float64[:, :](nb.float64[:]))
def speed(a):
    # np.empty avoids an unnecessary zero-initialization
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)

    # nb.prange tells Numba this loop can be parallelized
    # (it only actually runs in parallel with parallel=True)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i, j] = a[i] - a[j]
    return b

speed(a)
```

Performance

```python
import numpy as np
import numba as nb
import sys
import time

@nb.njit(nb.float64[:, :](nb.float64[:]))
def f1(a):
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i, j] = a[i] - a[j]
    return b

@nb.njit(nb.float64[:, :](nb.float64[:]), parallel=True, cache=True)
def f2(a):
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i, j] = a[i] - a[j]
    return b

def f3(a):
    return a[:, None] - a

if __name__ == '__main__':
    s0 = time.time()
    rnd = np.random.default_rng(12345)
    a = rnd.uniform(0, -50, int(sys.argv[2]))
    b = eval(sys.argv[1] + '(a)')
    print(time.time() - s0)
```
```shell
(base) xxx:~$ python test.py f1 40000
3.0324509143829346
(base) xxx:~$ python test.py f2 40000
0.6196465492248535
(base) xxx:~$ python test.py f3 40000
2.4126882553100586
```

I faced similar restrictions and needed something fast. I got around a 50x speed-up without parallelism just by resolving the memory-usage issue and using Numba; see *Why are np.hypot and np.subtract.outer very fast?*