I have an array `a` of numbers:

```python
import numpy as np

rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)
# array([-11.36680112, -15.83791699, -39.86827287, -33.81273354,
#        -19.55547753])
```

I want to find the difference between each element of the array and every other element in the same array. The expected output would be:

```python
[array([  0.        ,   4.47111586,  28.50147174,  22.44593241,   8.18867641]),
 array([ -4.47111586,   0.        ,  24.03035588,  17.97481655,   3.71756054]),
 array([-28.50147174, -24.03035588,   0.        ,  -6.05553933, -20.31279534]),
 array([-22.44593241, -17.97481655,   6.05553933,   0.        , -14.25725601]),
 array([ -8.18867641,  -3.71756054,  20.31279534,  14.25725601,   0.        ])]
```

My first approach was to use a list comprehension, `[i - a for i in a]`. However, since my original array `a` is quite large, and I have thousands of such `a`s on which I need to do the same operation, the overall process becomes slow and memory hungry, to the point where the Jupyter kernel dies.

Is there any way I could speed this up?

## Answer

There are two ways you can do this: pure NumPy vectorization, or Numba + NumPy.

- Pure NumPy broadcasting is memory inefficient, but NumPy is fast here. It should still be the first method you try if the array is small enough (a sanity check against the list comprehension follows this list):

```python
a[:, None] - a
```

- Numba + NumPy: Numba has LLVM optimization, so it can work magic on speed, and you can multiply that speedup further with the `parallel=True` option. For very large arrays this should be the go-to (or C++).
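As a quick sanity check (a minimal sketch reusing the 5-element `a` from the question), the broadcasting one-liner produces exactly the rows of the original list comprehension:

```python
import numpy as np

rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)

diffs = a[:, None] - a  # (5, 5) matrix: diffs[i, j] == a[i] - a[j]
assert all(np.allclose(row, x - a) for row, x in zip(diffs, a))
```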

For an array of size 40000, this finishes in 3 s without parallelism and in 0.6 s on my 12-core machine with parallelism:

```python
import numpy as np
import numba as nb

rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)

# return type: nb.float64[:, :]
# input argument type: nb.float64[:]
# By specifying these you get eager compilation instead of lazy.
# You can also add parallel=True, cache=True;
# if you are using Python threading, then nogil=True.
# You can do lots of stuff.
# Numba has SIMD vectorization, which just means it should not lose
# to NumPy on performance grounds if coded properly.
@nb.njit(nb.float64[:, :](nb.float64[:]))
def speed(a):
    # np.empty to prevent unnecessary initialization
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)
    # nb.prange tells numba this loop can be parallelized
    # (it only actually runs in parallel with parallel=True)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i][j] = a[i] - a[j]
    return b

speed(a)
```
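Since the question mentions thousands of such arrays, here is a minimal sketch of how the eagerly compiled `speed` amortizes over a batch; the `arrays` list, its sizes, and the row-sum aggregation are hypothetical stand-ins for your own data and reduction:

```python
import numpy as np

rnd = np.random.default_rng(0)

# Hypothetical batch: thousands of independent 1-D arrays.
arrays = [rnd.uniform(0, -50, 1_000) for _ in range(2_000)]

# speed() was compiled once, eagerly, at decoration time, so each
# call below pays only execution cost. Consuming each N x N result
# immediately (here: summing rows) keeps only one matrix alive at a
# time, which avoids the memory blow-up that kills the kernel.
row_sums = [speed(x).sum(axis=1) for x in arrays]
```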

**Performance**

```python
import numpy as np
import numba as nb
import sys
import time

@nb.njit(nb.float64[:, :](nb.float64[:]))
def f1(a):
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i][j] = a[i] - a[j]
    return b

@nb.njit(nb.float64[:, :](nb.float64[:]), parallel=True, cache=True)
def f2(a):
    b = np.empty((a.shape[0], a.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(a.shape[0]):
            b[i][j] = a[i] - a[j]
    return b

def f3(a):
    return a[:, None] - a

if __name__ == '__main__':
    s0 = time.time()
    rnd = np.random.default_rng(12345)
    a = rnd.uniform(0, -50, int(sys.argv[2]))
    b = eval(sys.argv[1] + '(a)')
    print(time.time() - s0)
```

```
(base) xxx:~$ python test.py f1 40000
3.0324509143829346
(base) xxx:~$ python test.py f2 40000
0.6196465492248535
(base) xxx:~$ python test.py f3 40000
2.4126882553100586
```

I faced similar restrictions and needed something fast. I got around a 50x speedup without parallelism just by fixing memory usage and using Numba; see *Why are np.hypot and np.subtract.outer very fast?*
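For reference, the broadcasting one-liner is the same operation as `np.subtract.outer`, which that question benchmarks; a minimal check:

```python
import numpy as np

rnd = np.random.default_rng(12345)
a = rnd.uniform(0, -50, 5)

# np.subtract.outer(a, a)[i, j] == a[i] - a[j],
# the same result as a[:, None] - a
assert np.allclose(np.subtract.outer(a, a), a[:, None] - a)
```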