I’m in a situation where I have some 2D array A that a method of one of the classes in my code uses, and then later I need to check whether an array passed to a different method has the same values.
The obvious fix is to save A as a class attribute, but because A can get quite large, I’d rather not store it as an attribute, to avoid memory issues.
What I’d like to do is save some kind of unique identifier for this array and check that. My first thought was to use id(A), but that identifies the object, not the array’s contents, so if I have some B = A.copy(), it would have a different id.
Another thought was to save some sparse version of A, e.g., sample some number of random indices and check equivalence, but that seems a lot more messy and in-depth than I need for something like this.
Anyone have any suggestions?
Use a hash function, e.g. SHA-256 through the hashlib module. An example of producing a hash-based ID is further down.
The array_id() function returns a fixed-length string that is, for practical purposes, unique to the array’s contents (64 hex characters for SHA-256). Arrays with the same contents produce the same ID, while changing even a small portion gives a completely different ID.
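To make the idea concrete before the full version, here is a minimal sketch using only SHA-256. The helper name simple_array_id is made up for this illustration and is not the full array_id() function; it assumes the array fits comfortably in memory, since tobytes() makes a copy of the data.

```python
import hashlib
import numpy as np

def simple_array_id(a):
    """Return the uppercase SHA-256 hex digest of the array's raw bytes."""
    return hashlib.sha256(a.tobytes()).hexdigest().upper()

a = np.array([[1, 2, 3], [4, 5, 6]])
b = a.copy()      # different object, same contents
c = a.copy()
c[0, 0] = 99      # a single element changed

print(len(simple_array_id(a)))                   # 64
print(id(a) == id(b))                            # False
print(simple_array_id(a) == simple_array_id(b))  # True
print(simple_array_id(a) == simple_array_id(c))  # False
```

Unlike id(), the digest depends only on the bytes of the array, so a copy gets the same ID.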
Note that different array types MAY produce different results. For example, if two integer arrays hold the same values but one has type np.int32 and the other np.int64, you’ll get different IDs; in that case, convert the arrays to one common type, e.g. res_id = array_id(a.astype(np.int64)). But different types don’t always give different hash IDs: if all integers are non-negative and less than 2^31, then np.int32 and np.uint32 will both give the same hash, because the underlying bytes are identical.
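The dtype behaviour described above can be checked with a short sketch. The helper bytes_id is a made-up name for this illustration; it hashes only the raw bytes, with no dtype or shape information.

```python
import hashlib
import numpy as np

def bytes_id(a):
    """Hash only the raw bytes of the array (no dtype, no shape)."""
    return hashlib.sha256(a.tobytes()).hexdigest().upper()

values = [1, 2, 3, 2**31 - 1]   # non-negative and below 2^31
a32 = np.array(values, dtype=np.int32)
a64 = np.array(values, dtype=np.int64)
u32 = np.array(values, dtype=np.uint32)

# 4 bytes per element vs 8 bytes per element: different byte strings.
print(bytes_id(a32) == bytes_id(a64))                    # False
# Same width and same bit patterns: identical byte strings.
print(bytes_id(a32) == bytes_id(u32))                    # True
# Converting to a common type makes the IDs agree.
print(bytes_id(a32.astype(np.int64)) == bytes_id(a64))   # True
```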
So if you want hash IDs to be the same for numbers of the same value, always convert the array to some common type, like a.astype(common_type), where common_type may be e.g. np.int64 for all integer types and np.float64 for all floating-point types. Conversely, if you want different types to always produce different results, include the type name in the hash, like hashlib.sha256(str(a.dtype).encode('ascii') + a.tobytes()).hexdigest().upper().
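A small sketch of the two options side by side. The helper names id_common_type and id_with_dtype are hypothetical, chosen only for this example.

```python
import hashlib
import numpy as np

def id_common_type(a, common_type):
    """Convert to a common dtype first, so equal values give equal IDs."""
    return hashlib.sha256(a.astype(common_type).tobytes()).hexdigest().upper()

def id_with_dtype(a):
    """Prefix the dtype name, so different dtypes give different IDs."""
    prefix = str(a.dtype).encode('ascii')
    return hashlib.sha256(prefix + a.tobytes()).hexdigest().upper()

a32 = np.array([1, 2, 3], dtype=np.int32)
u32 = np.array([1, 2, 3], dtype=np.uint32)

# Same values, common type: IDs match.
print(id_common_type(a32, np.int64) == id_common_type(u32, np.int64))  # True
# Same raw bytes, but the dtype name is part of the hashed data: IDs differ.
print(id_with_dtype(a32) == id_with_dtype(u32))                        # False
```

Which one to pick depends on whether "same array" should mean same values or same values and same type.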
In the full code below, if you pass the flag include_dtype = True then the data type is included in the ID computation. If include_shape = True, the shape is included as well. The algo argument (either sha256 or xxhash) chooses which hash algorithm to use.
The code needs some modules installed once, through the command python -m pip install numpy xxhash.
# Needs: python -m pip install numpy xxhash

def array_id(a, *, include_dtype = False, include_shape = False, algo = 'xxhash'):
    # Build the byte string to hash: optional dtype, optional shape, then raw data.
    data = bytes()
    if include_dtype:
        data += str(a.dtype).encode('ascii')
    data += b','
    if include_shape:
        data += str(a.shape).encode('ascii')
    data += b','
    data += a.tobytes()
    if algo == 'sha256':
        import hashlib
        return hashlib.sha256(data).hexdigest().upper()
    elif algo == 'xxhash':
        import xxhash
        return xxhash.xxh3_64(data).hexdigest().upper()
    else:
        # Raise instead of assert, so the check survives python -O.
        raise ValueError(f'Unknown algo: {algo}')

# Test
import numpy as np, timeit
a = np.array([[1,2,3],[4,5,6]])
print(array_id(a))
print(array_id(a, include_shape = True))
print(array_id(a, include_shape = True, include_dtype = True))

# Speed Measure
a = np.ones((10000, 10000,), dtype = np.uint32)
for algo in ['sha256', 'xxhash']:
    print(algo, round(timeit.timeit(lambda: array_id(a, algo = algo), number = 1), 3), 'sec')
17A96F5E5826D66A
E201378DF28CB280
0FDFAE47334C986A
sha256 3.774 sec
xxhash 1.356 sec
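Tying this back to the question, a rough sketch of how the ID might be stored on a class instead of the array itself. The class ArrayChecker and its methods remember and is_same are invented for this example; it uses SHA-256 from the standard library and always includes dtype and shape, which is an assumption about what "same array" should mean here.

```python
import hashlib
import numpy as np

class ArrayChecker:
    """Hypothetical class that keeps a short fingerprint, not the array."""

    def __init__(self):
        self._saved_id = None

    @staticmethod
    def _fingerprint(a):
        h = hashlib.sha256()
        # dtype and shape go first, so reshaped or retyped arrays differ.
        h.update(f'{a.dtype},{a.shape},'.encode('ascii'))
        h.update(a.tobytes())
        return h.hexdigest().upper()

    def remember(self, a):
        # Only the 64-character string is stored as an attribute.
        self._saved_id = self._fingerprint(a)

    def is_same(self, a):
        return self._saved_id == self._fingerprint(a)

A = np.arange(6, dtype=np.int64).reshape(2, 3)
checker = ArrayChecker()
checker.remember(A)

B = A.copy()
print(checker.is_same(B))                # True: same contents
B[1, 2] = -1
print(checker.is_same(B))                # False: one value changed
print(checker.is_same(A.reshape(3, 2)))  # False: same bytes, other shape
```

Note that tobytes() still makes a temporary copy of the data while hashing, but that copy is released afterwards, so only the short ID stays in memory.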