Unique identifier for a NumPy array?

I have a 2D array A that a method of one of the classes in my code uses, and later I need to check whether an array passed to a different method has the same values.

The obvious fix is to save A as a class attribute, but A can get quite large, so I would rather not keep it around and risk memory issues.

What I’d like to do is save some kind of unique identifier for this array and check against that. My first thought was to use id(A), but that identifies the object, not its contents, so if I have some B = A.copy(), it would have a different id.

Another thought was to save some sparse version of A, e.g., sample some number of random indices and check equivalence, but that seems a lot more messy and in-depth than I need for something like this.

Anyone have any suggestions?

Answer

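Use a hash function, e.g. SHA-256 from the hashlib module. The example further down produces a hash-based ID: the array_id() function returns a fixed-length hexadecimal string (64 characters for SHA-256, 16 characters for the 64-bit xxHash variant used there). Arrays with the same contents produce the same ID, while changing even a small portion of the array gives a completely different ID. Strictly speaking, no hash is guaranteed unique, but with SHA-256 an accidental collision is practically impossible; a 64-bit hash is faster but offers a weaker guarantee.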
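To connect this back to the question, here is a minimal sketch of storing a digest instead of the array itself. The class name `Model` and the methods `fit` and `matches_fit_data` are hypothetical and only illustrate the pattern; it assumes SHA-256 over the raw bytes is an acceptable fingerprint.

```python
import hashlib
import numpy as np


class Model:
    """Hypothetical class: remembers a digest of A, not A itself."""

    def __init__(self):
        self._fit_digest = None

    @staticmethod
    def _digest(a):
        # Fingerprint of the array's raw bytes (dtype and shape are ignored here).
        return hashlib.sha256(a.tobytes()).hexdigest()

    def fit(self, A):
        # Keep only the 64-character digest; A can be garbage-collected later.
        self._fit_digest = self._digest(A)

    def matches_fit_data(self, B):
        # True when B has the same bytes as the array passed to fit().
        return self._digest(B) == self._fit_digest


A = np.arange(6).reshape(2, 3)
model = Model()
model.fit(A)

B = A.copy()          # different object, same contents
C = A.copy()
C[0, 0] += 1          # one element changed

print(model.matches_fit_data(B))
print(model.matches_fit_data(C))
```

The stored state is a short string regardless of how large A is, which is the point of the approach.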

Note that arrays of different types MAY produce different results. For example, two integer arrays with the same values, one of type np.int32 and the other np.int64, give different IDs, because their byte representations differ. In that case, convert both to one common type first, e.g. res_id = array_id(a.astype(np.int64)). But different types do not always mean different hash IDs: if all integers are non-negative and less than 2^31, then np.int32 and np.uint32 arrays have identical bytes and therefore the same hash.
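The dtype behaviour described above can be checked with a short sketch. The helper name `sha256_of` is made up for this illustration, and it hashes only the raw bytes (no dtype, no shape).

```python
import hashlib
import numpy as np


def sha256_of(a):
    # Hash only the raw bytes of the array.
    return hashlib.sha256(a.tobytes()).hexdigest()


values = [1, 2, 3]
h_i32 = sha256_of(np.array(values, dtype=np.int32))
h_i64 = sha256_of(np.array(values, dtype=np.int64))
h_u32 = sha256_of(np.array(values, dtype=np.uint32))

# Same values, different element width: 4-byte vs 8-byte items give different bytes.
print(h_i32 == h_i64)

# Small non-negative values have identical bytes in int32 and uint32.
print(h_i32 == h_u32)
```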

So if you want hash IDs to be the same for numbers of the same value, always convert the array to a common type, as in array_id(a.astype(common_type)), where common_type may be e.g. np.int64 for all integer types and np.float64 for all floating-point types. If, on the contrary, you want different types to always produce different results, include the type name in the hash, like hashlib.sha256(str(a.dtype).encode('ascii') + a.tobytes()).hexdigest().upper().
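Both policies can be sketched side by side. The helper names `id_common` and `id_typed` are hypothetical; they simply wrap the two expressions from the paragraph above.

```python
import hashlib
import numpy as np


def id_common(a, common_type):
    # Policy 1: normalise the dtype first, so equal values give equal IDs.
    return hashlib.sha256(a.astype(common_type).tobytes()).hexdigest().upper()


def id_typed(a):
    # Policy 2: prefix the dtype name, so different dtypes give different IDs.
    return hashlib.sha256(str(a.dtype).encode('ascii') + a.tobytes()).hexdigest().upper()


a_i32 = np.array([1, 2, 3], dtype=np.int32)
a_i64 = np.array([1, 2, 3], dtype=np.int64)
a_u32 = np.array([1, 2, 3], dtype=np.uint32)

# Equal values match once both arrays are converted to np.int64.
same_after_cast = id_common(a_i32, np.int64) == id_common(a_i64, np.int64)

# int32 and uint32 share the same bytes here, but the dtype prefix separates them.
same_with_dtype = id_typed(a_i32) == id_typed(a_u32)

print(same_after_cast, same_with_dtype)
```

Note that astype() makes a converted copy whenever the dtype actually changes, so the first policy costs extra memory on large arrays.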

In the code below, passing include_dtype = True includes the data type in the ID computation, and include_shape = True includes the shape as well. The algo argument (either 'sha256' or 'xxhash', defaulting to 'xxhash') chooses which hash algorithm to use.

The code needs its dependencies installed once with the command python -m pip install numpy xxhash.


# Needs: python -m pip install numpy xxhash
def array_id(a, *, include_dtype = False, include_shape = False, algo = 'xxhash'):
    data = bytes()
    if include_dtype:
        data += str(a.dtype).encode('ascii')
    data += b','
    if include_shape:
        data += str(a.shape).encode('ascii')
    data += b','
    data += a.tobytes()
    if algo == 'sha256':
        import hashlib
        return hashlib.sha256(data).hexdigest().upper()
    elif algo == 'xxhash':
        import xxhash
        return xxhash.xxh3_64(data).hexdigest().upper()
    else:
        # Raise instead of assert: assertions are skipped under python -O
        raise ValueError(f'unknown algo: {algo!r}')

# Test
import numpy as np, timeit
a = np.array([[1,2,3],[4,5,6]])
print(array_id(a))
print(array_id(a, include_shape = True))
print(array_id(a, include_shape = True, include_dtype = True))

# Speed Measure
a = np.ones((10000, 10000,), dtype = np.uint32)
for algo in ['sha256', 'xxhash']:
    print(algo, round(timeit.timeit(lambda: array_id(a, algo = algo), number = 1), 3), 'sec')

Output:

17A96F5E5826D66A
E201378DF28CB280
0FDFAE47334C986A
sha256 3.774 sec
xxhash 1.356 sec
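One caveat, given that memory was the original concern: array_id() builds a single bytes object from a.tobytes(), so it temporarily holds a full copy of the array. A possible variant, sketched below under the assumption that SHA-256 is used, feeds the hasher directly from the array's buffer. The name `array_id_streaming` is hypothetical; for a C-contiguous array no copy of the data is made, while other layouts are still copied once by np.ascontiguousarray.

```python
import hashlib
import numpy as np


def array_id_streaming(a, *, include_dtype=False, include_shape=False):
    # Same byte sequence as the sha256 branch of array_id(), hashed piece by piece.
    hasher = hashlib.sha256()
    if include_dtype:
        hasher.update(str(a.dtype).encode('ascii'))
    hasher.update(b',')
    if include_shape:
        hasher.update(str(a.shape).encode('ascii'))
    hasher.update(b',')
    # View the data as a flat run of bytes; this is a view, not a copy,
    # when the array is already C-contiguous.
    flat_bytes = np.ascontiguousarray(a).reshape(-1).view(np.uint8)
    hasher.update(flat_bytes)
    return hasher.hexdigest().upper()


a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Reference value computed the same way array_id(a, algo='sha256') builds its input.
reference = hashlib.sha256(b',,' + a.tobytes()).hexdigest().upper()
print(array_id_streaming(a) == reference)
```

Because the bytes fed to the hasher are the same, the resulting ID should match what the sha256 branch of array_id() returns for the same flags.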
