I have a 3D numpy array, arr, with shape m*n*k.
For every set of values along the m axis (e.g. arr[:, 0, 0]) I want to generate a single value to represent this set, so that I may end up with a 2D matrix, n*k. If a set of values along the m axis is repeated, then we should generate the same value each time.
I.e. it is a hashing problem.
I created a solution to the problem using a dictionary, but it drastically reduces performance. For each set of values, I call this function:
def getCellId(self, valueSet):
    # Turn the set of values (a numpy vector) into a tuple so it can be hashed
    key = tuple(valueSet)
    # Try to simply return an existing ID for this key
    try:
        return self.attributeDict[key]
    except KeyError:
        # If the key is new (didn't exist), generate a new ID by adding one to
        # the max of all current IDs. This fails the very first time (there are
        # no IDs yet), so in that case just assign the value 1 to the new ID
        try:
            newId = max(self.attributeDict.values()) + 1
        except ValueError:
            newId = 1
        self.attributeDict[key] = newId
        return newId
The array itself is typically of the size 30*256*256, so a single set of values will have 30 values. I have hundreds of these arrays to process at any one time. Currently, doing all processing that needs to be done up to calculating the hash takes 1.3s for a block of 100 arrays. Including the hashing bumps that up to 75s.
Is there a faster way to generate the single representative value?
3 Answers
#1
1
This could be one approach using basic numpy functions -
import numpy as np
# Random input for demo
arr = np.random.randint(0,3,[2,5,4])
# Get dimensions for later usage
m,n,k = arr.shape
# Reshape arr to a 2D array that has each slice arr[:, n, k] in each row
arr2d = np.transpose(arr,(1,2,0)).reshape([-1,m])
# Perform lexsort & get corresponding indices and sorted array
sorted_idx = np.lexsort(arr2d.T)
sorted_arr2d = arr2d[sorted_idx,:]
# Differentiation along rows for sorted array
df1 = np.diff(sorted_arr2d,axis=0)
# Look for changes along df1 that represent new labels to be put there
df2 = np.append([False],np.any(df1!=0,1),0)
# Get unique labels
labels = df2.cumsum(0)
# Store those unique labels in a n x k shaped 2D array
pos_labels = np.zeros_like(labels)
pos_labels[sorted_idx] = labels
out = pos_labels.reshape([n,k])
Sample run -
In [216]: arr
Out[216]:
array([[[2, 1, 2, 1],
[1, 0, 2, 1],
[2, 0, 1, 1],
[0, 0, 1, 1],
[1, 0, 0, 2]],
[[2, 1, 2, 2],
[0, 0, 2, 1],
[2, 1, 0, 0],
[1, 0, 1, 0],
[0, 1, 1, 0]]])
In [217]: out
Out[217]:
array([[6, 4, 6, 5],
[1, 0, 6, 4],
[6, 3, 1, 1],
[3, 0, 4, 1],
[1, 3, 3, 2]], dtype=int32)
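As a side note, on NumPy 1.13+ the lexsort/diff/cumsum steps above can be collapsed into a single np.unique call with axis=0 and return_inverse=True; a minimal sketch (the label numbering may differ from the output above, but equal slices still share a label):

```python
import numpy as np

# Random input for demo, same shape as above
arr = np.random.randint(0, 3, [2, 5, 4])
m, n, k = arr.shape

# One row per (n, k) cell, holding that cell's m values
arr2d = np.transpose(arr, (1, 2, 0)).reshape(-1, m)

# return_inverse gives one integer label per row; identical rows get
# identical labels (the axis= argument requires NumPy >= 1.13)
_, labels = np.unique(arr2d, axis=0, return_inverse=True)
out = labels.reshape(n, k)
```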
#2
1
Depending on how many new vs. existing keys need to be generated, it's hard to say what will be optimal. But using your logic, the following should be fairly fast:
import collections
import hashlib

_key = 0

def _get_new_key():
    global _key
    _key += 1
    return _key

attributes = collections.defaultdict(_get_new_key)

def get_cell_id(series):
    # .tostring() is an alias of .tobytes() (and was removed in NumPy 2.0)
    return attributes[hashlib.md5(series.tostring()).digest()]
Edit:
I have now updated this to loop over all the data series from your question, using strides:
In [99]: import numpy as np
In [100]: A = np.random.random((30, 256, 256))
In [101]: A_strided = np.lib.stride_tricks.as_strided(A, (A.shape[1] * A.shape[2], A.shape[0]), (A.itemsize, A.itemsize * A.shape[1] * A.shape[2]))
In [102]: %timeit tuple(get_cell_id(S) for S in A_strided)
10 loops, best of 3: 169 ms per loop
The above does 256x256 lookups/assignments of 30-element arrays each. There is of course no guarantee that the md5 hash won't collide. If that should be an issue, you could of course change to other hashes in the same lib.
Edit 2:
Given that you seem to do the majority of costly operations on the first axis of your 3D array, I would suggest you reorganize your array:
In [254]: A2 = np.random.random((256, 256, 30))
In [255]: A2_strided = np.lib.stride_tricks.as_strided(A2, (A2.shape[0] * A2.shape[1], A2.shape[2]), (A2.itemsize * A2.shape[2], A2.itemsize))
In [256]: %timeit tuple(get_cell_id(S) for S in A2_strided)
10 loops, best of 3: 126 ms per loop
Not having to jump around long distances in memory makes for about a 25% speed-up.
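If copying the data once is acceptable, the same cache-friendly layout can be derived from the original (30, 256, 256) array rather than created separately; a sketch of that reorganization (variable names follow the listings above, A2_rows is illustrative):

```python
import numpy as np

A = np.random.random((30, 256, 256))

# Move the series axis last and force a contiguous copy, so each
# 30-value series sits in one contiguous run of memory
A2 = np.ascontiguousarray(np.moveaxis(A, 0, -1))

# A plain reshape then gives one series per row, no stride tricks needed
A2_rows = A2.reshape(-1, A2.shape[-1])
```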
Edit 3:
If there is no actual need to cache a hash-to-int look-up, and you just need the actual hashes, and if the 3D array is of int8 type, then given the A2 and A2_strided organization the time can be reduced some more. Of this, 15 ms is the tuple looping.
In [9]: from hashlib import md5
In [10]: %timeit tuple(md5(series.tostring()).digest() for series in A2_strided)
10 loops, best of 3: 72.2 ms per loop
#3
0
If it is just about hashing, try this:
import numpy as np

# create random data
a = np.random.randint(10, size=(5, 3, 3))
# create some identical 0-axis data
a[:, 0, 0] = np.arange(5)
a[:, 0, 1] = np.arange(5)
# create a matrix with the hash values
h = np.apply_along_axis(lambda x: hash(tuple(x)), 0, a)
h[0, 0] == h[0, 1]
# Output: True
However, use it with caution and test this code with your own code first. All I can say is that it works for this simple example.

In addition, it is possible for two different value sets to end up with the same hash value. This can always happen with a hash function, but it is very unlikely.
Edit: in order to compare with the other solutions:
timeit(np.apply_along_axis(lambda x: hash(tuple(x)),0,a))
# output: 1 loops, best of 3: 677 ms per loop
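For comparison, a plain dictionary keyed on the raw bytes of each series avoids both the tuple() call and any collision risk, since equal bytes imply equal values; a minimal sketch (labels and get_label are illustrative names, not from the answers above):

```python
import numpy as np

a = np.random.randint(10, size=(5, 3, 3))
a[:, 0, 0] = np.arange(5)
a[:, 0, 1] = np.arange(5)

labels = {}

def get_label(series):
    # tobytes() copies the (possibly strided) slice into a bytes object,
    # which identifies the values exactly -- no hash collisions possible
    key = series.tobytes()
    if key not in labels:
        labels[key] = len(labels)
    return labels[key]

h = np.array([[get_label(a[:, i, j]) for j in range(a.shape[2])]
              for i in range(a.shape[1])])
```

As with the hash-based version, h[0, 0] == h[0, 1] holds, but the labels are small consecutive integers rather than arbitrary hash values.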