Counting how many times each row appears in a numpy.array

Time: 2021-02-08 23:55:42

I am trying to count the number of times each row appears in a np.array, for example:

import numpy as np
my_array = np.array([[1, 2, 0, 1, 1, 1],
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [9, 7, 5, 3, 2, 1],
                     [1, 1, 1, 0, 0, 0], 
                     [1, 2, 0, 1, 1, 1], # duplicate of row 0
                     [1, 1, 1, 1, 1, 0]])

Row [1, 2, 0, 1, 1, 1] shows up 3 times.

A simple naive solution would involve converting all my rows to tuples, and applying collections.Counter, like this:

from collections import Counter
def row_counter(my_array):
    list_of_tups = [tuple(ele) for ele in my_array]
    return Counter(list_of_tups)

Which yields:

In [2]: row_counter(my_array)
Out[2]: Counter({(1, 2, 0, 1, 1, 1): 3, (1, 1, 1, 1, 1, 0): 1, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1})

However, I am concerned about the efficiency of my approach, and maybe there is a library that provides a built-in way of doing this. I tagged the question as pandas because I think pandas might have the tool I am looking for.

5 Solutions

#1 (score: 9)

You can use the answer to this other question of yours to get the counts of the unique items.

In numpy 1.9 there is a return_counts optional keyword argument, so you can simply do:

>>> my_array
array([[1, 2, 0, 1, 1, 1],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1],
       [1, 1, 1, 0, 0, 0],
       [1, 2, 0, 1, 1, 1],
       [1, 1, 1, 1, 1, 0]])
>>> dt = np.dtype((np.void, my_array.dtype.itemsize * my_array.shape[1]))
>>> b = np.ascontiguousarray(my_array).view(dt)
>>> unq, cnt = np.unique(b, return_counts=True)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])

In earlier versions, you can do it as:

>>> unq, _ = np.unique(b, return_inverse=True)
>>> cnt = np.bincount(_)
>>> unq = unq.view(my_array.dtype).reshape(-1, my_array.shape[1])
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])
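
For reference, NumPy 1.13 and later accept an axis argument in np.unique, which avoids the void-dtype view entirely (a small addition to this answer, assuming a recent NumPy):

>>> unq, cnt = np.unique(my_array, axis=0, return_counts=True)
>>> unq
array([[1, 1, 1, 0, 0, 0],
       [1, 1, 1, 1, 1, 0],
       [1, 2, 0, 1, 1, 1],
       [9, 7, 5, 3, 2, 1]])
>>> cnt
array([1, 1, 3, 1])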

#2 (score: 4)

(This assumes that the array is fairly small, e.g. fewer than 1000 rows.)

Here's a short NumPy way to count how many times each row appears in an array:

>>> (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
array([3, 3, 1, 1, 3, 1])

This counts how many times each row appears in my_array, returning an array where the first value shows how many times the first row appears, the second value shows how many times the second row appears, and so on.
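
If you also want a Counter-style mapping from unique rows to counts, one possible follow-up (a sketch, not part of the original answer) is to zip the counts back with the rows; duplicate rows simply rewrite the same key with the same value. Note that the comparison builds an n × n × m boolean array, which is why the small-array caveat above matters.

>>> counts = (my_array[:, np.newaxis] == my_array).all(axis=2).sum(axis=1)
>>> {tuple(row.tolist()): int(c) for row, c in zip(my_array, counts)}
{(1, 2, 0, 1, 1, 1): 3, (9, 7, 5, 3, 2, 1): 1, (1, 1, 1, 0, 0, 0): 1, (1, 1, 1, 1, 1, 0): 1}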

#3 (score: 3)

Your solution is not bad, but if your matrix is large you will probably want to use a more efficient hash for the rows (compared to the default one Counter uses) before counting. You can do that with joblib:

A = np.random.rand(5, 10000)

%timeit (A[:,np.newaxis,:] == A).all(axis=2).sum(axis=1)
10000 loops, best of 3: 132 µs per loop

%timeit Counter(joblib.hash(row) for row in A).values()
1000 loops, best of 3: 1.37 ms per loop

%timeit Counter(tuple(ele) for ele in A).values()
100 loops, best of 3: 3.75 ms per loop

%timeit pd.DataFrame(A).groupby(range(A.shape[1])).size()
1 loops, best of 3: 2.24 s per loop

The pandas solution is extremely slow (about 2 s per loop) with this many columns. For a small matrix like the one you showed, your method is faster than joblib hashing but slower than numpy:

numpy:  100000 loops, best of 3: 15.1 µs per loop
joblib: 1000 loops, best of 3: 885 µs per loop
tuple:  10000 loops, best of 3: 27 µs per loop
pandas: 100 loops, best of 3: 2.2 ms per loop

If you have a large number of rows then you can probably find a better substitute for Counter to find hash frequencies.
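
For example, one possible substitute (a rough sketch, not benchmarked here) is to hash each row with joblib and let np.unique count the hash frequencies instead of Counter:

import joblib
import numpy as np

A = np.random.rand(1000, 10000)

# hash each row to a short string, then count how often each hash occurs
hashes = np.array([joblib.hash(row) for row in A])
unique_hashes, counts = np.unique(hashes, return_counts=True)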

Edit: Added numpy benchmarks from @acjr's solution on my system so that it is easier to compare. The numpy solution is the fastest in both cases.

#4 (score: 2)

A pandas approach might look like this:

import pandas as pd

df = pd.DataFrame(my_array, columns=['c1', 'c2', 'c3', 'c4', 'c5', 'c6'])
df.groupby(['c1','c2','c3','c4','c5','c6']).size()

Note: supplying column names is not necessary
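
On pandas 1.1 and later, DataFrame.value_counts can also count unique rows directly, without naming columns or calling groupby (a possible alternative, assuming a recent pandas):

import pandas as pd

# returns a Series of counts indexed by the unique row values
pd.DataFrame(my_array).value_counts()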

#5 (score: 0)

A solution identical to Jaime's can be found in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.count(my_array)
