I have two 2D numpy arrays of identical shape (simplified in this example with respect to size and content).
An ID matrix:
1 1 1 2 2
1 1 2 2 5
1 1 2 5 5
1 2 2 5 5
2 2 5 5 5
and a value matrix:
14.8 17.0 74.3 40.3 90.2
25.2 75.9 5.6 40.0 33.7
78.9 39.3 11.3 63.6 56.7
11.4 75.7 78.4 88.7 58.6
79.6 32.3 35.3 52.5 13.3
My goal is to count and sum the values from the second matrix, grouped by the IDs from the first matrix:
1: (8, 336.8)
2: (9, 453.4)
5: (8, 402.4)
I can do this in a for loop, but when the matrices have sizes in the thousands instead of just 5x5, and thousands of unique IDs, it takes a lot of time to process.
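For reference, this is roughly the loop I mean, visiting each cell and accumulating per-ID counts and sums in a dict (a minimal sketch of the slow approach, using the matrices ID and value from above):

result = {}
for r in range(ID.shape[0]):
    for c in range(ID.shape[1]):
        cnt, tot = result.get(ID[r, c], (0, 0.0))   # current (count, sum) for this ID
        result[ID[r, c]] = (cnt + 1, tot + value[r, c])
# result -> {1: (8, 336.8), 2: (9, 453.4), 5: (8, 402.4)}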
Does numpy have a clever method, or a combination of methods, for doing this?
3 Answers
#1
Here's a vectorized approach to get the counts for ID and the ID-based summed values for value with a combination of np.unique and np.bincount -
unqID, idx, IDsums = np.unique(ID, return_inverse=True, return_counts=True)  # unique IDs, inverse indices, per-ID counts
value_sums = np.bincount(idx, value.ravel())  # weighted bincount over the inverse indices = per-ID sums
To get the final output as a dictionary, you can use a dict comprehension to gather the counts and summed values, like so -
{i: (IDsums[itr], value_sums[itr]) for itr, i in enumerate(unqID)}
Sample run -
In [86]: ID
Out[86]:
array([[1, 1, 1, 2, 2],
       [1, 1, 2, 2, 5],
       [1, 1, 2, 5, 5],
       [1, 2, 2, 5, 5],
       [2, 2, 5, 5, 5]])
In [87]: value
Out[87]:
array([[ 14.8,  17. ,  74.3,  40.3,  90.2],
       [ 25.2,  75.9,   5.6,  40. ,  33.7],
       [ 78.9,  39.3,  11.3,  63.6,  56.7],
       [ 11.4,  75.7,  78.4,  88.7,  58.6],
       [ 79.6,  32.3,  35.3,  52.5,  13.3]])
In [88]: unqID,idx,IDsums = np.unique(ID,return_counts=True,return_inverse=True)
...: value_sums = np.bincount(idx,value.ravel())
...:
In [89]: {i:(IDsums[itr],value_sums[itr]) for itr,i in enumerate(unqID)}
Out[89]:
{1: (8, 336.80000000000001),
 2: (9, 453.40000000000003),
 5: (8, 402.40000000000003)}
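One caveat: on newer NumPy releases (2.0 and later), return_inverse may return the inverse indices with the same shape as the 2D input, while np.bincount requires a 1-D array. If you run into that, flattening the inverse first should keep the recipe above intact (an assumption about your NumPy version; verify against your install):

unqID, idx, IDsums = np.unique(ID, return_inverse=True, return_counts=True)
value_sums = np.bincount(idx.ravel(), value.ravel())  # idx.ravel() guards against a 2D inverse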
#2
This is possible with a combination of a few simple methods:
- use numpy.unique to find each ID
- create a boolean mask for each ID
- sum the 1s in the mask (count) and the values where the mask is 1
This can look like this:
import numpy as np

ids = np.array([[1, 1, 1, 2, 2],
                [1, 1, 2, 2, 5],
                [1, 1, 2, 5, 5],
                [1, 2, 2, 5, 5],
                [2, 2, 5, 5, 5]])

values = np.array([[14.8, 17.0, 74.3, 40.3, 90.2],
                   [25.2, 75.9, 5.6, 40.0, 33.7],
                   [78.9, 39.3, 11.3, 63.6, 56.7],
                   [11.4, 75.7, 78.4, 88.7, 58.6],
                   [79.6, 32.3, 35.3, 52.5, 13.3]])

for i in np.unique(ids):           # loop over all unique IDs
    mask = ids == i                # boolean mask of entries matching the current ID
    count = np.sum(mask)           # number of matches
    total = np.sum(values[mask])   # sum of the matching values
    print('{}: ({}, {:.1f})'.format(i, count, total))  # print result
# Output:
# 1: (8, 336.8)
# 2: (9, 453.4)
# 5: (8, 402.4)
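If you want the dictionary output from the question instead of printed lines, the same loop can collect its results (a minimal variation on the code above):

result = {}
for i in np.unique(ids):
    mask = ids == i
    result[i] = (int(np.sum(mask)), float(np.sum(values[mask])))
# result -> {1: (8, 336.8), 2: (9, 453.4), 5: (8, 402.4)} (up to float rounding)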
#3
The numpy_indexed package (disclaimer: I am its author) has functionality to solve this kind of problem in an elegant and vectorized manner:
import numpy_indexed as npi

group_by = npi.group_by(ID.flatten())
ID_unique, value_sums = group_by.sum(value.flatten())  # unique IDs and per-ID sums
ID_count = group_by.count                              # per-ID counts
Note: if you want to compute the sum and count in order to compute a mean, there is also group_by.mean, plus a lot of other useful functionality.
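For instance, a per-ID mean could be pulled from the same group_by object like this (a short sketch following the package's (keys, values) return convention; not part of the original answer):

ID_unique, value_means = group_by.mean(value.flatten())  # mean of the values per unique ID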