NumPy array: group by one column, sum another column

I have an array that looks like this:

 array([[ 0,  1,  2],
        [ 1,  1,  6],
        [ 2,  2, 10],
        [ 3,  2, 14]])

I want to sum the values in the third column over rows that share the same value in the second column, so the result is something like:

 array([[ 0,  1,  8],
        [ 1,  2, 24]])

I started coding this but I'm stuck with this sum:

import numpy as np
import sys

inFile = sys.argv[1]

with open(inFile, 'r') as t:
    f = np.genfromtxt(t, delimiter=None, names =["1","2","3"])

f.sort(order=["1","2"])
if value == previous.value:
   sum(f["3"])

6 Answers

#1

You can use pandas to vectorize your algorithm:

import pandas as pd, numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(A)\
       .groupby(1, as_index=False)\
       .sum()\
       .reset_index()

res = df[['index', 1, 2]].values

Result

array([[ 0,  1,  8],
       [ 1,  2, 24]], dtype=int64)

#2

If your data is sorted by the second column, you can use something centered around np.add.reduceat for a pure numpy solution. A combination of np.nonzero (or np.where) applied to np.diff will give you the locations where the second column switches values. You can use those indices to do the sum-reduction. The other columns are pretty formulaic, so you can concatenate them back in fairly easily:

import numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])
# Find the split indices
i = np.nonzero(np.diff(A[:, 1]))[0] + 1
i = np.insert(i, 0, 0)
# Compute the result columns
c0 = np.arange(i.size)
c1 = A[i, 1]
c2 = np.add.reduceat(A[:, 2], i)
# Concatenate the columns
result = np.c_[c0, c1, c2]

Notice the +1 in the indices. That is because you always want the location after the switch, not before, given how reduceat works. The insertion of zero as the first index could also be accomplished with np.r_, np.concatenate, etc.
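
To make the index arithmetic concrete, here is a small sketch that traces each intermediate value for the example's second and third columns. The variable names (keys, vals, change, starts) are mine and are not part of the answer's code.

```python
import numpy as np

# Second and third columns of the example array
keys = np.array([1, 1, 2, 2])
vals = np.array([2, 6, 10, 14])

# np.diff is nonzero at position k when keys[k] != keys[k + 1],
# so `change` holds the index of the last row of each finished group
change = np.nonzero(np.diff(keys))[0]

# Shift by one to get the first row of the next group, then prepend 0
# for the first group
starts = np.insert(change + 1, 0, 0)

# reduceat sums vals[starts[j]:starts[j + 1]]; the last slice runs to the end
sums = np.add.reduceat(vals, starts)

print(change, starts, sums)
```

Without the +1, the second slice would begin at the last row of the first group, and that row would be counted in the wrong sum.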

That being said, I still think you are looking for the pandas version in @jpp's answer.
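
For input that is not already sorted by the second column, one possible extension is to sort the rows first and then apply the same reduceat steps. The sketch below assumes a non-empty 2-D integer array; group_sum and its parameters are hypothetical names introduced here for illustration.

```python
import numpy as np

def group_sum(a, key_col=1, val_col=2):
    """Hypothetical helper: sum val_col for each distinct value in key_col."""
    # Stable sort so rows with equal keys keep their original relative order
    order = np.argsort(a[:, key_col], kind="stable")
    keys = a[order, key_col]
    vals = a[order, val_col]
    # Index of the first row of each run of equal keys
    starts = np.insert(np.nonzero(np.diff(keys))[0] + 1, 0, 0)
    # Columns: running group number, key, sum of values
    return np.c_[np.arange(starts.size), keys[starts], np.add.reduceat(vals, starts)]

# Same rows as the question, shuffled so the second column is unsorted
A = np.array([[ 2,  2, 10],
              [ 0,  1,  2],
              [ 3,  2, 14],
              [ 1,  1,  6]])
result = group_sum(A)
print(result)
```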

#3

Here is my solution using only numpy arrays...

import numpy as np
arr = np.array([[ 0,  1,  2], [ 1,  1,  6], [ 2,  2, 10], [ 3,  2, 14]])

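lst = []
compt = 0  # running group counter, used as the first output column
for index in range(1, max(arr[:, 1]) + 1):
    # Sum the 3rd column over the rows whose 2nd column equals index
    lst.append([compt, index, np.sum(arr[arr[:, 1] == index][:, 2])])
    compt += 1
lst = np.array(lst)
print(lst)
# lst, outputs...
# [[ 0  1  8]
#  [ 1  2 24]]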

The tricky part is the np.sum(arr[arr[:, 1] == index][:, 2]), so let's break it down to multiple parts.

arr[arr[:, 1] == index] means...

We start with the array arr and ask numpy for the rows whose 2nd column (index 1) matches the current value of the for loop variable. The loop runs from 1 to the maximum value found in that column. Printing only this expression inside the for loop results in...

# First iteration
[[0 1 2]
 [1 1 6]]
# Second iteration
[[ 2  2 10]
 [ 3  2 14]]

Adding [:, 2] to the expression selects the 3rd column (index 2) of the rows shown above. Printing arr[arr[:, 1] == index][:, 2] gives [2, 6] on the first iteration and [10, 14] on the second.

I just need to sum these values using np.sum(), and to format my output list accordingly. :)
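
The loop above assumes that the keys in the 2nd column are the integers 1 up to the maximum. A possible variation, sketched here with made-up data where the keys are 1 and 5, iterates over np.unique instead so that missing or non-consecutive keys do not produce empty groups. The names rows, mask and group_id are illustrative only.

```python
import numpy as np

# Example data made up for illustration: keys 1 and 5, nothing in between
arr = np.array([[0, 1, 2], [1, 1, 6], [2, 5, 10], [3, 5, 14]])

rows = []
# np.unique returns the distinct keys in sorted order
for group_id, key in enumerate(np.unique(arr[:, 1])):
    mask = arr[:, 1] == key              # boolean mask of rows with this key
    rows.append([group_id, key, arr[mask, 2].sum()])
result = np.array(rows)
print(result)
```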

#4

Using a dictionary to store the values and then converting back to a list

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

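y = {}
for val in x:
    if val[1] in y:
        # Key already seen: accumulate the third column
        y[val[1]][2] += val[2]
    else:
        # First row for this key: store a copy so x itself is not modified
        y[val[1]] = list(val)
print([y[val] for val in y])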

#5

You can also use a defaultdict and sum the values:

from collections import defaultdict

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

res = defaultdict(int)
for val in x:
    # Accumulate the third column under the second-column key
    res[val[1]] += val[2]
print([[i, val, res[val]] for i, val in enumerate(res)])
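
A dictionary yields its keys in first-seen order, so with unsorted input the groups come out in order of first appearance. If sorted keys and a numpy array are wanted, a minimal sketch could look like this; totals and result are names chosen here, and the rows are the question's rows in a different order.

```python
from collections import defaultdict

import numpy as np

# The question's rows, reordered so the keys are not sorted
x = [[2, 2, 10], [0, 1, 2], [3, 2, 14], [1, 1, 6]]

totals = defaultdict(int)
for _, key, value in x:
    totals[key] += value

# Sort the keys before numbering the groups, then convert back to an array
result = np.array([[i, key, totals[key]] for i, key in enumerate(sorted(totals))])
print(result)
```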

#6

To get exact output use pandas:

import pandas as pd
import numpy as np

a = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

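df = pd.DataFrame(a)
# Sum only column 2 per group; summing every column would also add up column 0.
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0.
res = df.groupby(1)[2].sum().reset_index().reset_index().to_numpy()
print(res)
# [[ 0  1  8]
#  [ 1  2 24]]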
