A pythonic way to aggregate an array (with numpy or not)

Date: 2022-05-26 21:27:24

I would like to write a nice function to aggregate data in an array (it is a numpy record array, but that does not change anything).

You have an array of data that you want to aggregate along one field: for example, an array of dtype=[('name', (np.str_, 8)), ('job', (np.str_, 8)), ('income', np.uint32)], and you want the mean income per job.

I wrote the function below; in this example it would be called as aggregate(data, 'job', 'income', np.mean).


def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        if k not in data_per_key.keys():
            data_per_key[k] = []
        data_per_key[k].append(v)
    return [(k, func(data_per_key[k])) for k in data_per_key.keys()]

The problem is that I do not find it very nice; I would like to have it in one line. Do you have any ideas?

Thanks for your answers, Louis

PS: I would like to keep func in the call so that you can also ask for the median, the minimum, ...
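For reference, the whole function can be collapsed into a single expression with itertools.groupby; this is only a sketch (it assumes the sample data used in the answers, and note that groupby merges only adjacent equal keys, so the pairs must be sorted first):

```python
import numpy as np
from itertools import groupby

# A sample record array matching the dtype described above
data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

def aggregate(data, key, value, func):
    # groupby only merges *adjacent* equal keys, hence the sorted()
    return [(k, func([v for _, v in g]))
            for k, g in groupby(sorted(zip(data[key], data[value])),
                                key=lambda kv: kv[0])]

print(aggregate(data, 'job', 'income', np.mean))
```

Sorting costs O(n log n), so this is more about compactness than speed.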

6 Answers

#1 (5 votes)

Perhaps the function you are seeking is matplotlib.mlab.rec_groupby:

import matplotlib.mlab

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

result=matplotlib.mlab.rec_groupby(data, ('job',), (('income',np.mean,'avg_income'),))

yields

('Digger', 4.0)
('Planter', 2.5)
('Waterer', 3.0)

matplotlib.mlab.rec_groupby returns a recarray:

print(result.dtype)
# [('job', '|S7'), ('avg_income', '<f8')]

You may also be interested in checking out pandas, which has even more versatile facilities for handling group-by operations.

#2 (5 votes)

Your if k not in data_per_key.keys() could be rewritten as if k not in data_per_key, but you can do even better with defaultdict. Here's a version that uses defaultdict to get rid of the existence check:

import collections

def aggregate(data, key, value, func):
    data_per_key = collections.defaultdict(list)
    for k,v in zip(data[key], data[value]):
        data_per_key[k].append(v)

    return [(k,func(data_per_key[k])) for k in data_per_key.keys()]

#3 (2 votes)

Here is a recipe which emulates the functionality of Matlab's accumarray quite well. It uses Python's iterators quite nicely; nevertheless, performance-wise it lags the Matlab implementation. As I had the same problem, I wrote an implementation using scipy.weave. You can find it here: https://github.com/ml31415/accumarray

#4 (2 votes)

The best flexibility and readability are obtained using pandas:

import pandas

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

df = pandas.DataFrame(data)
result = df.groupby('job').mean()

This yields:

         income
job
Digger      4.0
Planter     2.5
Waterer     3.0

Pandas DataFrame is a great class to work with, and you can get your results back in whatever form you need:

result.to_records()
result.to_dict()
result.to_csv()

And so on...
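If you want to keep the caller-supplied aggregation function from the original aggregate(), groupby.agg accepts one directly; a sketch (median here, but min, max, etc. work the same way):

```python
import numpy as np
import pandas as pd

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

df = pd.DataFrame(data)
# .agg takes any reduction (a name like 'median' or a callable),
# mirroring the func argument of the original aggregate() function
result = df.groupby('job')['income'].agg('median')
print(result)
```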

#5 (2 votes)

The best performance is achieved using ndimage.mean from scipy. This is twice as fast as the accepted answer for this small dataset, and about 3.5 times faster for larger inputs:

from scipy import ndimage

data=np.array(
    [('Aaron','Digger',1),
     ('Bill','Planter',2),
     ('Carl','Waterer',3),
     ('Darlene','Planter',3),
     ('Earl','Digger',7)],
    dtype=[('name', np.str_,8), ('job', np.str_,8), ('income', np.uint32)])

unique = np.unique(data['job'])
result=np.dstack([unique, ndimage.mean(data['income'], data['job'], unique)])

This yields:

array([[['Digger', '4.0'],
        ['Planter', '2.5'],
        ['Waterer', '3.0']]],
      dtype='|S32')

EDIT: with bincount (faster!)

This is about 5x faster than the accepted answer for the small example input; if you repeat the data 100,000 times it is around 8.5x faster:

unique, uniqueInd, uniqueCount = np.unique(data['job'], return_inverse=True, return_counts=True)
means = np.bincount(uniqueInd, weights=data['income']) / uniqueCount
result = np.dstack([unique, means])
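Note that np.dstack casts the names and the means to a single string dtype; if you want to keep the means numeric, one option is a record-array result instead (a sketch, with avg_income as a made-up field name):

```python
import numpy as np

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

# Group ids and group sizes in a single pass over the key column
unique, inv, counts = np.unique(data['job'], return_inverse=True,
                                return_counts=True)
# Weighted bincount sums income per group; divide by counts for the mean
means = np.bincount(inv, weights=data['income']) / counts

# A record array keeps the job names as strings and the means as floats,
# unlike dstack, which casts both columns to strings
result = np.rec.fromarrays([unique, means], names='job,avg_income')
print(result)
```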

#6 (0 votes)

http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.html#dictionary-get-method

should help make it a little prettier, more pythonic, and possibly more efficient. I'll come back later to check on your progress. Maybe you can edit the function with this in mind? Also see the next couple of sections.

应该有助于使它更漂亮,更pythonic,更有效。我稍后会回来查看你的进展情况。也许你可以编辑这个功能吗?另见下几节。
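The idiom that link points at is dict.get / dict.setdefault; a sketch of the original function rewritten that way (same sample data as the other answers):

```python
import numpy as np

data = np.array(
    [('Aaron', 'Digger', 1),
     ('Bill', 'Planter', 2),
     ('Carl', 'Waterer', 3),
     ('Darlene', 'Planter', 3),
     ('Earl', 'Digger', 7)],
    dtype=[('name', np.str_, 8), ('job', np.str_, 8), ('income', np.uint32)])

def aggregate(data, key, value, func):
    data_per_key = {}
    for k, v in zip(data[key], data[value]):
        # setdefault replaces the explicit "if k not in ..." check:
        # it returns the existing list, or inserts and returns a new one
        data_per_key.setdefault(k, []).append(v)
    return [(k, func(vs)) for k, vs in data_per_key.items()]

print(aggregate(data, 'job', 'income', min))
```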
