Using numpy.average with weights for resampling a pandas array

I need to resample some data with NumPy's weighted average function, np.average, and it just doesn't work.

This is my test-case:

import datetime

import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007, 1, 1, 0, 0),
            datetime.datetime(2007, 1, 1, 0, 1),
            datetime.datetime(2007, 1, 1, 0, 5),
            datetime.datetime(2007, 1, 1, 0, 8),
            datetime.datetime(2007, 1, 1, 0, 10)]
df = pd.DataFrame([2, 3, 1, 7, 4], index=time_vec)

Plain resampling without weights works fine (passing a lambda as the how parameter, as suggested in the question "Pandas resampling using numpy percentile?"):

df.resample('5min',how = lambda x: np.average(x[0]))

But if I try to use weights, it always raises a TypeError: Axis must be specified when shapes of a and weights differ:

df.resample('5min',how = lambda x: np.average(x[0],weights = [1,2,3,4,5]))

I tried this with many different numbers of weights, but it did not get better:

for i in xrange(20):
    try:
        print range(i)
        print df.resample('5min',how = lambda x:np.average(x[0],weights = range(i)))
        print i
        break
    except TypeError:
        print i,'typeError'

I'd appreciate any suggestions.

1 Solution

#1

The short answer here is that the weights in your lambda need to be created dynamically based on the length of the series that is being averaged. In addition, you need to be careful about the types of objects that you're manipulating.

The code I ended up with to compute what I think you're after is:

df.resample('5min', how=lambda x: np.average(x, weights=1+np.arange(len(x))))

There are two differences compared with the line that was giving you problems:

1. x[0] is now just x. The x object in the lambda is a pd.Series, so x[0] gives just the first value in the series. This worked without raising an exception in the first example (without the weights) because np.average(c) simply returns c when c is a scalar. But I think it was computing incorrect averages even in that case, because each of the sampled subsets was just returning its first value as the "average".

2. The weights are created dynamically based on the length of data in the Series being resampled. You need to do this because the x in your lambda might be a Series of a different length for each time interval being computed.
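
To make both points concrete, here is a small NumPy-only sketch. The names and sample values are illustrative assumptions of mine: a scalar stands in for x[0], and a two-element array stands in for the values of one 5-minute bin.

```python
import numpy as np

first_value = 2.0               # stands in for x[0], a single scalar
bin_values = np.array([2, 3])   # stands in for one resampled bin (hypothetical)

# Without weights, averaging a scalar just gives the scalar back.
unweighted = np.average(first_value)

# Five fixed weights against a scalar: the shapes differ, so NumPy raises TypeError.
try:
    np.average(first_value, weights=[1, 2, 3, 4, 5])
    scalar_error = None
except TypeError as exc:
    scalar_error = str(exc)

# Weights built from the bin's own length always match its shape.
weights = 1 + np.arange(len(bin_values))            # array([1, 2])
weighted = np.average(bin_values, weights=weights)  # (2*1 + 3*2) / 3

print(unweighted, scalar_error, weighted)
```

A fixed-length weight list can only ever match bins of exactly that length, which is why no single value of i in the question's loop could succeed for every bin.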

The way I figured this out was through some simple type debugging, by replacing the lambda with a proper function definition:

def avg(x):
    print(type(x), x.shape, type(x[0]))
    return np.average(x, weights=np.arange(1, 1+len(x)))

df.resample('5Min', how=avg)

This let me have a look at what was happening with the x variable. Hope that helps!
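
One caveat for readers on current pandas: the how= argument to resample was deprecated and later removed, so the calls above only run on old versions. Below is a minimal sketch of the same idea using the method-chaining form. The helper name weighted_avg, the choice to work on a single Series rather than the DataFrame, and returning NaN for empty bins are my assumptions, not part of the original code.

```python
import datetime

import numpy as np
import pandas as pd

time_vec = [datetime.datetime(2007, 1, 1, 0, 0),
            datetime.datetime(2007, 1, 1, 0, 1),
            datetime.datetime(2007, 1, 1, 0, 5),
            datetime.datetime(2007, 1, 1, 0, 8),
            datetime.datetime(2007, 1, 1, 0, 10)]
series = pd.Series([2, 3, 1, 7, 4], index=time_vec)


def weighted_avg(x):
    # x is the Series of values that fall into one 5-minute bin.
    if len(x) == 0:
        return np.nan  # assumption: report empty bins as NaN
    weights = 1 + np.arange(len(x))  # 1, 2, ..., len(x)
    return np.average(x, weights=weights)


result = series.resample('5min').apply(weighted_avg)
print(result)
```

With the sample data the three bins hold (2, 3), (1, 7) and (4), so the weighted averages should come out to 8/3, 5 and 4.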
