Numpy: averaging many data points at each time step

Time: 2021-01-05 23:39:45

This question is probably answered somewhere, but I cannot find where, so I will ask here:

I have a set of data consisting of several samples per timestep. So I basically have two arrays: "times", which looks something like (0,0,0,1,1,1,1,1,2,2,3,4,4,4,4,...), and my data, which holds the value for each time. Each timestep has a random number of samples. I would like to get the average value of the data at each timestep in an efficient manner.

I have prepared the following sample code to show what my data looks like. Basically, I am wondering if there is a more efficient way to write the "average_values" function.

import numpy as np
import matplotlib.pyplot as plt

def average_values(x,y):
    unique_x = np.unique(x)
    averaged_y = [np.mean(y[x==ux]) for ux in unique_x]
    return unique_x, averaged_y

#generate our data
times   = []
samples = []

#we have some timesteps:
for time in np.linspace(0,10,101):

    #and a random number of samples at each timestep:
    #randint excludes its upper bound, so this draws 1..10 (random_integers is deprecated)
    num_samples = np.random.randint(1,11)

    for i in range(0,num_samples):
        times.append(time)
        samples.append(np.sin(time)+np.random.random()*0.5)

times   = np.array(times)
samples = np.array(samples)

plt.plot(times,samples,'bo',ms=3,mec=None,alpha=0.5)
plt.plot(*average_values(times,samples),color='r')
plt.show()

Here is what it looks like:

[Figure: the samples as blue dots, with the per-timestep average as a red line]

2 answers

#1


5  

May I propose a pandas solution? pandas is highly recommended if you are going to be working with time series.

Create test data

import pandas as pd
import numpy as np

times = np.random.randint(0,10,size=50)
values = np.sin(times) + np.random.random_sample((len(times),))
s = pd.Series(values, index=times)
s.plot(linestyle='none', marker='o')  # markers only; '.' is a marker, not a valid linestyle

Calculate averages

avs = s.groupby(level=0).mean()
avs.plot()
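
If the data already sits in two NumPy arrays, as in the question, the same grouping can probably be done without building an indexed Series first. The following is a minimal sketch, assuming `times` and `samples` are equal-length 1-D arrays; the small arrays here are made-up illustration data, not the question's.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the question's arrays.
times = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 2.0])
samples = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 5.0])

# Group the sample values by the matching entries of `times`
# and average each group; groups come back sorted by time.
avs = pd.Series(samples).groupby(times).mean()

unique_times = avs.index.to_numpy()   # one entry per distinct time
averaged = avs.to_numpy()             # mean of the samples at that time
```

Passing an array as the grouping key keeps the original arrays untouched, which may be convenient when the rest of the script works with plain NumPy.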

#2


9  

Generic code to do this could look as follows:

def average_values_bis(x, y):
    unq_x, idx = np.unique(x, return_inverse=True)
    count_x = np.bincount(idx)
    sum_y = np.bincount(idx, weights=y)

    return unq_x, sum_y / count_x
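
To see why the two bincount calls give a per-timestep mean, it may help to trace the intermediate arrays on a tiny input. This is only an illustrative sketch; the values of `x` and `y` below are invented for the trace.

```python
import numpy as np

# Hypothetical toy data: three timesteps with 3, 2 and 1 samples.
x = np.array([0.5, 0.5, 0.5, 1.5, 1.5, 2.5])
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0, 5.0])

# idx maps every element of x to the position of its value in unq_x.
unq_x, idx = np.unique(x, return_inverse=True)
# unq_x -> [0.5, 1.5, 2.5], idx -> [0, 0, 0, 1, 1, 2]

count_x = np.bincount(idx)            # samples per group: [3, 2, 1]
sum_y = np.bincount(idx, weights=y)   # sum of y per group: [6., 30., 5.]

means = sum_y / count_x               # [2., 15., 5.]
```

Because `idx` contains small non-negative integers, bincount can accumulate counts and weighted sums in a single pass over the data, instead of building one boolean mask per unique time.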

Adding average_values_bis and the following plotting line to your script

plt.plot(*average_values_bis(times, samples),color='g')

produces this output, with the red line hidden behind the green one:

[Figure: the question's plot with the green line drawn over the red one]

Timing both approaches reveals the benefit of using bincount, a roughly 30x speed-up:

%timeit average_values(times, samples)
100 loops, best of 3: 2.83 ms per loop

%timeit average_values_bis(times, samples)
10000 loops, best of 3: 85.9 us per loop
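
%timeit is an IPython magic, so the lines above only work in an IPython session. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The data generation below is a simplified, seeded stand-in for the question's script (an assumption, not the original data), and the absolute numbers will vary with machine and NumPy version.

```python
import timeit
import numpy as np

def average_values(x, y):
    # Loop version from the question: one boolean mask per unique time.
    unique_x = np.unique(x)
    return unique_x, np.array([np.mean(y[x == ux]) for ux in unique_x])

def average_values_bis(x, y):
    # bincount version from the answer.
    unq_x, idx = np.unique(x, return_inverse=True)
    return unq_x, np.bincount(idx, weights=y) / np.bincount(idx)

# Stand-in data: 101 timesteps with 1 to 10 samples each (assumed shape).
rng = np.random.default_rng(0)
steps = np.linspace(0, 10, 101)
times = np.repeat(steps, rng.integers(1, 11, size=steps.size))
samples = np.sin(times) + rng.random(times.size) * 0.5

# Both implementations should agree before their speed is compared.
assert np.allclose(average_values(times, samples)[1],
                   average_values_bis(times, samples)[1])

for func in (average_values, average_values_bis):
    per_call = timeit.timeit(lambda: func(times, samples), number=200) / 200
    print(f"{func.__name__}: {per_call * 1e6:.1f} us per call")
```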
