I have a set of data (X, Y). The values of my independent variable X are not unique, so some values are repeated. I want to output a new array containing: X_unique, the list of unique values of X; Y_mean, the mean of all Y values corresponding to each value in X_unique; and Y_std, the standard deviation of all Y values corresponding to each value in X_unique.
x = data[:,0]
y = data[:,1]
3 solutions
#1
2
x_unique = np.unique(x)
y_means = np.array([np.mean(y[x==u]) for u in x_unique])
y_stds = np.array([np.std(y[x==u]) for u in x_unique])
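The list comprehensions above rescan the whole array once per unique value. A possible alternative, sketched here under the assumption that x and y are 1D arrays of equal length, groups everything in one pass with np.unique(..., return_inverse=True) and np.bincount. The function name group_mean_std is hypothetical and not part of the original answer.

```python
import numpy as np

def group_mean_std(x, y):
    """Return unique x values with the mean and population std of y per group."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    # inverse[i] is the index into x_unique of the group that x[i] belongs to
    x_unique, inverse = np.unique(x, return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    # Per-group sum of y divided by the group size
    y_means = np.bincount(inverse, weights=y) / counts
    # Population standard deviation (ddof=0), matching the np.std default
    deviations = y - y_means[inverse]
    y_stds = np.sqrt(np.bincount(inverse, weights=deviations**2) / counts)
    return x_unique, y_means, y_stds
```

Calling group_mean_std(data[:,0], data[:,1]) should give the same three arrays as the comprehension version, since both use the population standard deviation.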
#2
4
You can use binned_statistic from scipy.stats, which applies a choice of statistic functions to chunks of a 1D array. To get the chunks, we sort the data and find the positions where the value changes (the chunk boundaries), for which np.unique is useful. Putting it all together, here's an implementation:
import numpy as np
from scipy.stats import binned_statistic as bstat
# Sort the rows of data by the first column
sdata = data[data[:,0].argsort()]
# Unique values of the first column and the index where each group starts in the sorted data
unq_x,breaks = np.unique(sdata[:,0],return_index=True)
breaks = np.append(breaks,data.shape[0])
# Use binned statistic to get grouped average and std deviation values
idx_range = np.arange(data.shape[0])
avg_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='mean', bins=breaks)
std_y,_,_ = bstat(x=idx_range, values=sdata[:,1], statistic='std', bins=breaks)
From the docs of binned_statistic, one can also use a custom statistic function:
function : a user-defined function which takes a 1D array of values, and outputs a single numerical statistic. This function will be called on the values in each bin. Empty bins will be represented by function([]), or NaN if this returns an error.
Sample input and output:
In [121]: data
Out[121]:
array([[2, 5],
[2, 2],
[1, 5],
[3, 8],
[0, 8],
[6, 7],
[8, 1],
[2, 5],
[6, 8],
[1, 8]])
In [122]: np.column_stack((unq_x,avg_y,std_y))
Out[122]:
array([[ 0. , 8. , 0. ],
[ 1. , 6.5 , 1.5 ],
[ 2. , 4. , 1.41421356],
[ 3. , 8. , 0. ],
[ 6. , 7.5 , 0.5 ],
[ 8. , 1. , 0. ]])
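To illustrate the custom statistic option quoted from the docs, here is a minimal sketch that reuses the same sorting and break-point setup on the sample data and passes a callable instead of a string. The helper value_range (max minus min of each group) is a hypothetical example chosen for illustration, not something from the original answer.

```python
import numpy as np
from scipy.stats import binned_statistic

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Same grouping setup as above: sort by x, then find where each group starts
sdata = data[data[:, 0].argsort()]
unq_x, breaks = np.unique(sdata[:, 0], return_index=True)
breaks = np.append(breaks, data.shape[0])
idx_range = np.arange(data.shape[0])

def value_range(values):
    # Hypothetical custom statistic: spread between largest and smallest y
    return np.max(values) - np.min(values)

range_y, _, _ = binned_statistic(idx_range, sdata[:, 1],
                                 statistic=value_range, bins=breaks)
print(np.column_stack((unq_x, range_y)))
```

Any function that maps a 1D array to a single number can be passed the same way.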
#3
1
Pandas is made for this kind of task:
import numpy as np
import pandas
data = np.random.randint(1,5,20).reshape(10,2)
pandas.DataFrame(data).groupby(0).mean()
which gives:
          1
0
1  2.666667
2  3.000000
3  2.000000
4  1.500000
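The snippet above only returns the mean. A sketch that also produces the standard deviation the question asks for might look like the following; the column names "x", "y", "y_mean" and "y_std" are assumptions made for readability. Note that pandas defaults to the sample standard deviation (ddof=1), whereas np.std defaults to the population value (ddof=0), so ddof=0 is passed explicitly here to match the NumPy answers.

```python
import numpy as np
import pandas as pd

data = np.array([[2, 5], [2, 2], [1, 5], [3, 8], [0, 8],
                 [6, 7], [8, 1], [2, 5], [6, 8], [1, 8]])

# Hypothetical column names; the original answer uses the default 0 and 1
df = pd.DataFrame(data, columns=["x", "y"])
grouped = df.groupby("x")["y"]

summary = pd.DataFrame({
    "y_mean": grouped.mean(),
    "y_std": grouped.std(ddof=0),  # population std, as np.std computes by default
}).reset_index()
print(summary)
```

Dropping ddof=0 gives the sample standard deviation instead, which is NaN for groups containing a single row.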