I have a pandas dataframe with a column of real values that I want to zscore normalize:
>> a
array([ nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954, 0.6307,
0.6599, 0.1065, 0.0508])
>> df = pandas.DataFrame({"a": a})
The problem is that a single nan value makes the whole array nan:
>> from scipy.stats import zscore
>> zscore(df["a"])
array([ nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas dataframe and have it ignore the nan values? I'd like the result to have the same dimensions as the original column, with np.nan for values that can't be normalized.
edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom need to be changed for std for this purpose:
zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
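As far as I know, scipy.stats.nanmean and scipy.stats.nanstd were deprecated and later removed from SciPy, so the lambda above may not run on a recent install. A minimal sketch of the same idea with NumPy's np.nanmean and np.nanstd follows; the helper name nan_zscore is made up for illustration and is not part of any library.

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954,
              0.6307, 0.6599, 0.1065, 0.0508])

def nan_zscore(x, ddof=0):
    """Hypothetical helper: z-score that skips NaN when estimating mean and std."""
    x = np.asarray(x, dtype=float)
    # NaN entries are ignored by nanmean/nanstd but stay NaN in the result,
    # so the output has the same shape as the input.
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=ddof)

z = nan_zscore(a)
```

np.nanstd defaults to ddof=0, the same default scipy's zscore uses, so no degrees-of-freedom adjustment is needed here.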
2 solutions
#1
18
Well, the pandas versions of mean and std will handle the NaN, so you could just compute it that way (to get the same result as scipy's zscore, I think you need to use ddof=0 on std):
df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)
a zscore
0 NaN NaN
1 0.0767 -1.148329
2 0.4383 0.071478
3 0.7866 1.246419
4 0.8091 1.322320
5 0.1954 -0.747912
6 0.6307 0.720512
7 0.6599 0.819014
8 0.1065 -1.047803
9 0.0508 -1.235699
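To see why ddof matters here: pandas' std defaults to ddof=1 (sample standard deviation), while scipy's zscore defaults to ddof=0. A small sketch of the difference on the same data, assuming the column is named a as above; the variable names z_pop, z_sample and ratio are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954,
                         0.6307, 0.6599, 0.1065, 0.0508]})
col = df["a"]

# Population std (ddof=0), which is what scipy's zscore uses by default.
z_pop = (col - col.mean()) / col.std(ddof=0)

# pandas' default is ddof=1, which gives slightly smaller z-scores.
z_sample = (col - col.mean()) / col.std()

# The two differ by a constant factor sqrt((n - 1) / n),
# where n is the number of non-NaN values (9 here).
ratio = (z_sample / z_pop).dropna()
```

Both versions leave the NaN row as NaN; only the scaling differs.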
#2
4
You could ignore the NaNs using isnan.
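import numpy as np
z = a.copy()  # initialise array for zscores (copy, so a itself is not overwritten)
mask = ~np.isnan(a)  # True where a holds a real value
z[mask] = zscore(a[mask])
pandas.DataFrame({'a': a, 'Zscore': z})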
Zscore a
0 NaN NaN
1 -1.148329 0.0767
2 0.071478 0.4383
3 1.246419 0.7866
4 1.322320 0.8091
5 -0.747912 0.1954
6 0.720512 0.6307
7 0.819014 0.6599
8 -1.047803 0.1065
9 -1.235699 0.0508
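One thing to watch with the mask approach: plain assignment of a NumPy array does not copy it, so writing into the result can overwrite the input unless the result starts out as a copy. A minimal sketch of the difference, using NumPy's mean and std in place of scipy's zscore to keep it dependency-free (that swap is my assumption, not what the answer uses):

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091, 0.1954,
              0.6307, 0.6599, 0.1065, 0.0508])
mask = ~np.isnan(a)   # True where a holds a real value

alias = a             # plain assignment: alias and a are the same array
z = a.copy()          # independent buffer, NaN positions carried over

vals = a[mask]
# ndarray.std defaults to ddof=0, matching scipy's zscore default.
z[mask] = (vals - vals.mean()) / vals.std()
```

After this, a still holds the original values and z holds the z-scores, with NaN wherever a was NaN.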