How to z-score normalize a pandas column that contains NaNs?

Time: 2022-06-13 22:59:15

I have a pandas DataFrame with a column of real values that I want to z-score normalize:

>> a
array([    nan,  0.0767,  0.4383,  0.7866,  0.8091,  0.1954,  0.6307,
        0.6599,  0.1065,  0.0508])
>> df = pandas.DataFrame({"a": a})

The problem is that a single NaN value turns the entire result into NaN:

>> from scipy.stats import zscore
>> zscore(df["a"])
array([ nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan,  nan])

What's the correct way to apply zscore (or an equivalent function not from scipy) to a column of a pandas DataFrame and have it ignore the NaN values? I'd like the result to have the same dimensions as the original column, with np.nan for values that can't be normalized.

Edit: maybe the best solution is to use scipy.stats.nanmean and scipy.stats.nanstd? I don't see why the degrees of freedom for std need to be changed for this purpose:

zscore = lambda x: (x - scipy.stats.nanmean(x)) / scipy.stats.nanstd(x)
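
For readers on a recent SciPy, where `scipy.stats.nanmean` and `scipy.stats.nanstd` are no longer available, the same idea can be sketched with NumPy's `np.nanmean` and `np.nanstd`. The function name `nan_zscore` is made up for this illustration. Note that `np.nanstd` defaults to `ddof=0`, which is also the default used by `scipy.stats.zscore`.

```python
import numpy as np

def nan_zscore(x, ddof=0):
    """Z-score that skips NaNs when computing mean and std.

    NaN entries in the input stay NaN in the output, so the result
    has the same shape as the input. (Hypothetical helper name.)
    """
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=ddof)

a = np.array([np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
              0.1954, 0.6307, 0.6599, 0.1065, 0.0508])
z = nan_zscore(a)
print(z)
```

The first element of `z` is NaN and the remaining nine are ordinary z-scores computed from the non-NaN values only.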

2 answers

#1


18  

Well, the pandas versions of mean and std will handle the NaN, so you could just compute it that way (to get the same result as scipy's zscore, I think you need to use ddof=0 on std):

df['zscore'] = (df.a - df.a.mean())/df.a.std(ddof=0)
print(df)

        a    zscore
0     NaN       NaN
1  0.0767 -1.148329
2  0.4383  0.071478
3  0.7866  1.246419
4  0.8091  1.322320
5  0.1954 -0.747912
6  0.6307  0.720512
7  0.6599  0.819014
8  0.1065 -1.047803
9  0.0508 -1.235699
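
To make the ddof remark concrete, here is a small sketch (variable names are only illustrative) comparing `ddof=0` with the pandas default of `ddof=1`. Both skip the NaN, and the two results differ only by the constant factor sqrt((n-1)/n), where n is the number of non-NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, 0.0767, 0.4383, 0.7866, 0.8091,
                         0.1954, 0.6307, 0.6599, 0.1065, 0.0508]})

centered = df["a"] - df["a"].mean()       # mean() skips NaN by default
z_pop = centered / df["a"].std(ddof=0)    # population std, as scipy's zscore uses
z_sample = centered / df["a"].std()       # pandas default is ddof=1

n = df["a"].count()                       # number of non-NaN values
ratio = (z_sample / z_pop).dropna()       # constant for every row
print(ratio.iloc[0], np.sqrt((n - 1) / n))
```

So the choice of ddof only rescales the scores; which one is appropriate depends on whether the output needs to match `scipy.stats.zscore` exactly.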

#2


4  

You could ignore nans using isnan.

import numpy as np

z = a.copy()             # copy, so assigning into z does not overwrite a
mask = ~np.isnan(a)      # True where a holds a real number
z[mask] = zscore(a[mask])
pandas.DataFrame({'a':a,'Zscore':z})

     Zscore       a
0       NaN     NaN
1 -1.148329  0.0767
2  0.071478  0.4383
3  1.246419  0.7866
4  1.322320  0.8091
5 -0.747912  0.1954
6  0.720512  0.6307
7  0.819014  0.6599
8 -1.047803  0.1065
9 -1.235699  0.0508
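
One detail worth keeping in mind with this approach: in NumPy, plain assignment such as `z = a` binds a second name to the same array, so writing into `z` also changes `a`. A minimal sketch of the difference from `a.copy()`, using made-up variable names and a shortened array:

```python
import numpy as np

a = np.array([np.nan, 0.0767, 0.4383])

alias = a                # second name for the same array, no data is copied
independent = a.copy()   # separate array with its own data

alias[1] = 99.0          # writes through to a
independent[2] = -1.0    # leaves a untouched

print(a)
print(alias is a, independent is a)
```

If the original column has to stay intact next to the z-scores, the array that receives the z-scores should therefore be a copy.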
