I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace the nans with the averages of the columns they are in?
8 solutions
#1
42
No loops required:
print(a)
[[ 0.93230948 nan 0.47773439 0.76998063]
[ 0.94460779 0.87882456 0.79615838 0.56282885]
[ 0.94272934 0.48615268 0.06196785 nan]
[ 0.64940216 0.74414127 nan nan]]
#Obtain mean of columns as you need, nanmean is just convenient.
col_mean = np.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219 0.7030395 0.44528687 0.66640474]
#Find indices that you need to replace
inds = np.where(np.isnan(a))
#Place column means in the indices. Align the arrays using take
a[inds] = np.take(col_mean, inds[1])
print(a)
[[ 0.93230948 0.7030395 0.47773439 0.76998063]
[ 0.94460779 0.87882456 0.79615838 0.56282885]
[ 0.94272934 0.48615268 0.06196785 0.66640474]
[ 0.64940216 0.74414127 0.44528687 0.66640474]]
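The steps above can be wrapped in a small helper. This is a minimal sketch, assuming a 2-D float array; the name fill_nan_with_col_mean and the sample data are hypothetical, and the helper works on a copy rather than in place.

```python
import numpy as np

def fill_nan_with_col_mean(a):
    """Return a copy of 2-D array `a` with NaNs replaced by column means."""
    out = np.array(a, dtype=float)          # copy, so the input is left untouched
    col_mean = np.nanmean(out, axis=0)      # per-column mean, ignoring NaNs
    rows, cols = np.where(np.isnan(out))    # positions of the NaNs
    out[rows, cols] = col_mean[cols]        # pick the mean of each NaN's column
    return out

a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 9.0, 9.0]])
filled = fill_nan_with_col_mean(a)
print(filled)
```

Indexing col_mean with the column indices does the same alignment as np.take in the answer.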
#2
6
Using masked arrays
The standard way to do this using only numpy would be to use the masked array module.
Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.
Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns: it emits a RuntimeWarning and returns nan for them.
Suppose you have an array a:
>>> a
array([[ 0., nan, 10., nan],
[ 1., 6., nan, nan],
[ 2., 7., 12., nan],
[ 3., 8., nan, nan],
[ nan, 9., 14., nan]])
>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)
array([[ 0. , 7.5, 10. , 0. ],
[ 1. , 6. , 12. , 0. ],
[ 2. , 7. , 12. , 0. ],
[ 3. , 8. , 12. , 0. ],
[ 1.5, 9. , 14. , 0. ]])
Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
Also note how the all-nan column is handled. Its mean is masked, since it would be the mean of zero elements; np.where reads the underlying data of the masked result, which is 0 here. If you want that fallback to be explicit, call .filled(0) on the masked mean. The method using nanmean leaves the all-nan column as nan:
>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[ 0. , 7.5, 10. , nan],
[ 1. , 6. , 12. , nan],
[ 2. , 7. , 12. , nan],
[ 3. , 8. , 12. , nan],
[ 1.5, 9. , 14. , nan]])
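If the value used for all-nan columns should be stated explicitly rather than taken from the masked result's data buffer, one option is to fill the masked mean before passing it to np.where. A sketch, assuming 0.0 is an acceptable fallback; the name fallback is hypothetical.

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0.0, np.nan, 10.0, np.nan],
              [1.0, 6.0, np.nan, np.nan],
              [2.0, 7.0, 12.0, np.nan],
              [3.0, 8.0, np.nan, np.nan],
              [np.nan, 9.0, 14.0, np.nan]])

fallback = 0.0  # assumed value for columns with no valid entries
# masked_invalid masks NaN (and inf); filled() turns the masked mean into a plain ndarray
col_mean = ma.masked_invalid(a).mean(axis=0).filled(fallback)
result = np.where(np.isnan(a), col_mean, a)
print(result)
```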
Explanation
Converting a into a masked array gives you
>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
[[0.0 -- 10.0 --]
[1.0 6.0 -- --]
[2.0 7.0 12.0 --]
[3.0 8.0 -- --]
[-- 9.0 14.0 --]],
mask =
[[False True False True]
[False False True True]
[False False False True]
[False False True True]
[ True False False True]],
fill_value = 1e+20)
And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:
>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
mask = [False False False True],
fill_value = 1e+20)
Further, note how the mask nicely handles the column which is all-nan!
Finally, np.where does the job of replacement.
Row-wise mean
Replacing nan values with the row-wise mean instead of the column-wise mean requires a tiny change for broadcasting to take effect:
>>> a
array([[ 0., 1., 2., 3., nan],
[ nan, 6., 7., 8., 9.],
[ 10., nan, 12., nan, 14.],
[ nan, nan, nan, nan, nan]])
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[ 0. , 1. , 2. , 3. , 1.5],
[ 7.5, 6. , 7. , 8. , 9. ],
[ 10. , 12. , 12. , 12. , 14. ],
[ 0. , 0. , 0. , 0. , 0. ]])
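The same row-wise fill can also be written without masked arrays by summing and counting the valid entries directly. Passing keepdims=True keeps the reduced axis, so the means have shape (4, 1) and broadcast across each row without np.newaxis. A sketch, assuming 0.0 is an acceptable fallback for rows with no valid entries:

```python
import numpy as np

a = np.array([[0.0, 1.0, 2.0, 3.0, np.nan],
              [np.nan, 6.0, 7.0, 8.0, 9.0],
              [10.0, np.nan, 12.0, np.nan, 14.0],
              [np.nan, np.nan, np.nan, np.nan, np.nan]])

valid = ~np.isnan(a)
sums = np.where(valid, a, 0.0).sum(axis=1, keepdims=True)   # shape (4, 1)
counts = valid.sum(axis=1, keepdims=True)                   # shape (4, 1)
# Divide only where a row has valid entries; other rows keep the 0.0 fallback
row_mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
result = np.where(valid, a, row_mean)
print(result)
```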
#3
3
If partial is your original data, and replace is an array of the same shape containing averaged values then this code will use the value from partial if one exists.
complete = np.where(np.isnan(partial), replace, partial)
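The answer leaves open how replace is built. One way, sketched here under the assumption that column means are wanted, is to broadcast np.nanmean across the rows; the contents of partial below are made-up sample data.

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0],
                    [np.nan, 8.0]])

col_mean = np.nanmean(partial, axis=0)               # mean of each column, ignoring NaNs
replace = np.broadcast_to(col_mean, partial.shape)   # read-only view with the shape of partial
complete = np.where(np.isnan(partial), replace, partial)
print(complete)
```

Since np.where broadcasts its arguments, passing col_mean directly in place of replace gives the same result.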
#4
2
This isn't very clean, but I can't think of a way to do it other than iterating:
#example
import numpy as np
a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan
indices = np.where(np.isnan(a))  # returns a tuple of row and column index arrays
for row, col in zip(*indices):
    # mean of the non-nan entries in this column
    a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
#5
2
Alternative: Replacing NaNs with interpolation of columns.
import numpy as np

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X
Example use:
使用示例:
X_incomplete = np.array([[10, 20, 30 ],
[np.nan, 30, np.nan],
[np.nan, np.nan, 50 ],
[40, 50, np.nan ]])
X_complete = interpolate_nans(X_incomplete)
print(X_complete)
[[10, 20, 30 ],
[20, 30, 40 ],
[30, 40, 50 ],
[40, 50, 50 ]]
I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
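Note that interpolate_nans overwrites its argument, and np.interp has nothing to interpolate from when a column contains no valid values. A sketch of a variant that works on a copy and leaves such columns untouched; the name interpolate_nans_copy and the sample data are hypothetical.

```python
import numpy as np

def interpolate_nans_copy(X):
    """Return a copy of X with NaNs linearly interpolated along each column."""
    out = np.array(X, dtype=float)  # copy, so the caller's array is not modified
    for j in range(out.shape[1]):
        mask_j = np.isnan(out[:, j])
        if mask_j.all() or not mask_j.any():
            continue  # nothing to interpolate from, or nothing to fill
        out[mask_j, j] = np.interp(np.flatnonzero(mask_j),
                                   np.flatnonzero(~mask_j),
                                   out[~mask_j, j])
    return out

X = np.array([[10.0, 20.0, np.nan],
              [np.nan, 30.0, np.nan],
              [np.nan, np.nan, np.nan],
              [40.0, 50.0, np.nan]])
Y = interpolate_nans_copy(X)
print(Y)
```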
#6
1
To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column. (Since a contains no nan, nanmean here is the plain column mean, zeros included.)
In [231]: a
Out[231]:
array([[0, 3, 6],
[2, 0, 0]])
In [232]: col_mean = np.nanmean(a, axis=0)
Out[232]: array([ 1. , 1.5, 3. ])
In [228]: np.where(np.equal(a, 0), col_mean, a)
Out[228]:
array([[ 1. , 3. , 6. ],
[ 2. , 1.5, 3. ]])
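The column means in this example are computed with the zeros included. If the zeros stand for missing values, you may prefer to leave them out of the mean. A sketch of that variant using ma.masked_equal, assuming every column has at least one nonzero entry (the 0.0 passed to filled is only a fallback for columns that do not):

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0, 3, 6],
              [2, 0, 0]])

# Mask the zeros so they do not contribute to the column means
nonzero_mean = ma.masked_equal(a, 0).mean(axis=0).filled(0.0)
result = np.where(a == 0, nonzero_mean, a)
print(result)
```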
#7
0
Using simple functions with loops:
import numpy as np

a=[[0.93230948, np.nan, 0.47773439, 0.76998063],
[0.94460779, 0.87882456, 0.79615838, 0.56282885],
[0.94272934, 0.48615268, 0.06196785, np.nan],
[0.64940216, 0.74414127, np.nan, np.nan],
[0.64940216, 0.74414127, np.nan, np.nan]]
print("------- original array -----")
for aa in a:
    print(aa)
# GET COLUMN MEANS:
ta = np.array(a).T.tolist()  # transpose the array;
col_means = list(map(lambda x: np.nanmean(x), ta))  # get means;
print("column means:", col_means)
# REPLACE NAN ENTRIES WITH COLUMN MEANS:
nrows = len(a); ncols = len(a[0])  # get number of rows & columns;
for r in range(nrows):
    for c in range(ncols):
        if np.isnan(a[r][c]):
            a[r][c] = col_means[c]
print("------- means added -----")
for aa in a:
    print(aa)
Output:
------- original array -----
[0.93230948, nan, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, nan]
[0.64940216, 0.74414127, nan, nan]
[0.64940216, 0.74414127, nan, nan]
column means: [0.82369018599999999, 0.71331494500000003, 0.44528687333333333, 0.66640474000000005]
------- means added -----
[0.93230948, 0.71331494500000003, 0.47773439, 0.76998063]
[0.94460779, 0.87882456, 0.79615838, 0.56282885]
[0.94272934, 0.48615268, 0.06196785, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
[0.64940216, 0.74414127, 0.44528687333333333, 0.66640474000000005]
The for loops can also be written with list comprehension:
new_a = [[col_means[c] if np.isnan(a[r][c]) else a[r][c]
for c in range(ncols) ]
for r in range(nrows) ]
#8
-2
You might want to try this built-in function. Note that it replaces nan with zero (and infinities with very large finite numbers), not with column means:
x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([ 1.79769313e+308, -1.79769313e+308, 0.00000000e+000,
-1.28000000e+002, 1.28000000e+002])