Most efficient way to forward-fill NaN values in a numpy array

Time: 2022-03-04 21:26:05

Example Problem

As a simple example, consider the numpy array arr as defined below:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])

Tried thus far

I've tried using for-loops:

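for row_idx in range(arr.shape[0]):
    # Start at column 1: column 0 has no left neighbour, and col_idx - 1 == -1
    # would wrap around and read the last column of the row.
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]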

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

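import pandas as pd
df = pd.DataFrame(arr)
# ffill(axis=1) is the forward-fill shorthand for fillna(method='ffill', axis=1);
# to_numpy() replaces as_matrix(), which was removed in pandas 1.0.
arr = df.ffill(axis=1).to_numpy()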

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?

Summary

Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)

Update: Solutions Comparison

I've tried to time all solutions thus far. This was my setup script:

import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    # as_matrix() was removed in pandas 1.0; ffill()/to_numpy() are the
    # equivalent calls on current pandas.
    out = df.ffill(axis=1).to_numpy()
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
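
One caveat on the numbers above: each %timeit statement also times the call to random_array(), so array generation is included in every measurement. Below is a minimal sketch, not the script that produced the timings above, that builds the input once, checks that two of the solutions agree, and then times only the fill step. The fixed seed, the name sample, and the repeat counts are my own assumptions.

```python
import timeit

import numpy as np


def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


def numpy_fill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


# Build the test input once, outside the timed code (seed is an assumption).
rng = np.random.default_rng(0)
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
sample = rng.choice(choices, size=(1000, 10))

# Both functions should give the same result; assert_array_equal treats
# NaNs in matching positions as equal.
np.testing.assert_array_equal(loops_fill(sample), numpy_fill(sample))

# Time only the fill itself: best of 3 runs of 10 calls each.
for fill in (loops_fill, numpy_fill):
    best = min(timeit.repeat(lambda: fill(sample), number=10, repeat=3)) / 10
    print(f"{fill.__name__}: {best * 1e6:.0f} µs per call")
```

The absolute numbers will differ by machine and library version, so only the ratio between solutions is meaningful.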

2 Solutions

#1


19  

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
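
To make the indexing trick easier to follow, here is a sketch that prints the intermediate arrays for the 3x5 array from the question. The steps are the same as in the snippet above; only the helper name rows is my own addition.

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Valid cells keep their own column index; NaN cells get 0.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
print(idx)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# A running maximum along each row turns every entry into the column index
# of the most recent valid value at or before that position.
np.maximum.accumulate(idx, axis=1, out=idx)
print(idx)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Pair each row index with those column indices to gather the filled values.
rows = np.arange(idx.shape[0])[:, None]  # hypothetical helper name
out = arr[rows, idx]
print(out)
# [[5. 5. 5. 7. 2.]
#  [3. 3. 1. 8. 8.]
#  [4. 9. 6. 6. 6.]]
```

Note that a row which starts with NaN keeps that leading NaN: there is nothing to its left, so its index stays 0 and points back at the NaN itself.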

#2


3  

Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...
