Pandas - find the longest stretch without NaN values

I have a pandas dataframe "df", a sample of which is below:

   time  x
0  1     1
1  2     NaN
2  3     3
3  4     NaN
4  5     8
5  6     7
6  7     5
7  8     NaN

The real frame is much bigger. I am trying to find the longest stretch of non-NaN values in the "x" series, and print out the starting and ending index of that stretch. Is this possible?

Thank You

5 solutions

#1

Here's a vectorized approach with NumPy tools -

import numpy as np

a = df.x.values  # Extract the relevant column from the dataframe as an array
m = np.concatenate(( [True], np.isnan(a), [True] ))  # NaN mask, padded with True at both ends
ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1,2)   # Start-stop limits of each non-NaN run
start,stop = ss[(ss[:,1] - ss[:,0]).argmax()]  # Limits of the longest run (stop is exclusive)

Sample run -

In [474]: a
Out[474]: 
array([  1.,  nan,   3.,  nan,  nan,  nan,  nan,   8.,   7.,   5.,   2.,
         5.,  nan,  nan])

In [475]: start, stop
Out[475]: (7, 12)

The intervals are set up so that stop minus start gives the length of each interval, which means stop is exclusive. So, if by ending index you mean the index of the last non-NaN element, subtract one from stop.
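The steps above can be wrapped in a small helper. This is only a sketch: the name longest_valid_run, the inclusive end position, and the None return for all-NaN input are assumptions added here, not part of the original answer.

```python
import numpy as np

def longest_valid_run(values):
    """Return (first, last) positions of the longest NaN-free run, or None."""
    a = np.asarray(values, dtype=float)
    # Pad with True so runs touching either end still yield a start/stop pair
    m = np.concatenate(([True], np.isnan(a), [True]))
    ss = np.flatnonzero(m[1:] != m[:-1]).reshape(-1, 2)
    if len(ss) == 0:
        # No non-NaN value at all (assumed behaviour: report nothing)
        return None
    start, stop = ss[(ss[:, 1] - ss[:, 0]).argmax()]
    # stop is exclusive, so subtract one to get the last valid position
    return int(start), int(stop) - 1

a = [1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan]
print(longest_valid_run(a))  # (7, 11)
```

The positions are positional offsets into the array; with a non-default dataframe index they would need to be mapped through df.index to get labels.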

#2

So you can get the index values of the NaN's in the following way:

import numpy as np

index = df['x'].index[df['x'].apply(np.isnan)]
df_index = df.index.values.tolist()
[df_index.index(indexValue) for indexValue in index]

>>> [0, 1, 3, 7]

Then one solution would be to look for the largest difference between consecutive NaN positions; the longest stretch of non-NaN values lies between that pair of NaNs.
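A minimal sketch of that idea, assuming the sample frame from the question. The sentinel positions -1 and len(df) are an addition made here so that a stretch touching the start or end of the column is also considered; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7, 8],
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})

# Positions of the NaNs, framed by sentinels just outside the column
nan_pos = np.flatnonzero(df['x'].isnull().values)
bounds = np.concatenate(([-1], nan_pos, [len(df)]))

# Number of valid values between each pair of consecutive NaNs
gaps = np.diff(bounds) - 1
i = gaps.argmax()

# First and last position of the longest valid stretch
first, last = bounds[i] + 1, bounds[i + 1] - 1
print(first, last, gaps[i])  # 4 6 3
```

If the column contains only NaN, the largest gap is 0 and first/last are not meaningful, so that case should be checked separately.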

#3

pandas

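import pandas as pd

# Per-group aggregations: first/last non-NaN label and the non-NaN count
f = dict(
    Start=pd.Series.first_valid_index,
    Stop=pd.Series.last_valid_index,
    Stretch='count'
)

# Each NaN starts a new group, so a group is one NaN followed by a run of valid values.
# Renaming through a dict in groupby().agg was removed in pandas 1.0; on newer
# versions, named aggregation via .agg(**f) is the likely replacement.
agged = df.x.groupby(df.x.isnull().cumsum()).agg(f)
agged.loc[agged.Stretch.idxmax(), ['Start', 'Stop']].values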

array([ 4.,  6.])

numpy

import numpy as np

def pir(x):
    # pad with np.nan on both sides
    x = np.append(np.nan, np.append(x, np.nan))
    # find where null
    w = np.where(np.isnan(x))[0]
    # diff to find length of stretch
    # argmax to find where largest stretch
    a = np.diff(w).argmax()
    # shift the bounding nulls back to unpadded positions:
    # first and last valid element of the stretch
    return w[[a, a + 1]] + np.array([0, -2])

demo

pir(df.x.values)

array([4, 6])

a = np.array([1, np.nan, 3, np.nan, np.nan, np.nan, np.nan, 8, 7, 5, 2, 5, np.nan, np.nan])
pir(a)

array([ 7, 11])

#4

Maybe a faster way would be the following (given that you say you have a long dataframe, speed matters):

In [19]: df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

In [20]: index = df['x'].isnull()

In [21]: df[index].index.values
Out[21]: array([1, 3, 7])
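The snippet above stops at the NaN positions. One possible way to go from the isnull mask to the longest stretch, staying in pandas, is sketched below; the variable names are illustrative and no timing has been done to back the speed claim.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7, 8],
                   'x': [1, np.nan, 3, np.nan, 8, 7, 5, np.nan]})

is_nan = df['x'].isnull()

# Each NaN bumps the counter, so valid rows between two NaNs share one run id
run_id = is_nan.cumsum()[~is_nan]

# Id of the run with the most rows
longest = run_id.value_counts().idxmax()

# Index labels of the rows that belong to that run
rows = run_id.index[run_id == longest]
print(rows[0], rows[-1])  # 4 6
```

This returns index labels rather than positions, and it raises on a column that is entirely NaN because there is no run to pick.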

#5

Another method is to use scipy.ndimage.label (older SciPy releases also expose it as scipy.ndimage.measurements.label). It segments the non-null rows into contiguous groups and gives each group a different label. You can then group your dataframe by those labels and take the biggest group.

Set-up

import pandas as pd
import numpy as np
from scipy.ndimage import label
df = pd.DataFrame({'time':[1,2,3,4,5,6,7,8],'x':[1,np.nan,3,np.nan,8,7,5,np.nan]})

Retrieving longest stretch without nan

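valid_rows = ~df.isnull().any(axis=1)
# Use a new name for the result so the label function is not shadowed
labels, num_feature = label(valid_rows.values)
# idxmax returns the group label (argmax would return a position in newer pandas)
label_of_biggest_group = valid_rows.groupby(labels).count().drop(0).idxmax()
print(df.loc[labels == label_of_biggest_group])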

Result

   time    x
4     5  8.0
5     6  7.0
6     7  5.0

Note

The label 0 contains the background data, which in our case is the rows with NaN values. It has to be dropped because the number of NaN rows may be greater than or equal to the size of your biggest group. num_feature is the number of contiguous stretches without NaN.
