How do I drop rows of a pandas DataFrame whose value in a certain column is NaN?

Date: 2022-07-18 07:23:55

I have a DataFrame:

>>> df
                 STK_ID  EPS  cash
STK_ID RPT_Date                   
601166 20111231  601166  NaN   NaN
600036 20111231  600036  NaN    12
600016 20111231  600016  4.3   NaN
601009 20111231  601009  NaN   NaN
601939 20111231  601939  2.5   NaN
000001 20111231  000001  NaN   NaN

Then I just want the records whose EPS is not NaN; that is, df.drop(....) should return the DataFrame below:

                  STK_ID  EPS  cash
STK_ID RPT_Date                   
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN

How do I do that?

9 Answers

#1


340  

Don't drop. Just take rows where EPS is finite:

import numpy as np

df = df[np.isfinite(df['EPS'])]

#2


582  

This question is already resolved, but...

...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

In [24]: df = pd.DataFrame(np.random.randn(10,3))

In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;

In [26]: df
Out[26]:
          0         1         2
0       NaN       NaN       NaN
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [27]: df.dropna()     #drop all rows that have any NaN values
Out[27]:
          0         1         2
1  2.677677 -1.466923 -0.750366
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295

In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
Out[28]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
4       NaN       NaN  0.050742
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
8       NaN       NaN  0.637482
9 -0.310130  0.078891       NaN

In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
Out[29]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
Out[30]:
          0         1         2
1  2.677677 -1.466923 -0.750366
2       NaN  0.798002 -0.906038
3  0.672201  0.964789       NaN
5 -1.250970  0.030561 -2.678622
6       NaN  1.036043       NaN
7  0.049896 -0.308003  0.823295
9 -0.310130  0.078891       NaN

There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

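For example, dropping columns instead of rows only takes the axis argument; a minimal sketch (the frame and column names below are made up to illustrate):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0],
                   "b": [np.nan, np.nan],
                   "c": [3.0, np.nan]})

# axis=1 (or axis='columns') makes dropna work column-wise
print(df.dropna(axis=1))             # keeps only 'a', the one NaN-free column
print(df.dropna(axis=1, how="all"))  # drops only 'b', the all-NaN column
```
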
Pretty handy!

#3


82  

I know this has already been answered, but just for the sake of a purely pandas solution to this specific question as opposed to the general description from Aman (which was wonderful) and in case anyone else happens upon this:

import pandas as pd
df = df[pd.notnull(df['EPS'])]

#4


24  

You can use this:

df.dropna(subset=['EPS'], how='all', inplace=True)
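
The how argument only matters once subset names several columns; a small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS":  [np.nan, 4.3, np.nan],
                   "cash": [np.nan, np.nan, 12.0]})

# how='all' drops a row only when every column in subset is NaN;
# the default how='any' drops it as soon as one of them is NaN.
print(df.dropna(subset=["EPS", "cash"], how="all"))  # drops only row 0
print(df.dropna(subset=["EPS", "cash"]))             # drops all three rows
```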

#5


19  

You could use the DataFrame method notnull, the inverse of isnull, or numpy.isnan:

In [332]: df[df.EPS.notnull()]
Out[332]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [334]: df[~df.EPS.isnull()]
Out[334]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN


In [347]: df[~np.isnan(df.EPS)]
Out[347]:
   STK_ID  RPT_Date  STK_ID.1  EPS  cash
2  600016  20111231    600016  4.3   NaN
4  601939  20111231    601939  2.5   NaN

#6


14  

Simplest of all solutions:

filtered_df = df[df['EPS'].notnull()]

The above solution is way better than using np.isfinite().

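One concrete reason, sketched below: np.isfinite() also filters out +/-inf and raises on object-dtype columns, while notnull() only tests for missing values:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.inf])
print(s[s.notnull()])     # keeps 1.0 and inf; only NaN counts as missing
print(s[np.isfinite(s)])  # keeps only 1.0; inf is filtered out as well

obj = pd.Series(["a", None, "b"], dtype=object)
print(obj[obj.notnull()])  # notnull() works on any dtype
# np.isfinite(obj) would raise TypeError here, since the dtype is object
```
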
#7


8  

Yet another solution, which uses the fact that np.nan != np.nan:

In [149]: df.query("EPS == EPS")
Out[149]:
                 STK_ID  EPS  cash
STK_ID RPT_Date
600016 20111231  600016  4.3   NaN
601939 20111231  601939  2.5   NaN
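
A minimal sketch of why this works, on an illustrative frame: NaN is the only value that compares unequal to itself, so "EPS == EPS" is True exactly on the non-NaN rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5]})

# NaN != NaN, so the comparison is True only for non-NaN entries
print(df.query("EPS == EPS"))

# the same idea written as a plain boolean mask, without query()
print(df[df["EPS"] == df["EPS"]])
```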

#8


1  

It may be added that '&' can be used to combine additional conditions, e.g.

df = df[(df.EPS > 2.0) & (df.EPS < 4.0)]

Notice that when evaluating the statements, pandas needs parentheses.

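The parentheses matter because '&' binds more tightly than the comparison operators in Python: without them, the expression would be parsed as df.EPS > (2.0 & df.EPS) < 4.0 and raise a TypeError. A quick sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"EPS": [1.5, 2.5, 3.5, 4.5]})

# each comparison must be parenthesized before combining with '&'
print(df[(df.EPS > 2.0) & (df.EPS < 4.0)])  # rows with EPS 2.5 and 3.5
```
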
#9


-1  

For some reason none of the previously submitted answers worked for me. This basic solution did:

df = df[df.EPS >= 0]

Though of course that will drop rows with negative numbers, too. So if you want those, it's probably smart to add this afterwards as well:

df = df[df.EPS <= 0]
