Handling missing data in pandas

Source: Python for Data Analysis

NA handling methods

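Method	Description
dropna	Filters axis labels based on whether the values for each label contain missing data; a threshold (thresh) adjusts how much missing data is tolerated
fillna	Fills missing data with a specified value or with an interpolation method (such as ffill or bfill)
isnull	Returns an object of boolean values that flag which values are missing (NA); the result has the same container type (Series or DataFrame) as the source
notnull	The negation of isnull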
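
The table mentions a tolerance threshold for dropna and the boolean masks returned by isnull/notnull. A minimal sketch of both follows; the sample frame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not from the book
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
    "c": [3.0, 6.0, np.nan],
})

# isnull returns a boolean DataFrame with the same shape and labels as df
mask = df.isnull()

# notnull is the element-wise negation of isnull
opposite = df.notnull()

# thresh=2 keeps only the rows that have at least 2 non-missing values
kept = df.dropna(thresh=2)
print(kept)
```

Here the last row has no valid values and is dropped, while the middle row survives because two of its three values are present.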

Filtering out missing data (dropna)

Series

In [1]: import pandas as pd

In [2]: from pandas import DataFrame, Series

In [3]: import numpy as np

In [4]: from numpy import nan as NA

In [5]: data = Series([1, NA, 3.5, NA, 7])

In [6]: data.dropna()
Out[6]:
0    1.0
2    3.5
4    7.0
dtype: float64

In [7]: data[data.notnull()]
Out[7]:
0    1.0
2    3.5
4    7.0
dtype: float64

DataFrame

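By default, dropna on a DataFrame drops any row that contains a missing value.
Passing how='all' drops only the rows in which every value is NA.
To drop columns instead, pass axis=1.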
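
The three behaviours described above can be sketched as follows. The frame contents are illustrative values chosen so that each option produces a different result.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row 2 is entirely NA, rows 1 and 3 are partly NA
df = pd.DataFrame([[1.0, 6.5, 3.0],
                   [1.0, np.nan, np.nan],
                   [np.nan, np.nan, np.nan],
                   [np.nan, 6.5, 3.0]])

# Default (how='any'): drop every row that contains at least one NA
rows_any = df.dropna()

# how='all': drop only the rows where every value is NA
rows_all = df.dropna(how="all")

# Add a column that is entirely NA, then drop columns with axis=1
df[4] = np.nan
cols_all = df.dropna(axis=1, how="all")

print(rows_any)
print(rows_all)
print(cols_all)
```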

Filling in missing data (fillna)!!

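Call with a constant: df.fillna(0)
Call with a dict to fill each column with a different value: df.fillna({1: 0.5, 3: -1})
fillna returns a new object by default!! To modify in place: _ = df.fillna(0, inplace=True)
The interpolation methods that work with reindex (such as ffill and bfill) can also be used with fillna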
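
A short sketch of the calls listed above, using a made-up frame whose column labels 1 and 3 match the dict example. For forward filling it uses the df.ffill() method rather than fillna(method='ffill'), on the assumption that you may be running a recent pandas release where the method argument of fillna is deprecated.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with integer column labels 1 and 3
df = pd.DataFrame({1: [np.nan, 2.0, np.nan],
                   3: [4.0, np.nan, np.nan]})

# Constant fill: every NA becomes 0; df itself is left unchanged
filled_const = df.fillna(0)

# Dict fill: column 1 gets 0.5, column 3 gets -1
filled_dict = df.fillna({1: 0.5, 3: -1})

# Forward fill: propagate the last valid value down each column
filled_ffill = df.ffill()

# In-place fill on a copy, so the original df stays intact for comparison
work = df.copy()
work.fillna(0, inplace=True)

print(filled_dict)
print(filled_ffill)
```

Note that forward fill cannot fill a leading NA (the first value of column 1 here), because there is no earlier value to propagate.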

Replacing values

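Filling in missing data with fillna can be seen as a special case of value replacement. The replace method offers a simpler and more flexible way to do the same thing.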

In [11]: data = Series([1.,-999.,2.,-999.,-1000.,3.])

In [12]: data
Out[12]:
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [13]: data.replace(-999, np.nan)
Out[13]:
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [14]: data.replace([-999,-1000], np.nan)
Out[14]:
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [15]: data.replace([-999,-1000], [np.nan,0])
Out[15]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [16]: data.replace({-999 : np.nan, -1000 : 0})
Out[16]:
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64
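
To make the "special case" relationship concrete, the sketch below checks that filling NA with fillna gives the same result as replacing NaN with replace, and then chains replace with dropna to discard sentinel values. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# fillna(0) and replace(np.nan, 0) produce the same Series
data = pd.Series([1.0, np.nan, 2.0, np.nan])
via_fillna = data.fillna(0)
via_replace = data.replace(np.nan, 0)
print(via_fillna.equals(via_replace))

# Turn sentinel values into NaN, then filter them out with dropna
raw = pd.Series([1.0, -999.0, 2.0, -999.0, -1000.0, 3.0])
cleaned = raw.replace([-999, -1000], np.nan).dropna()
print(cleaned)
```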
