pandas (5): Handling Missing Data and Hierarchical Indexing

Date: 2021-11-03 21:13:00

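pandas uses the floating-point value NaN (Not a Number) to represent missing data in both floating-point and non-floating-point arrays. It is simply a sentinel that is easy to detect.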

>>> import numpy as np
>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> string_data = Series(['aardvark','artichoke',np.nan,'avocado'])
>>> string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
>>> string_data.isnull()
0    False
1    False
2     True
3    False
dtype: bool
>>> string_data.notnull()
0     True
1     True
2    False
3     True
dtype: bool
>>> string_data.fillna("miss")
0     aardvark
1    artichoke
2         miss
3      avocado
dtype: object
>>> string_data
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
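As a small supplement to the session above, here is a minimal sketch showing that Python's None is detected as missing in the same way as np.nan. The variable names s and mask are only for illustration, and isna is used as an alias of isnull.

```python
import numpy as np
import pandas as pd

# Both None and np.nan count as missing values in a Series of strings.
s = pd.Series(["aardvark", None, np.nan, "avocado"])

mask = s.isna()          # isna is an alias of isnull
print(mask.tolist())     # [False, True, True, False]
print(int(mask.sum()))   # 2 missing entries
```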

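NA handling methods

Method   Description
dropna   Filters axis labels based on whether the values for each label contain missing data; a threshold controls how much missing data is tolerated
fillna   Fills missing data with a specified value or with an interpolation method
isnull   Returns an object of boolean values indicating which values are missing; the result has the same type as the original object
notnull  The negation of isnull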

 

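A note on the dropna method:

  Common parameters:

    axis: the axis to filter along (0 for rows, 1 for columns)

    how: "any" or "all". "any" drops a row (or column) that contains at least one missing value; "all" drops it only when every value in it is missing.

    thresh: an integer, the minimum number of non-null values a row (or column) needs in order to be kept.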

 

Filtering out missing data

For a Series there are two ways to do this:

  

>>> from numpy import nan as NA
>>>
>>>
>>> data = Series([1,NA,3.2,NA,5])
>>> data
0    1.0
1    NaN
2    3.2
3    NaN
4    5.0
dtype: float64
# Method 1: dropna
>>> data.dropna()
0    1.0
2    3.2
4    5.0
dtype: float64
# Method 2: boolean indexing with notnull
>>> data[data.notnull()]
0    1.0
2    3.2
4    5.0
dtype: float64

For a DataFrame things are a bit more complicated. By default, dropna drops every row that contains a missing value.

>>> frame = DataFrame([[1,6.5,3],[1,NA,NA],[NA,NA,NA],[NA,6.5,3]])
>>>
>>>
>>>
>>> frame
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> clean_data = frame.dropna()  # by default, drop every row containing a missing value
>>> clean_data
     0    1    2
0  1.0  6.5  3.0

>>> frame.dropna(how='all')  # drop only rows in which every value is missing
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
>>> frame.dropna(axis=1, how='all')  # drop columns in which every value is missing (none here)
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> frame.dropna(thresh=2)  # drop rows with fewer than 2 non-missing values
     0    1    2
0  1.0  6.5  3.0
3  NaN  6.5  3.0
>>>
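In the session above no column is entirely missing, so the axis=1 call leaves the frame unchanged. The following sketch uses a small made-up frame (the column names x, y and z are hypothetical) in which column-wise filtering actually removes something.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "x": [1.0, 1.0, 2.0, np.nan],                # 3 non-missing values
    "y": [6.5, np.nan, np.nan, 6.5],             # 2 non-missing values
    "z": [np.nan, np.nan, np.nan, np.nan],       # entirely missing
})

no_empty_cols = frame.dropna(axis=1, how="all")   # drops only "z"
at_least_three = frame.dropna(axis=1, thresh=3)   # keeps columns with >= 3 non-missing values

print(list(no_empty_cols.columns))    # ['x', 'y']
print(list(at_least_three.columns))   # ['x']
```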

Filling in missing data

For a DataFrame:

>>> df = DataFrame(np.random.randn(7,3))
>>> df.loc[:4, 1] = NA  # .ix has been removed from pandas; .loc slices by label, end point included
>>> df.loc[:2, 2] = NA
>>> df
          0         1         2
0 -1.362151       NaN       NaN
1 -0.465262       NaN       NaN
2  0.037518       NaN       NaN
3 -2.895224       NaN -2.514141
4 -0.635875       NaN  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(0)  # element-wise fill with a single value
          0         1         2
0 -1.362151  0.000000  0.000000
1 -0.465262  0.000000  0.000000
2  0.037518  0.000000  0.000000
3 -2.895224  0.000000 -2.514141
4 -0.635875  0.000000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
# fill each column with a different value
>>> df.fillna({1:0.5,2:-1 })
          0         1         2
0 -1.362151  0.500000 -1.000000
1 -0.465262  0.500000 -1.000000
2  0.037518  0.500000 -1.000000
3 -2.895224  0.500000 -2.514141
4 -0.635875  0.500000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(method='bfill')  # method selects forward (ffill) or backward (bfill) filling
          0         1         2
0 -1.362151  0.999354 -2.514141
1 -0.465262  0.999354 -2.514141
2  0.037518  0.999354 -2.514141
3 -2.895224  0.999354 -2.514141
4 -0.635875  0.999354  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>> df.fillna(method='bfill', limit=2)  # backward fill at most two consecutive rows
          0         1         2
0 -1.362151       NaN       NaN
1 -0.465262       NaN -2.514141
2  0.037518       NaN -2.514141
3 -2.895224  0.999354 -2.514141
4 -0.635875  0.999354  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812
>>>
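Recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. The sketch below repeats the idea of the session above on a small made-up Series; the data is purely illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan, np.nan, 1.0, np.nan, 2.0])

back = s.bfill()                 # fill each gap from the next valid value
back_limited = s.bfill(limit=2)  # fill at most 2 consecutive missing values per gap
forward = s.ffill()              # fill each gap from the previous valid value

print(back.tolist())          # [1.0, 1.0, 1.0, 1.0, 2.0, 2.0]
print(back_limited.tolist())  # [nan, 1.0, 1.0, 1.0, 2.0, 2.0]
print(forward.tolist())       # [nan, nan, nan, 1.0, 1.0, 2.0]
```

Note that a forward fill cannot fill leading gaps, because there is no earlier value to copy.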

fillna returns a new object by default. To modify the original data in place, pass inplace=True.

>>> df.fillna(0,inplace = True)
>>> df
          0         1         2
0 -1.362151  0.000000  0.000000
1 -0.465262  0.000000  0.000000
2  0.037518  0.000000  0.000000
3 -2.895224  0.000000 -2.514141
4 -0.635875  0.000000  1.722823
5 -0.479897  0.999354 -0.547433
6 -0.744960  0.363400  0.706812

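Parameters of fillna

Parameter  Description
method     Forward or backward fill
value      Scalar value or dict-like object used to fill missing values
axis       Axis to fill along
inplace    Modify the calling object without producing a copy
limit      Maximum number of consecutive values to fill when forward or backward filling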
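The value parameter also accepts a Series, which makes it possible to fill each column with one of its own statistics. A minimal sketch, assuming a small made-up frame with columns a and b:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 8.0]})

# df.mean() is a Series indexed by column name, so each column
# is filled with its own mean (a: 2.0, b: 6.0).
filled = df.fillna(df.mean())

print(filled["a"].tolist())   # [1.0, 2.0, 3.0]
print(filled["b"].tolist())   # [6.0, 4.0, 8.0]
print(int(df.isna().sum().sum()))   # 2, the original frame is left untouched
```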

Hierarchical indexing

Hierarchical indexing lets you have multiple index levels on a single axis.

Creating a hierarchical index

>>> data = Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,1,2]])
>>> data
a  1   -0.450814
   2   -0.776317
   3   -0.140582
b  1   -0.717184
   2    0.943802
   3    0.972454
c  1   -0.390725
   2   -1.340875
d  1   -0.648987
   2   -0.960173
dtype: float64
>>> data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 0, 1]])
>>>
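A MultiIndex can also be built explicitly and then inspected level by level. The sketch below assumes hypothetical level names key and num; recent pandas releases expose the integer positions as codes, where the older output above prints labels.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], [1, 2, 1, 2]],
    names=["key", "num"],   # hypothetical level names
)
data = pd.Series(np.arange(4.0), index=index)

print(data.index.nlevels)                            # 2
print(data.index.get_level_values("key").tolist())   # ['a', 'a', 'b', 'b']
print(data.index.levels[1].tolist())                 # [1, 2]
print(data.index.codes[0].tolist())                  # [0, 0, 1, 1]
```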

Selecting subsets with a hierarchical index

>>> data['a']
1   -0.450814
2   -0.776317
3   -0.140582
dtype: float64
>>> data['c':'d']
c  1   -0.390725
   2   -1.340875
d  1   -0.648987
   2   -0.960173
dtype: float64
>>> data.loc[['a','c']]
a  1   -0.450814
   2   -0.776317
   3   -0.140582
c  1   -0.390725
   2   -1.340875
dtype: float64
Selecting from the inner level
>>> data['a',2]
-0.7763173836675796
>>> data[:,2]
a   -0.776317
b    0.943802
c   -1.340875
d   -0.960173
dtype: float64
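The same selections can be written more explicitly with loc and xs, which may be easier to read than bare bracket indexing. A sketch on made-up values:

```python
import pandas as pd

data = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0],
    index=[["a", "a", "b", "b", "c"], [1, 2, 1, 2, 1]],
)

outer = data.loc[["a", "c"]]    # every row under outer labels 'a' and 'c'
inner = data.xs(2, level=1)     # inner label 2 across all outer labels
single = data.loc[("a", 2)]     # one scalar, addressed by a full key

print(outer.tolist())           # [10.0, 20.0, 50.0]
print(inner.index.tolist())     # ['a', 'b']
print(inner.tolist())           # [20.0, 40.0]
print(float(single))            # 20.0
```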

stack and unstack convert between a hierarchically indexed Series and a DataFrame

>>> frame
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> frame.stack()
0  0    1.0
   1    6.5
   2    3.0
1  0    1.0
3  1    6.5
   2    3.0
dtype: float64
>>> data.unstack()
          1         2         3
a -0.450814 -0.776317 -0.140582
b -0.717184  0.943802  0.972454
c -0.390725 -1.340875       NaN
d -0.648987 -0.960173       NaN
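The two operations are inverses of each other, apart from the missing values that unstack introduces for combinations that do not exist. A minimal round-trip sketch on made-up data; dropna is called explicitly so the result does not depend on whether stack drops missing values by default in the installed pandas version.

```python
import pandas as pd

data = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=[["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]],
)

wide = data.unstack()                # inner level becomes the columns
restored = wide.stack().dropna()     # back to a hierarchically indexed Series

print(wide.shape)                    # (2, 3)
print(bool(pd.isna(wide.loc["b", 3])))   # True, ('b', 3) did not exist
print(restored.tolist())             # [1.0, 2.0, 3.0, 4.0, 5.0]
```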

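Reordering and sorting levels

swaplevel swaps two levels of a hierarchical index, identified by their integer positions or by their names.

sortlevel sorts the data by the values of a given level. It is deprecated in favor of sort_index(level=...), as the warning in the session below shows.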

>>> frame = DataFrame(np.random.randn(5,4),index = [['a','a','a','b','b'],[1,2,3,1,2]],columns = pd.MultiIndex.from_arrays([['o','o','w','w'],[1,2,1,2]],names=['color','num']))
>>> frame
color         o                   w
num           1         2         1         2
a 1    1.558178  1.614265  0.674642 -0.269209
  2   -0.324755 -0.486829 -1.086918 -0.496748
  3    0.283367 -0.518154  0.551998  0.747767
b 1    0.904257  1.315240  0.328065 -0.006465
  2    0.249438  0.946020  1.572290 -0.198329
>>> frame.index.names = ['name','age']
>>> frame
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
     2   -0.324755 -0.486829 -1.086918 -0.496748
     3    0.283367 -0.518154  0.551998  0.747767
b    1    0.904257  1.315240  0.328065 -0.006465
     2    0.249438  0.946020  1.572290 -0.198329
>>> frame.swaplevel('name','age')
color            o                   w
num              1         2         1         2
age name
1   a     1.558178  1.614265  0.674642 -0.269209
2   a    -0.324755 -0.486829 -1.086918 -0.496748
3   a     0.283367 -0.518154  0.551998  0.747767
1   b     0.904257  1.315240  0.328065 -0.006465
2   b     0.249438  0.946020  1.572290 -0.198329
>>> frame.sortlevel(1)
__main__:1: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
b    1    0.904257  1.315240  0.328065 -0.006465
a    2   -0.324755 -0.486829 -1.086918 -0.496748
b    2    0.249438  0.946020  1.572290 -0.198329
a    3    0.283367 -0.518154  0.551998  0.747767
>>> frame.sort_index(level=1)  # sortlevel is being removed; sort_index with the level option does the same job
color            o                   w
num              1         2         1         2
name age
a    1    1.558178  1.614265  0.674642 -0.269209
b    1    0.904257  1.315240  0.328065 -0.006465
a    2   -0.324755 -0.486829 -1.086918 -0.496748
b    2    0.249438  0.946020  1.572290 -0.198329
a    3    0.283367 -0.518154  0.551998  0.747767
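Swapping and sorting are often combined, so that the level moved to the front also determines the row order. A small sketch with a single made-up column named value:

```python
import pandas as pd

frame = pd.DataFrame(
    {"value": [1, 2, 3, 4, 5]},
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "a", "b", "b"], [1, 2, 3, 1, 2]],
        names=["name", "age"],
    ),
)

# Move 'age' to the outer position, then sort by it (ties are
# broken by the remaining level, 'name').
by_age = frame.swaplevel("name", "age").sort_index(level=0)

print(list(by_age.index.names))     # ['age', 'name']
print(by_age["value"].tolist())     # [1, 4, 2, 5, 3]
```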

 

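Summary statistics by level

Many descriptive and summary statistics on DataFrame and Series take a level option, which specifies the index level to aggregate by on a given axis.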

>>> frame.sum(level='age')  # older pandas; newer versions use frame.groupby(level='age').sum()
color         o                   w
num           1         2         1         2
age
1      2.462435  2.929505  1.002707 -0.275673
2     -0.075318  0.459191  0.485372 -0.695077
3      0.283367 -0.518154  0.551998  0.747767
>>> frame.sum(level='color', axis=1)  # older pandas; newer versions can use frame.T.groupby(level='color').sum().T
color            o         w
name age
a    1    3.172443  0.405433
     2   -0.811584 -1.583666
     3   -0.234786  1.299765
b    1    2.219497  0.321600
     2    1.195458  1.373961
>>>
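The level argument of sum was deprecated and later removed from pandas, so on a current installation the calls above are likely to fail. An equivalent formulation with groupby is sketched below on a small made-up frame; the transpose trick for the column axis is one possible approach, not the only one.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b"], [1, 2, 1]], names=["name", "age"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["o", "o", "w", "w"], [1, 2, 1, 2]], names=["color", "num"]
    ),
)

# Row-wise: sum all rows that share the same 'age' label.
by_age = frame.groupby(level="age").sum()

# Column-wise: transpose, group the former columns by 'color', transpose back.
by_color = frame.T.groupby(level="color").sum().T

print(by_age.loc[1].tolist())    # [10, 12, 14, 16]
print(by_color["o"].tolist())    # [3, 11, 19]
print(by_color["w"].tolist())    # [7, 15, 23]
```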

Using DataFrame columns as a hierarchical row index

>>> frame = DataFrame({'a':range(7),'b':range(7,0,-1),'c':['o','o','o','t','t','f','f'],'d':[1,2,3,4,1,2,3]})
>>> frame
   a  b  c  d
0  0  7  o  1
1  1  6  o  2
2  2  5  o  3
3  3  4  t  4
4  4  3  t  1
5  5  2  f  2
6  6  1  f  3
>>> frame2 = frame.set_index(['c','d'])  # turn one or more columns into the row index
>>> frame2
     a  b
c d
o 1  0  7
  2  1  6
  3  2  5
t 4  3  4
  1  4  3
f 2  5  2
  3  6  1
>>> frame2.reset_index(['c','d'])  # move the hierarchical index levels back into columns
   c  d  a  b
0  o  1  0  7
1  o  2  1  6
2  o  3  2  5
3  t  4  3  4
4  t  1  4  3
5  f  2  5  2
6  f  3  6  1

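When columns are converted into a hierarchical row index, they are removed from the DataFrame by default. To keep them as columns as well, pass drop=False.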

>>> frame3 = frame.set_index(['c','d'],drop=False)
>>> frame3
     a  b  c  d
c d
o 1  0  7  o  1
  2  1  6  o  2
  3  2  5  o  3
t 4  3  4  t  4
  1  4  3  t  1
f 2  5  2  f  2
  3  6  1  f  3
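To round this off, here is a short sketch of the full round trip on a made-up frame. When the index was built with drop=False, the columns are still present, so one way to get back to the plain frame is reset_index(drop=True), which discards the index instead of re-inserting it.

```python
import pandas as pd

frame = pd.DataFrame({"a": [0, 1, 2], "c": ["o", "o", "t"], "d": [1, 2, 1]})

indexed = frame.set_index(["c", "d"])              # 'c' and 'd' leave the columns
restored = indexed.reset_index()                   # and come back as leading columns

kept = frame.set_index(["c", "d"], drop=False)     # columns stay in place
plain = kept.reset_index(drop=True)                # throw the index away again

print(list(indexed.columns))    # ['a']
print(list(restored.columns))   # ['c', 'd', 'a']
print(list(plain.columns))      # ['a', 'c', 'd']
```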