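1. Index objects
pandas Index objects manage the axis labels and other metadata. Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
obj=Series(range(3),index=['a','b','c'])
index=obj.index
print(index)
print(index[1:])
Output: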
Index([u'a', u'b', u'c'], dtype='object')
Index([u'b', u'c'], dtype='object')
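Index objects are immutable, so their elements cannot be changed after creation.
Immutability matters because it allows an Index to be shared safely among several data structures.
index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
print(obj2.index is index)
Output: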
True
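The immutability claim above can be checked directly. The following is a minimal sketch, assuming Python 3 and a recent pandas release; the variable names `mutated` and `shared_labels` are illustrative and do not come from the original text. It compares labels with `equals` rather than `is`, because object identity of a shared Index may depend on the pandas version.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

# Item assignment on an Index raises TypeError because an Index is immutable.
try:
    index[1] = "d"
    mutated = True
except TypeError:
    mutated = False

# The same Index can label several Series; here we only check that the
# labels compare equal, which does not depend on object identity.
s1 = pd.Series([1.5, -2.5, 0.0], index=index)
s2 = pd.Series([10, 20, 30], index=index)
shared_labels = s1.index.equals(s2.index)

print(mutated, shared_labels)
```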
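2. Essential functionality
a. Reindexing
An important method on pandas objects is reindex, which creates a new object conformed to a new index. Calling reindex on a Series rearranges the data according to the new index and introduces missing values (NaN) for any labels that were not already present. The fill_value option replaces those missing values with a value of your choice.
obj3=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
print(obj3)
obj4=obj3.reindex(['a','b','c','d','e'])
print(obj4)
print(obj3.reindex(['a','b','c','d','e'],fill_value=0))
Output: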
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
a -5.3
b 7.2
c 3.6
d 4.5
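e 0.0
dtype: float64
For ordered data such as time series, you may want to interpolate or fill values when reindexing. The method option does this; for example, ffill forward-fills values.
obj5=Series(['blue','purple','yellow'],index=[0,2,4])
print(obj5.reindex(range(6),method='ffill'))
Output: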
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
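To make the direction of filling concrete, the sketch below contrasts `ffill` with `bfill` on the same data. It assumes Python 3 and a recent pandas release; the names `colors`, `forward`, and `backward` are illustrative only.

```python
import pandas as pd

colors = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill: each new label takes the value of the closest existing label before it.
forward = colors.reindex(range(6), method="ffill")

# Backward fill: each new label takes the value of the closest existing label after it.
# Label 5 has no later label, so it stays missing.
backward = colors.reindex(range(6), method="bfill")

print(forward)
print(backward)
```

Both fill methods require the existing index to be sorted, which is why they suit ordered data such as time series.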
For a DataFrame, reindex can alter the (row) index, the columns, or both. When passed only a sequence, it reindexes the rows.
frame=DataFrame(np.arange(9).reshape(3,3),index=['a','c','d'],
                columns=['Ohio','Texas','California'])
print(frame)
frame2=frame.reindex(['a','b','c','d'])
print(frame2)
Output:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
Ohio Texas California
a 0 1 2
b NaN NaN NaN
c 3 4 5
d 6 7 8
Use the columns keyword to reindex the columns.
states=['Texas','Utah','California']
print(frame.reindex(columns=states))
Output:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
Rows and columns can also be reindexed in a single call, but filling is applied only row-wise (along axis 0).
print(frame.reindex(index=['a','b','c','d'],method='ffill',columns=states))
Output:
Texas Utah California
a 1 NaN 2
b 1 NaN 2
c 4 NaN 5
d 7 NaN 8
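The combined call above comes from an older pandas release. As an assumption worth checking against your installed version, newer releases may reject `method='ffill'` when the column labels are not sorted. The hedged sketch below separates the two steps so that filling happens only along the sorted row index; the names `filled_rows` and `two_step` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]

# Step 1: forward-fill along the rows only (the row index is sorted).
filled_rows = frame.reindex(["a", "b", "c", "d"], method="ffill")

# Step 2: pick and add columns without any filling; "Utah" becomes NaN.
two_step = filled_rows.reindex(columns=states)

print(two_step)
```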
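Label-based indexing with ix makes the reindexing task more concise. Note that ix belongs to older pandas releases: it was deprecated in version 0.20 and removed in version 1.0.
print(frame.ix[['a','b','c','d'],states])
Output: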
Texas Utah California
a 1 NaN 2
b NaN NaN NaN
c 4 NaN 5
d 7 NaN 8
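Because `ix` is no longer available in current pandas, the same result can be sketched with `reindex`. This example assumes Python 3 and pandas 1.0 or later, where `.loc` raises `KeyError` for labels that do not exist instead of inserting NaN; the name `loc_failed` is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]

# reindex accepts labels that do not exist and fills them with NaN.
result = frame.reindex(index=["a", "b", "c", "d"], columns=states)

# .loc is stricter: requesting a missing label raises KeyError.
try:
    frame.loc[["a", "b", "c", "d"], states]
    loc_failed = False
except KeyError:
    loc_failed = True

print(result)
print(loc_failed)
```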
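b. Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list. The drop method returns a new object with the specified values removed from the given axis.
obj6=Series(np.arange(5),index=['a','b','c','d','e'])
new_obj=obj6.drop('c')
print(new_obj)
print(obj6.drop(['a','b']))
Output: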
a 0
b 1
d 3
e 4
dtype: int32
c 2
d 3
e 4
dtype: int32
With a DataFrame, index values can be deleted from either axis.
data=DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],
               columns=['one','two','three','four'])
print(data.drop(['Colorado','Ohio']))        # drop rows (axis 0 is the default)
print(data.drop(['two','four'],axis=1))      # drop columns
Output:
one two three four
Utah 8 9 10 11
New York 12 13 14 15
one three
Ohio 0 2
Colorado 4 6
Utah 8 10
New York 12 14
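The point that `drop` returns a new object, leaving the original untouched, can be verified with a short sketch. It assumes Python 3 and a pandas release that supports the `index=` and `columns=` keywords of `drop` (an alternative spelling to `axis=`); the result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Equivalent to data.drop(["Colorado", "Ohio"]) with the default axis=0.
rows_dropped = data.drop(index=["Colorado", "Ohio"])

# Equivalent to data.drop(["two", "four"], axis=1).
cols_dropped = data.drop(columns=["two", "four"])

# The original frame keeps all of its rows and columns.
print(data.shape, rows_dropped.shape, cols_dropped.shape)
```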
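c. Indexing, selection, and filtering
Series indexing works like NumPy array indexing, except that the index values of a Series need not be integers.
obj7=Series(np.arange(4),index=['a','b','c','d'])
print(obj7['b'])              # single label
print(obj7[1])                # single position
print(obj7[2:4])              # positional slice
print(obj7[['b','a','d']])    # list of labels
print(obj7[[1,3]])            # list of positions
Output: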
1
1
c 2
d 3
dtype: int32
b 1
a 0
d 3
dtype: int32
b 1
d 3
dtype: int32
Slicing with labels differs from ordinary Python slicing in that the endpoint is inclusive.
print(obj7['b':'c'])
Output:
b 1
c 2
dtype: int32
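The difference between the two slicing styles is easiest to see side by side. The sketch below assumes Python 3 and a recent pandas release and uses the explicit `.loc` (label) and `.iloc` (position) accessors, which are not used in the original text.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=["a", "b", "c", "d"])

# Label slice: the endpoint "c" is included.
by_label = obj.loc["b":"c"]

# Positional slice: the endpoint 2 is excluded, as in plain Python.
by_position = obj.iloc[1:2]

print(by_label)
print(by_position)
```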
Indexing into a DataFrame retrieves one or more columns.
data2=DataFrame(np.arange(16).reshape(4,4),index=['Ohio','Colorado','Utah','New York'],
                columns=['one','two','three','four'])
print(data2)
print(data2['two'])
print(data2[['three','one']])
Output:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
This kind of indexing has a few special cases. The first is selecting rows, either by slicing or with a boolean array.
print(data2[:2])
print(data2[data2['three']>6])
Output:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
one two three four
Utah 8 9 10 11
New York 12 13 14 15
Another use is indexing with a boolean DataFrame, which here sets every value below 5 to 0 in place.
print(data2<5)
data2[data2<5]=0
print(data2)
Output:
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
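The assignment above modifies the frame in place. As a non-mutating alternative that the original text does not mention, `DataFrame.mask` returns a new frame with the selected values replaced. A minimal sketch, assuming Python 3 and a recent pandas release:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

# Boolean DataFrame marking every value below 5.
small = data < 5

# mask replaces values where the condition is True and returns a new frame.
masked = data.mask(small, 0)

print(masked)
print(data.loc["Ohio", "two"])  # the original frame is unchanged
```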
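For label-based indexing on the rows of a DataFrame, pandas provides the special indexing field ix. It is available only in older pandas releases and was removed in version 1.0.
print(data2.ix['Colorado',['two','three']])     # one row, two columns by label
print(data2.ix[['Colorado','Utah'],[3,0,1]])    # rows by label, columns by position
print(data2.ix[2])                              # third row by position
print(data2.ix[:'Utah','two'])                  # label slice, endpoint included
print(data2.ix[data2.three>5,:3])               # boolean rows, first three columns
Output: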
two 5
three 6
Name: Colorado, dtype: int32
four one two
Colorado 7 0 5
Utah 11 8 9
one 8
two 9
three 10
four 11
Name: Utah, dtype: int32
Ohio 0
Colorado 5
Utah 9
Name: two, dtype: int32
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14
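For readers on current pandas, the `ix` calls above can be approximated with `.loc` (labels) and `.iloc` (positions). This is a sketch under the assumption of Python 3 and pandas 1.0 or later. It starts from a fresh frame, so values below 5 have not been zeroed as they were in `data2`, and the result names `a` through `e` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

a = data.loc["Colorado", ["two", "three"]]                 # labels on both axes
b = data.loc[["Colorado", "Utah"]].iloc[:, [3, 0, 1]]      # row labels, then column positions
c = data.iloc[2]                                           # third row by position
d = data.loc[:"Utah", "two"]                               # label slice includes "Utah"
e = data.loc[data["three"] > 5].iloc[:, :3]                # boolean rows, first three columns

for result in (a, b, c, d, e):
    print(result)
```

Mixed label and position selection, which `ix` allowed in one step, needs two chained steps here.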
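d. Arithmetic and data alignment
When objects are added together and their indexes differ, the index of the result is the union of the two indexes. Labels found in only one of the objects produce NaN.
s1=Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
s2=Series([-2.1,3.6,-1.5,4,3.1],index=['a','c','e','f','g'])
print(s1)
print(s2)
print(s1+s2)
Output: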
a 7.3
c -2.5
d 3.4
e 1.5
dtype: float64
a -2.1
c 3.6
e -1.5
f 4.0
g 3.1
dtype: float64
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
For a DataFrame, alignment is performed on both the rows and the columns.
df1=DataFrame(np.arange(9).reshape(3,3),columns=list('bcd'),index=['Ohio','Texas','Colorado'])
df2=DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
print(df1)
print(df2)
print(df1+df2)
Output:
b c d
Ohio 0 1 2
Texas 3 4 5
Colorado 6 7 8
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b c d e
Colorado NaN NaN NaN NaN
Ohio 3 NaN 6 NaN
Oregon NaN NaN NaN NaN
Texas 9 NaN 12 NaN
Utah NaN NaN NaN NaN
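The alignment rule can be checked by comparing the result index with an explicit union of the two input indexes. The sketch below assumes Python 3 and a recent pandas release; `expected_index` and `overlap_only` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

total = s1 + s2

# The result is labeled by the union of both indexes.
expected_index = s1.index.union(s2.index)

# Only labels present in both inputs carry a real value.
overlap_only = total.dropna()

print(total.index.equals(expected_index))
print(overlap_only)
```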
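e. Filling values in arithmetic methods
In arithmetic between differently indexed objects, you may want to fill in a special value (such as 0) when an axis label is found in one object but not in the other.
df3=DataFrame(np.arange(12).reshape(3,4),columns=list('abcd'))
df4=DataFrame(np.arange(20).reshape(4,5),columns=list('abcde'))
print(df3)
print(df4)
print(df3+df4)
Output: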
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
a b c d e
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
a b c d e
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN
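Adding these together produces NA values in the positions that do not overlap. To avoid this, use df3's add method, passing df4 and a fill_value argument.
print(df3.add(df4,fill_value=0))
Output: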
a b c d e
0 0 2 4 6 4
1 9 11 13 15 9
2 18 20 22 24 14
3 15 16 17 18 19
Similarly, a fill value can be specified when reindexing a Series or DataFrame.
print(df3.reindex(columns=df4.columns,fill_value=0))
Output:
a b c d e
0 0 1 2 3 0
1 4 5 6 7 0
2 8 9 10 11 0
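To see what `fill_value` changes, the sketch below counts the missing cells produced by the plain `+` operator and by `add` with `fill_value=0`. It assumes Python 3 and a recent pandas release, and it builds the frames from floats, a small deviation from the original integer data.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list("abcd"))
df4 = pd.DataFrame(np.arange(20.0).reshape(4, 5), columns=list("abcde"))

# Plain addition: any cell missing from either frame becomes NaN.
plain = df3 + df4

# add with fill_value: a cell missing from one frame is treated as 0.
filled = df3.add(df4, fill_value=0)

print(int(plain.isna().sum().sum()), int(filled.isna().sum().sum()))
```

The fill value is applied to the inputs before the operation, so a cell that exists only in `df4` simply keeps its `df4` value.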
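f. Operations between DataFrame and Series
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the columns of the DataFrame and broadcasts down the rows.
frame=DataFrame(np.arange(12).reshape(4,3),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
series=frame.ix[0]    # first row as a Series
print(frame)
print(series)
print(frame-series)
Output: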
b d e
Utah 0 1 2
Ohio 3 4 5
Texas 6 7 8
Oregon 9 10 11
b 0
d 1
e 2
Name: Utah, dtype: int32
b d e
Utah 0 0 0
Ohio 3 3 3
Texas 6 6 6
Oregon 9 9 9
If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union.
series2=Series(range(3),index=['b','e','f'])
print(frame+series2)
Output:
b d e f
Utah 0 NaN 3 NaN
Ohio 3 NaN 6 NaN
Texas 6 NaN 9 NaN
Oregon 9 NaN 12 NaN
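To match on the rows and broadcast across the columns instead, you must use one of the arithmetic methods (such as sub) rather than an operator. The axis you pass is the axis to match on.
series3=frame['d']
print(frame.sub(series3,axis=0))
Output: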
b d e
Utah -1 0 1
Ohio -1 0 1
Texas -1 0 1
Oregon -1 0 1
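The two broadcasting directions can be compared in one short sketch. It assumes Python 3 and a recent pandas release and uses `iloc[0]` in place of `ix[0]`; the names `by_columns` and `by_rows` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row = frame.iloc[0]   # Series indexed by the column labels b, d, e
col = frame["d"]      # Series indexed by the row labels

# Default: match the Series index on the columns, broadcast down the rows.
by_columns = frame - row

# axis="index" (same as axis=0): match on the row labels, broadcast across the columns.
by_rows = frame.sub(col, axis="index")

print(by_columns)
print(by_rows)
```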