1. Reindexing
import numpy as np
from pandas import Series, DataFrame
obj = Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
Output:
a 1
b 2
c 3
d 4
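Series has a reindex method that returns a new object whose data follows the new index, so the order of the elements can change.
obj.reindex(['a', 'c', 'd', 'b', 'e'], fill_value=0)  # fill_value is the value used for labels that are missing from the original index
Output: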
a 1
c 3
d 4
b 2
e 0
obj2 = Series(['red','blue'],index=[0,4])
Output:
0 red
4 blue
obj2.reindex(range(6), method='ffill')  # method='ffill' is forward fill: each new label takes the value of the last preceding label
Output:
0 red
1 red
2 red
3 red
4 blue
5 blue
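The two filling options above behave differently, which is easiest to see side by side. The following is a minimal sketch, assuming a reasonably recent pandas; the names `s`, `filled_const` and `filled_forward` and the placeholder value `'none'` are illustrative and not part of the original notes.

```python
import pandas as pd

s = pd.Series(['red', 'blue'], index=[0, 4])

# fill_value: every label that reindex introduces gets the same constant
filled_const = s.reindex(range(6), fill_value='none')

# method='ffill': every new label takes the value of the nearest earlier label
filled_forward = s.reindex(range(6), method='ffill')

print(filled_const.tolist())    # ['red', 'none', 'none', 'none', 'blue', 'none']
print(filled_forward.tolist())  # ['red', 'red', 'red', 'red', 'blue', 'blue']

# reindex returns a new object; the original Series still has two entries
print(len(s))                   # 2
```

Note that the `method` option only works when the original index is sorted (monotonically increasing or decreasing).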
For a DataFrame, reindex can change the rows (index), the columns, or both.
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'], columns=['Ohio', 'Texas', 'California'])
Output:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # passing a single sequence reindexes the rows
Output:
Ohio Texas California
a 0.0 1.0 2.0
b NaN NaN NaN
c 3.0 4.0 5.0
d 6.0 7.0 8.0
states = ['Texas', 'Utah', 'California']
frame4 = frame.reindex(columns=states)  # use the columns keyword to reindex the columns
Output:
Texas Utah California
a 1 NaN 2
c 4 NaN 5
d 7 NaN 8
frame5 = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)  # reindex the rows and the columns at the same time
Output:
Texas Utah California
a 1.0 NaN 2.0
b NaN NaN NaN
c 4.0 NaN 5.0
d 7.0 NaN 8.0
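The outputs above switch between `0` and `0.0` because a column that receives a NaN can no longer stay an integer column. The sketch below checks this, assuming `states` is the list `['Texas', 'Utah', 'California']` implied by the output headers; `by_rows` and `by_cols` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']  # assumed from the output shown above

# Reindexing rows adds row 'b', which is NaN in every column
by_rows = frame.reindex(['a', 'b', 'c', 'd'])

# Reindexing columns drops 'Ohio' and adds an all-NaN 'Utah' column
by_cols = frame.reindex(columns=states)

print(by_rows['Ohio'].dtype)       # a float dtype, because of the NaN in row 'b'
print(by_cols['Utah'].isna().all())
print(by_cols['Texas'].tolist())   # existing values are carried over unchanged
```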
2. Dropping Entries from an Axis
obj = Series(np.arange(3.), index=['a', 'b', 'c'])
Output:
a 0.0
b 1.0
c 2.0
obj.drop(['a','b'])
Output:
c 2.0
frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
Output:
Ohio Texas California
a 0 1 2
c 3 4 5
d 6 7 8
frame.drop(['a'])  # drop a row
Output:
Ohio Texas California
c 3 4 5
d 6 7 8
frame.drop(['Ohio'], axis=1)  # drop a column
Output:
Texas California
a 1 2
c 4 5
d 7 8
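The drop calls above return new objects and leave the original untouched. A small sketch of that behaviour follows; it uses the `index=` and `columns=` keywords of `drop` as an alternative to `axis`, and the names `without_row` and `without_col` are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

# Same effect as frame.drop(['a'])
without_row = frame.drop(index=['a'])

# Same effect as frame.drop(['Ohio'], axis=1)
without_col = frame.drop(columns=['Ohio'])

print(without_row.index.tolist())    # ['c', 'd']
print(without_col.columns.tolist())  # ['Texas', 'California']
print(frame.shape)                   # (3, 3): the original is unchanged
```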
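3. Indexing, Selection, and Filtering
Series indexing works much like NumPy array indexing, except that the index values of a Series do not have to be integers.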
obj = Series([1,2,3,4],index=['a','b','c','d'])
>>>
a 1
b 2
c 3
d 4
obj['b']
>>> 2
obj.iloc[1]  # select by position; plain obj[1] on a non-integer index is deprecated in recent pandas
>>> 2
obj[0:3]
>>>
a 1
b 2
c 3
obj.iloc[[0, 3]]  # select several positions at once
>>>
a 1
d 4
obj[obj<2]
>>>
a 1
Slicing with labels differs from ordinary Python slicing: the endpoint is included, i.e. the interval is closed:
obj['b':'d']
>>>
b 2
c 3
d 4
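The difference between label slices and position slices can be made explicit with the `loc` and `iloc` indexers. This is a minimal sketch; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# Label-based slice: the endpoint 'd' is included
by_label = obj.loc['b':'d']

# Position-based slice: the endpoint 3 is excluded, as in plain Python
by_position = obj.iloc[1:3]

print(by_label.tolist())     # [2, 3, 4]
print(by_position.tolist())  # [2, 3]
```

Writing `loc` or `iloc` every time avoids having to remember which rule a bare `obj[...]` applies.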
DataFrame indexing: indexing a DataFrame with a column name or a list of names retrieves one or more columns:
frame = DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns = ['one','two','three','four'])
>>>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
frame['two']
>>>
Ohio 1
Colorado 5
Utah 9
New York 13
frame[:2]  # slicing selects rows
>>>
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
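A few other common ways to select from the same frame are sketched below: a list of columns, a boolean mask on the rows, and explicit `loc`/`iloc` access. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

# A list of names selects several columns, in the order given
two_cols = frame[['three', 'one']]

# A boolean Series selects the rows where the condition holds
big_rows = frame[frame['three'] > 5]

# loc takes (row label, column label); iloc takes positions
cell = frame.loc['Utah', 'two']
first_two = frame.iloc[:2]

print(two_cols.columns.tolist())  # ['three', 'one']
print(big_rows.index.tolist())    # ['Colorado', 'Utah', 'New York']
print(cell)                       # 9
print(first_two.index.tolist())   # ['Ohio', 'Colorado']
```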
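4. Arithmetic and Data Alignment
One of the most important features of pandas is arithmetic between objects with different indexes. When two objects are added and their indexes differ, the index of the result is the union of the two; labels that appear in only one of the objects produce NaN.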
s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = Series([4, 5, 6], index=['b', 'c', 'd'])
s1 + s2
>>>
a NaN
b 6.0
c 8.0
d NaN
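The union rule can be checked directly, and the NaN entries can be dropped when only the overlapping labels are of interest. A small sketch, with `total`, `union` and `overlap_only` as illustrative names:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

total = s1 + s2

# The result index is the union of both indexes
union = s1.index.union(s2.index)
print(total.index.equals(union))  # True

# Dropping NaN keeps only the labels present in both operands
overlap_only = total.dropna()
print(overlap_only.tolist())      # [6.0, 8.0]
```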
For a DataFrame, alignment happens on both the rows and the columns:
df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))
df1
>>>
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
df2
>>>
a b c d e
0 0.0 1.0 2.0 3.0 4.0
1 5.0 6.0 7.0 8.0 9.0
2 10.0 11.0 12.0 13.0 14.0
3 15.0 16.0 17.0 18.0 19.0
df1 + df2
>>>
a b c d e
0 0.0 2.0 4.0 6.0 NaN
1 9.0 11.0 13.0 15.0 NaN
2 18.0 20.0 22.0 24.0 NaN
3 NaN NaN NaN NaN NaN
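When the NaN values produced by non-overlapping labels are not wanted, the arithmetic methods accept a `fill_value`. The sketch below uses `DataFrame.add`, which treats a value missing from one operand as the given constant; `plain` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

plain = df1 + df2                    # NaN wherever only one operand has the cell
filled = df1.add(df2, fill_value=0)  # a missing cell in one operand counts as 0

print(int(plain.isna().sum().sum()))   # 8: row 3 (5 cells) plus column 'e' in rows 0-2
print(int(filled.isna().sum().sum()))  # 0
print(filled.loc[3, 'a'])              # 15.0
print(filled.loc[0, 'e'])              # 4.0
```

A cell would still be NaN after `fill_value` if it were missing from both operands; that does not occur here because `df2` covers every row and column of the result.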
Now consider arithmetic between a DataFrame and a Series:
arr = DataFrame(np.arange(12.).reshape((3,4)),columns = list('abcd'))
arr
>>>
a b c d
0 0.0 1.0 2.0 3.0
1 4.0 5.0 6.0 7.0
2 8.0 9.0 10.0 11.0
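series = arr.iloc[0]  # arr[0] would look for a column labeled 0 and raise a KeyError; use iloc to select a row by position (the old ix indexer has been removed from pandas)
>>>
a 0.0
b 1.0
c 2.0
d 3.0
arr - series  # by default, DataFrame-Series arithmetic matches the Series index against the DataFrame columns and broadcasts down the rows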
>>>
a b c d
0 0.0 0.0 0.0 0.0
1 4.0 4.0 4.0 4.0
2 8.0 8.0 8.0 8.0
Series2 = Series(range(3),index = list('cdf'))
>>>
c 0
d 1
f 2
arr + Series2  # by the same rule, columns without a match in the other operand become NaN
>>>
a b c d f
0 NaN NaN 2.0 4.0 NaN
1 NaN NaN 6.0 8.0 NaN
2 NaN NaN 10.0 12.0 NaN
Series3 = arr['d']
>>>
0 3.0
1 7.0
2 11.0
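# To match on the rows and broadcast across the columns, use one of the arithmetic methods
# The axis argument names the axis to match on; here the row index is matched and the values are broadcast across the columns
# axis=0 (equivalently axis='index') aligns the Series index with the row labels of the DataFrame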
arr.sub(Series3,axis = 0)
>>>
a b c d
0 -3.0 -2.0 -1.0 0.0
1 -3.0 -2.0 -1.0 0.0
2 -3.0 -2.0 -1.0 0.0
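The two broadcasting directions can be compared in one place. This is a minimal sketch using `iloc` for the row selection; `first_row`, `col_d`, `by_columns` and `by_index` are illustrative names.

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
first_row = arr.iloc[0]  # indexed by the column labels a..d
col_d = arr['d']         # indexed by the row labels 0..2

# Default: match the Series index on the columns, broadcast down the rows
by_columns = arr - first_row

# axis='index' (same as axis=0): match on the row labels, broadcast across the columns
by_index = arr.sub(col_d, axis='index')

print(by_columns['a'].tolist())  # [0.0, 4.0, 8.0]
print(by_index['a'].tolist())    # [-3.0, -3.0, -3.0]
print(by_index['d'].tolist())    # [0.0, 0.0, 0.0]
```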
5. Function Application and Mapping