pandas Module Study Notes 2

Date: 2022-12-28 20:56:53

1. Reindexing

import numpy as np
from pandas import Series, DataFrame

obj = Series([1,2,3,4],index=['a','b','c','d'])
Output:
a    1
b    2
c    3
d    4
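Series has a reindex method. It returns a new object whose values are rearranged to follow the new index; labels that did not exist before get a missing value unless a fill value is given.
obj.reindex(['a','c','d','b','e'],fill_value = 0)  # fill_value is the value used for labels that were not in the original index
Output:
a    1
c    3
d    4
b    2
e    0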
obj2 = Series(['red','blue'],index=[0,4])
Output:
0     red
4    blue

obj2.reindex(range(6),method='ffill')  # method='ffill' fills each gap with the last preceding value (forward fill)
Output:
0     red
1     red
2     red
3     red
4    blue
5    blue
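The two Series examples above can be checked with a short script. This is a minimal sketch; the variable names (reordered, with_nan, filled) are only illustrative and do not come from the notes.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# reindex returns a new Series; obj itself keeps its order and length.
reordered = obj.reindex(['a', 'c', 'd', 'b', 'e'], fill_value=0)

# Without fill_value, the new label 'e' holds NaN instead of 0.
with_nan = obj.reindex(['a', 'c', 'd', 'b', 'e'])

# Forward fill copies the last preceding value into each new label.
colors = pd.Series(['red', 'blue'], index=[0, 4])
filled = colors.reindex(range(6), method='ffill')

print(reordered.tolist())
print(with_nan.isna().tolist())
print(filled.tolist())
```

Note that fill_value and method only affect labels introduced by the reindex; values that already existed are carried over unchanged.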
For a DataFrame, reindex can change the rows (the index), the columns, or both.
frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
Output:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

frame2 = frame.reindex(['a','b','c','d'])  # passing only a sequence reindexes the rows
Output:
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

states = ['Texas','Utah','California']
frame4 = frame.reindex(columns=states)  # use the columns keyword to reindex the columns
Output:
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8

frame5 = frame.reindex(index = ['a','b','c','d'],columns=states)  # reindex rows and columns at the same time
Output:
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
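A compact way to confirm what happens when rows and columns are reindexed together is sketched below. The name both is illustrative; states is assumed to be the list ['Texas', 'Utah', 'California'], which matches the column order of the output shown above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']  # assumed from the output above

both = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)

# New labels (row 'b', column 'Utah') are filled with NaN.
print(both.loc['b'].isna().all(), both['Utah'].isna().all())

# Labels left out of the new index ('Ohio') are dropped from the result.
print('Ohio' in both.columns)

# The original frame is not modified.
print(frame.shape)
```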

2. Dropping entries from an axis

obj = Series(np.arange(3.),index = ['a','b','c'])
Output:
a    0.0
b    1.0
c    2.0

obj.drop(['a','b'])
Output:
c    2.0
frame = DataFrame(np.arange(9).reshape((3,3)),index = ['a','c','d'],columns = ['Ohio','Texas','California'])
Output:
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8

frame.drop(['a'])  # drop a row
Output:
   Ohio  Texas  California
c     3      4           5
d     6      7           8

frame.drop(['Ohio'],axis = 1)  # drop a column
Output:
   Texas  California
a      1           2
c      4           5
d      7           8
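The drop calls above return new objects and leave the original untouched. The sketch below shows this, together with the index= and columns= keywords, which are an alternative spelling of the axis argument. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

no_a = frame.drop(index=['a'])            # same result as frame.drop(['a'])
no_ohio = frame.drop(columns=['Ohio'])    # same result as frame.drop(['Ohio'], axis=1)

# Dropping a label that does not exist raises KeyError by default;
# errors='ignore' skips it instead.
unchanged = frame.drop(['z'], errors='ignore')

print(list(no_a.index))
print(list(no_ohio.columns))
print(frame.shape)  # the original still has 3 rows and 3 columns
```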

3. Indexing, selection, and filtering

Indexing a Series works much like indexing a NumPy array, except that the index values are not limited to integers.

obj = Series([1,2,3,4],index=['a','b','c','d'])
>>>
a    1
b    2
c    3
d    4

obj['b']
>>> 2

obj[1]  # the integer is treated as a position here; obj.iloc[1] is the explicit form
>>> 2

obj[0:3]
>>>
a    1
b    2
c    3

obj[[0,3]]
>>>
a    1
d    4

obj[obj<2]
>>>
a    1

Slicing by label differs from ordinary Python slicing: the end point is included, so the interval is closed:

obj['b':'d']
>>>
b    2
c    3
d    4
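The difference between label slicing and position slicing is easiest to see side by side. The following sketch uses loc and iloc to make each kind of access explicit; the variable names are illustrative.

```python
import pandas as pd

obj = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'d']    # label slice: the end label 'd' is included
by_position = obj.iloc[1:3]    # position slice: position 3 is excluded

picked = obj.iloc[[0, 3]]      # explicit positional selection of two items
small = obj[obj < 2]           # boolean mask keeps the values below 2

print(by_label.tolist())
print(by_position.tolist())
print(picked.tolist())
print(small.tolist())
```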

DataFrame indexing: indexing a DataFrame with a label or a list of labels retrieves one or more columns:

frame = DataFrame(np.arange(16).reshape((4,4)),index = ['Ohio','Colorado','Utah','New York'],columns = ['one','two','three','four'])
>>>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

frame['two']
>>>
Ohio         1
Colorado     5
Utah         9
New York    13

frame[:2]  # a slice selects rows, not columns
>>>
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
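Since plain brackets select columns for labels but rows for slices, it can help to see the common forms together. The sketch below is illustrative only; the boolean-mask and loc lines go slightly beyond the examples above and use standard pandas indexing.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['Ohio', 'Colorado', 'Utah', 'New York'],
                     columns=['one', 'two', 'three', 'four'])

two_cols = frame[['three', 'one']]       # list of labels: columns, in the order given
first_rows = frame[:2]                   # slice: rows
mask_rows = frame[frame['three'] > 5]    # boolean Series: rows where the condition holds
cell = frame.loc['Colorado', 'two']      # loc: row label and column label together

print(list(two_cols.columns))
print(list(first_rows.index))
print(list(mask_rows.index))
print(cell)
```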

4. Arithmetic and data alignment

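One of the most important features of pandas is arithmetic between objects with different indexes. When two objects are added and their index labels do not all match, the index of the result is the union of the two indexes, and labels present in only one operand produce NaN.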

s1 = Series([1,2,3],['a','b','c'])
s2 = Series([4,5,6],['b','c','d'])
s1 + s2
>>>
a    NaN
b    6.0
c    8.0
d    NaN
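If the NaN values for non-overlapping labels are not wanted, the add method accepts a fill_value that stands in for the missing operand. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['b', 'c', 'd'])

plain = s1 + s2                      # 'a' and 'd' exist on one side only, so they become NaN
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0 before adding

print(plain.isna().tolist())
print(filled.tolist())
```

The result index is the sorted union of both indexes in either case; only the handling of the unmatched labels differs.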

For a DataFrame, alignment happens on both the rows and the columns:

df1 = DataFrame(np.arange(12.).reshape(3,4),columns=list('abcd'))
df2 = DataFrame(np.arange(20.).reshape(4,5),columns=list('abcde'))

df1
>>>
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

df2
>>>
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0

df1 + df2
>>>
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
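The same fill_value idea applies to DataFrames. In the sketch below (variable names are illustrative), every cell exists in at least one of the two frames, so filling with 0 removes all the NaN values from the sum.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape(3, 4), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape(4, 5), columns=list('abcde'))

plain = df1 + df2                     # row 3 and column 'e' are missing from df1
filled = df1.add(df2, fill_value=0)   # missing cells in df1 count as 0

print(int(plain.isna().sum().sum()))   # number of NaN cells in the plain sum
print(int(filled.isna().sum().sum()))  # number of NaN cells after filling
print(filled.loc[3, 'a'], filled.loc[0, 'e'])
```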
Next, look at how arithmetic between a DataFrame and a Series works:
arr = DataFrame(np.arange(12.).reshape((3,4)),columns = list('abcd'))

arr
>>>
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0

series = arr.iloc[0]  # arr[0] would fail, because plain brackets look up a column label; iloc selects a row by position
>>>
a    0.0
b    1.0
c    2.0
d    3.0

arr - series  # by default the Series index is matched against the DataFrame columns, then broadcast down the rows
>>>
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  4.0  4.0  4.0  4.0
2  8.0  8.0  8.0  8.0

series2 = Series(range(3),index = list('cdf'))
>>>
c    0
d    1
f    2

arr + series2  # following the alignment rule, columns without a match are filled with NaN
>>>
    a   b     c     d   f
0 NaN NaN   2.0   4.0 NaN
1 NaN NaN   6.0   8.0 NaN
2 NaN NaN  10.0  12.0 NaN

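series3 = arr['d']
>>>
0     3.0
1     7.0
2    11.0

# To match on the rows and broadcast across the columns, use an arithmetic method.
# The axis argument names the axis to match on; here it is the row index.
# axis = 0 is the row index of a two-dimensional DataFrame, so the Series is subtracted from every column.
arr.sub(series3,axis = 0)
>>>
     a    b    c    d
0 -3.0 -2.0 -1.0  0.0
1 -3.0 -2.0 -1.0  0.0
2 -3.0 -2.0 -1.0  0.0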
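The two broadcasting directions described above can be compared directly. This is a minimal sketch with illustrative variable names (row0, col_d, by_columns, by_rows); it avoids calling a variable Series so that the pandas class of that name stays usable.

```python
import numpy as np
import pandas as pd

arr = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

row0 = arr.iloc[0]     # first row, indexed by the column labels a..d
col_d = arr['d']       # column 'd', indexed by the row labels 0..2

# Default: match the Series index to the columns, repeat for every row.
by_columns = arr - row0

# axis=0: match the Series index to the row index, repeat for every column.
by_rows = arr.sub(col_d, axis=0)

print(by_columns['a'].tolist())
print(by_rows.iloc[0].tolist())
```

With by_rows, column 'd' minus itself is zero in every row, which is a quick sanity check that the match happened on the row index.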

5. Function application and mapping