Learning Python (12): The pandas Library, Part 1

Date: 2022-03-20 04:03:57

pandas

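Basic features:
(1) Data structures with automatic or explicit data alignment along an axis;
(2) Integrated time series functionality;
(3) The same data structures handle both time series and non-time-series data;
(4) Arithmetic operations and reductions (such as summing along an axis) can be performed according to metadata (axis labels);
(5) Flexible handling of missing data;
(6) Merging and other relational operations found in common databases (SQL and the like).
Data structures:
I. Series
A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (its index).
The string representation of a Series shows the index on the left and the values on the right.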

First, import the module:

from pandas import Series

1. Creating a Series
① Creating a Series from an array
(The default index starts at 0, much like a one-dimensional array.)

ser1=Series([111,222,333,-444])
print ser1
print ser1.values
print ser1.index

Output:

0 111
1 222
2 333
3 -444
dtype: int64
[ 111 222 333 -444]
RangeIndex(start=0, stop=4, step=1)

② Specifying the index of a Series
(With an explicit index, a Series stores data much like the key-value pairs of a dict.)

ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2
print ser2.index

Output:

fir    10
sec 20
thi 30
fou 40
dtype: int64
Index([u'fir', u'sec', u'thi', u'fou'], dtype='object')

③ Creating a Series from a dict

d={'ShangHai':21,'TianJin':22,'ChongQing':23,'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
print d
ser3=Series(d)
print ser3

Output:

{'WuHan': 27, 'GuangZhou': 20, 'ShenYang': 24, 'ChengDu': 28, 'TianJin': 22, 'ShangHai': 21, 'XiAn': 29, 'NanJing': 25, 'ChongQing': 23}
ChengDu 28
ChongQing 23
GuangZhou 20
NanJing 25
ShangHai 21
ShenYang 24
TianJin 22
WuHan 27
XiAn 29
dtype: int64

④ When a Series is created from a dict and an index is also given, index labels with no matching dict key get the value NaN (not a number):

d={'ShangHai':21,'TianJin':22,'ChongQing':23,'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
city=['HaErBin','ShangHai','TianJin','ChongQing','ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn']
ser4=Series(d,index=city)
print ser4

Output:

HaErBin       NaN
ShangHai 21.0
TianJin 22.0
ChongQing 23.0
ShenYang 24.0
NanJing 25.0
GuangZhou 20.0
WuHan 27.0
ChengDu 28.0
XiAn 29.0
dtype: float64
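The missing entries can be located or dropped afterwards. A minimal sketch, assuming Python 3 and a recent pandas (the shortened dict is for illustration only):

```python
import pandas as pd

d = {'ShangHai': 21, 'TianJin': 22}
ser4 = pd.Series(d, index=['HaErBin', 'ShangHai', 'TianJin'])

# isna() marks the labels that had no matching key in the dict
missing = ser4.isna()
print(missing)

# dropna() returns a new Series without the missing entries
present = ser4.dropna()
print(present)
```

Because NaN is a floating-point value, the Series is stored as float64 even though the dict held integers, which matches the output above.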

2. Reading and writing a Series
① Reading and writing a Series element by index label

ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2['thi']
ser2['thi']=666
print ser2['thi']

Output:

30
666

② Reading a Series with several index labels at once

ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print ser2[['fir','sec','thi']]

Output:

fir     10
sec 20
thi 30
dtype: int64

③ Reading Series elements with boolean indexing:

ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
ser2['thi']=666
print 'Elements less than 600:'
print ser2[ser2<600]

Output:
Elements less than 600:

fir    10
sec 20
fou 40
dtype: int64
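It may help to look at the mask itself. The sketch below (Python 3 and a recent pandas assumed; the names mask and selected are illustrative) prints the boolean Series that the comparison produces and then combines two conditions:

```python
import pandas as pd

ser2 = pd.Series([10, 20, 666, 40], index=['fir', 'sec', 'thi', 'fou'])

# The comparison yields a boolean Series with the same index
mask = ser2 < 600
print(mask)

# Conditions combine with & and |; each one needs its own parentheses
selected = ser2[(ser2 > 10) & (ser2 < 600)]
print(selected)
```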

④ Checking whether an index label exists
This works like testing for a key in a dict: it returns True if the label exists and False otherwise.

ser2=Series([10,20,30,40],index=['fir','sec','thi','fou'])
print 'thi' in ser2
print 'no' in ser2

Output:

True
False
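Note that the in operator looks at labels, not at values. A short sketch of the difference, assuming Python 3 and a recent pandas:

```python
import pandas as pd

ser2 = pd.Series([10, 20, 30, 40], index=['fir', 'sec', 'thi', 'fou'])

label_found = 'thi' in ser2          # tested against the index labels
value_as_label = 30 in ser2          # 30 is a value, not a label
value_found = 30 in ser2.values      # test the underlying values instead

print(label_found, value_as_label, value_found)
```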

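3. Series arithmetic
① When Series are added or subtracted, entries with the same index label are combined; labels without a counterpart produce the missing value NaN: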

d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
city=['HaErBin','ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn']
ser4=Series(d,index=city)

print ser4+ser3
print ser4-0.5*ser3

Output:

ChengDu      56.0
GuangZhou 40.0
HaErBin NaN
NanJing 50.0
ShenYang 48.0
WuHan 54.0
XiAn 58.0
dtype: float64

ChengDu 14.0
GuangZhou 10.0
HaErBin NaN
NanJing 12.5
ShenYang 12.0
WuHan 13.5
XiAn 14.5
dtype: float64
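When a label is present in only one operand, the add method with fill_value can substitute a default instead of producing NaN. The following is a sketch with made-up Series a and b (Python 3 and a recent pandas assumed). A label that is missing on both sides, like HaErBin above, would still come out as NaN.

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'z'])

# Plain addition: labels found on only one side become NaN
plain = a + b
print(plain)

# add() with fill_value treats a value missing on one side as 0
filled = a.add(b, fill_value=0)
print(filled)
```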

4. A Series and its index can both be given names:
This makes the code more readable.

d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
# city=['HaErBin','ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn']

ser3.name='area_code'
ser3.index.name='city name'

print ser3
print ser3.index

Output:

city name
ChengDu 28
GuangZhou 20
NanJing 25
ShenYang 24
WuHan 27
XiAn 29
Name: area_code, dtype: int64
Index([u'ChengDu', u'GuangZhou', u'NanJing', u'ShenYang', u'WuHan', u'XiAn'], dtype='object', name=u'city name')

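5. The index can be replaced by assigning a new one. The new labels are applied by position, so they must follow the current order of the Series; in the output below the values are in sorted-key order (ChengDu, GuangZhou, NanJing, ...), so 'SY' ends up labeling 28 (ChengDu's value) rather than 24: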

d={'ShenYang':24,'NanJing':25,'GuangZhou':20,'WuHan':27,'ChengDu':28,'XiAn':29}
ser3=Series(d)
ser3.index=['SY','NJ','GZ','WH','CD','XA']
print ser3

Output:

SY    28
NJ 20
GZ 25
WH 24
CD 27
XA 29
dtype: int64
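Since assigning to index is purely positional, one way to avoid mislabeling is to map old labels to new ones by name. A minimal sketch using Series.rename with a dict (Python 3 and a recent pandas assumed; the abbrev mapping is illustrative):

```python
import pandas as pd

ser3 = pd.Series({'ShenYang': 24, 'NanJing': 25, 'GuangZhou': 20})

# rename() relabels by name, independent of the row order,
# and returns a new Series; ser3 itself is left unchanged
abbrev = {'ShenYang': 'SY', 'NanJing': 'NJ', 'GuangZhou': 'GZ'}
renamed = ser3.rename(abbrev)
print(renamed)
```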

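II. DataFrame
A DataFrame is a tabular data structure that holds an ordered collection of columns, each of which can have a different data type.
It has both a row index and a column index, and can be thought of as a dict of Series that share one index.
pandas combines NumPy's high-performance array computing with the flexible data manipulation of spreadsheets and relational databases (such as SQL).
First, import the module: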

from pandas import DataFrame

1. Constructing a DataFrame
① Creating a DataFrame from a dict, where the keys become the column names:

data={'ShenYang':{'AreaCode':24,'GDP':2412.2},
'NanJing':{'AreaCode':25,'GDP':5488.73},
'GuangZhou':{'AreaCode':20,'GDP':9891.48},
'WuHan':{'AreaCode':27,'GDP':6019.08}}
dfame=DataFrame(data)

print dfame

Output:

          GuangZhou  NanJing  ShenYang    WuHan
AreaCode 20.00 25.00 24.0 27.00
GDP 9891.48 5488.73 2412.2 6019.08
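With a nested dict, the outer keys become columns and the inner keys become rows. If the outer keys should be the rows instead, DataFrame.from_dict with orient='index' is one option. A sketch, assuming Python 3 and a recent pandas, with the data shortened for illustration:

```python
import pandas as pd

data = {'ShenYang': {'AreaCode': 24, 'GDP': 2412.2},
        'NanJing': {'AreaCode': 25, 'GDP': 5488.73}}

# orient='index' turns the outer keys into row labels
by_city = pd.DataFrame.from_dict(data, orient='index')
print(by_city)
```

Transposing the original frame gives a similar layout, but from_dict keeps each column's own data type.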

Or:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}

dfame=DataFrame(data)
print dfame

Output:

   AreaCode      GDP       city
0 24 2412.20 ShenYang
1 25 5488.73 NanJing
2 20 9891.48 GuangZhou
3 27 6019.08 WuHan
4 28 6111.40 ChengDu
5 29 3304.08 XiAn

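As shown, dict keys are unordered in the Python 2 environment used here, so the column order is not guaranteed (the input order was 'city', 'AreaCode', 'GDP', but the output order is AreaCode, GDP, city).
When a specific column order is needed, pass it to DataFrame through the columns argument.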

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame

Output:

        city  AreaCode      GDP
0 ShenYang 24 2412.20
1 NanJing 25 5488.73
2 GuangZhou 20 9891.48
3 WuHan 27 6019.08
4 ChengDu 28 6111.40
5 XiAn 29 3304.08
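The columns of an existing DataFrame can also be reordered after construction by selecting them with a list of names. A short sketch (Python 3 and a recent pandas assumed; the data is shortened and the name reordered is illustrative):

```python
import pandas as pd

data = {'city': ['ShenYang', 'NanJing'],
        'AreaCode': [24, 25],
        'GDP': [2412.2, 5488.73]}
dfame = pd.DataFrame(data)

# Selecting with a list of names returns a new DataFrame in that order
reordered = dfame[['GDP', 'city', 'AreaCode']]
print(reordered)
```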

If one of the specified columns does not exist in the dict data, the whole column is filled with NaN:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'])
print dfame

Output:

        city  AreaCode      GDP Population
0 ShenYang 24 2412.20 NaN
1 NanJing 25 5488.73 NaN
2 GuangZhou 20 9891.48 NaN
3 WuHan 27 6019.08 NaN
4 ChengDu 28 6111.40 NaN
5 XiAn 29 3304.08 NaN

If only some of the columns are specified, only those columns appear in the result:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','GDP','Population'])
print dfame

Output:

        city      GDP Population
0 ShenYang 2412.20 NaN
1 NanJing 5488.73 NaN
2 GuangZhou 9891.48 NaN
3 WuHan 6019.08 NaN
4 ChengDu 6111.40 NaN
5 XiAn 3304.08 NaN

The index of a DataFrame can be specified as well (by default it is 0, 1, 2, 3, ...):

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],index=['line1','line2','line3','line4','line5','line6'])
print dfame

Output:

            city  AreaCode      GDP Population
line1 ShenYang 24 2412.20 NaN
line2 NanJing 25 5488.73 NaN
line3 GuangZhou 20 9891.48 NaN
line4 WuHan 27 6019.08 NaN
line5 ChengDu 28 6111.40 NaN
line6 XiAn 29 3304.08 NaN

The index and the columns can also be given names, via dfame.index.name='line' and dfame.columns.name='brief':

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
dfame.index.name='line'
dfame.columns.name='brief'
print dfame

Output:

brief       city  AreaCode      GDP
line
0 ShenYang 24 2412.20
1 NanJing 25 5488.73
2 GuangZhou 20 9891.48
3 WuHan 27 6019.08
4 ChengDu 28 6111.40
5 XiAn 29 3304.08

2. Reading and writing a DataFrame

① Reading columns:
Read the column labels:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],index=['line1','line2','line3','line4','line5','line6'])
print dfame.columns

Output:

Index([u'city', u'AreaCode', u'GDP'], dtype='object')

A DataFrame column can be read with dfame['AreaCode'],
or as an attribute, for example dfame.city:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],index=['line1','line2','line3','line4','line5','line6'])
print dfame['AreaCode']
print dfame.city

Output:

line1    24
line2 25
line3 20
line4 27
line5 28
line6 29
Name: AreaCode, dtype: int64
line1 ShenYang
line2 NanJing
line3 GuangZhou
line4 WuHan
line5 ChengDu
line6 XiAn
Name: city, dtype: object
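Attribute access is convenient but only works when the column name does not collide with an existing DataFrame attribute or method. The sketch below uses a made-up column named count to show the collision (Python 3 and a recent pandas assumed):

```python
import pandas as pd

df = pd.DataFrame({'city': ['ShenYang', 'NanJing'], 'count': [3, 5]})

# Attribute access works for a plain column name
print(df.city)

# 'count' resolves to the DataFrame.count method, not to the column
is_method = callable(df.count)
print(is_method)

# Bracket access always returns the column
print(df['count'])
```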

The values attribute returns the data as a two-dimensional array, without the row and column labels:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame.values

Output:

[['ShenYang' 24L 2412.2]
['NanJing' 25L 5488.73]
['GuangZhou' 20L 9891.48]
['WuHan' 27L 6019.08]
['ChengDu' 28L 6111.4]
['XiAn' 29L 3304.08]]
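The array above mixes strings and numbers, so NumPy has to store it with the generic object dtype. A small sketch of how the dtype depends on the selected columns, using to_numpy (Python 3 and a recent pandas assumed; data shortened for illustration):

```python
import pandas as pd

dfame = pd.DataFrame({'city': ['ShenYang', 'NanJing'],
                      'AreaCode': [24, 25],
                      'GDP': [2412.2, 5488.73]})

# Strings and numbers together fall back to dtype object
mixed = dfame.to_numpy()

# Integer and float columns together are upcast to float64
numeric = dfame[['AreaCode', 'GDP']].to_numpy()

print(mixed.dtype, numeric.dtype)
```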

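② Modifying a column by direct assignment:
Assignment replaces the values of an entire column. (The examples in ② and ③ continue with the line1 to line6 DataFrame that has a Population column, built under ①.)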

dfame['Population']=7000000
print dfame


dfame['Population']=[111,222,333,444,555,666]
print dfame

Output:

            city  AreaCode      GDP  Population
line1 ShenYang 24 2412.20 7000000
line2 NanJing 25 5488.73 7000000
line3 GuangZhou 20 9891.48 7000000
line4 WuHan 27 6019.08 7000000
line5 ChengDu 28 6111.40 7000000
line6 XiAn 29 3304.08 7000000

city AreaCode GDP Population
line1 ShenYang 24 2412.20 111
line2 NanJing 25 5488.73 222
line3 GuangZhou 20 9891.48 333
line4 WuHan 27 6019.08 444
line5 ChengDu 28 6111.40 555
line6 XiAn 29 3304.08 666

③ Modifying a column with NumPy data:

import numpy as np
dfame['Population']=np.arange(100,700,100)
print dfame

Output:

            city  AreaCode      GDP  Population
line1 ShenYang 24 2412.20 100
line2 NanJing 25 5488.73 200
line3 GuangZhou 20 9891.48 300
line4 WuHan 27 6019.08 400
line5 ChengDu 28 6111.40 500
line6 XiAn 29 3304.08 600

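④ Modifying a column with a Series:
Assigning a Series sets the values for the index labels it contains, so different rows of a DataFrame column can be set individually; rows not named in the Series become NaN.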

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP','Population'],index=['line1','line2','line3','line4','line5','line6'])
ser=Series([111,333,444,555,666],index=['line1','line3','line4','line5','line6'])
dfame['Population']=ser
print dfame

Output:

            city  AreaCode      GDP  Population
line1 ShenYang 24 2412.20 111.0
line2 NanJing 25 5488.73 NaN
line3 GuangZhou 20 9891.48 333.0
line4 WuHan 27 6019.08 444.0
line5 ChengDu 28 6111.40 555.0
line6 XiAn 29 3304.08 666.0
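The alignment works in the other direction too: a label in the Series that is not a row of the DataFrame is simply ignored. A sketch with a deliberately mismatched label line9 (Python 3 and a recent pandas assumed; data shortened for illustration):

```python
import pandas as pd

dfame = pd.DataFrame({'city': ['ShenYang', 'NanJing', 'GuangZhou']},
                     index=['line1', 'line2', 'line3'])

# 'line9' is not a row of dfame, so its value is dropped on assignment
ser = pd.Series([111, 999], index=['line1', 'line9'])
dfame['Population'] = ser
print(dfame)
```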

⑤ Adding a new column:
Assigning to a column name that does not exist yet creates the column.

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],index=['line1','line2','line3','line4','line5','line6'])
dfame['Temperature']=[-2,10,30,18,25,5]
print dfame

Output:

            city  AreaCode      GDP  Temperature
line1 ShenYang 24 2412.20 -2
line2 NanJing 25 5488.73 10
line3 GuangZhou 20 9891.48 30
line4 WuHan 27 6019.08 18
line5 ChengDu 28 6111.40 25
line6 XiAn 29 3304.08 5
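A new column can also be computed from existing ones, and del removes a column again. A minimal sketch (Python 3 and a recent pandas assumed; the HighGDP column and the 6000 threshold are made up for illustration):

```python
import pandas as pd

dfame = pd.DataFrame({'city': ['ShenYang', 'GuangZhou'],
                      'GDP': [2412.2, 9891.48]})

# Derive a boolean column from an existing column
dfame['HighGDP'] = dfame['GDP'] > 6000
high = dfame['HighGDP'].tolist()
print(dfame)

# del removes the column in place
del dfame['HighGDP']
print(dfame.columns)
```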

3. DataFrame operations
① Transposing a DataFrame:
Analogous to a matrix transpose: rows and columns are swapped.

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'],index=['line1','line2','line3','line4','line5','line6'])
print dfame.T

Output:

             line1    line2      line3    line4    line5    line6
city ShenYang NanJing GuangZhou WuHan ChengDu XiAn
AreaCode 24 25 20 27 28 29
GDP 2412.2 5488.73 9891.48 6019.08 6111.4 3304.08
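One side effect worth knowing: after transposing a frame with mixed column types, every column holds both strings and numbers, so the columns become dtype object. A sketch that checks this (Python 3 and a recent pandas assumed; data shortened for illustration):

```python
import pandas as pd

dfame = pd.DataFrame({'city': ['ShenYang', 'NanJing'],
                      'AreaCode': [24, 25]})

# Before transposing, AreaCode is an integer column
print(dfame.dtypes)

# After transposing, each column mixes a string and a number
transposed = dfame.T
print(transposed.dtypes)
```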

② Slicing a DataFrame:

data={'city':['ShenYang','NanJing','GuangZhou','WuHan','ChengDu','XiAn'],
'AreaCode':[24,25,20,27,28,29],
'GDP':[2412.2,5488.73,9891.48,6019.08,6111.4,3304.08]
}
dfame=DataFrame(data,columns=['city','AreaCode','GDP'])
print dfame['city'][2:6]

Output:

2    GuangZhou
3 WuHan
4 ChengDu
5 XiAn
Name: city, dtype: object
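Rows can also be sliced explicitly by position or by label. The sketch below contrasts iloc and loc (Python 3 and a recent pandas assumed; the frame is shortened for illustration). Position slices exclude the end, while label slices include it.

```python
import pandas as pd

dfame = pd.DataFrame({'city': ['ShenYang', 'NanJing', 'GuangZhou', 'WuHan']},
                     index=['line1', 'line2', 'line3', 'line4'])

# iloc slices by position; the end position is excluded
by_position = dfame.iloc[1:3]
print(by_position)

# loc slices by label; the end label is included
by_label = dfame.loc['line2':'line4']
print(by_label)
```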

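III. Index
Index objects
pandas Index objects manage the axis labels and other metadata (such as axis names). When a Series or DataFrame is built, any array or other sequence of labels that is passed in gets converted to an Index.
Index objects are immutable, which makes it safe to share one Index among several data structures.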
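The immutability can be seen directly: item assignment on an Index raises a TypeError. A minimal sketch, assuming Python 3 and a recent pandas (the labels are made up):

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

error = None
try:
    index[1] = 'z'          # an Index does not support item assignment
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

# The Index is unchanged
print(index)
```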

First, import the modules (np is NumPy, which the examples below use):

from pandas import Index
import numpy as np

① Creating an Index
An Index can be built directly from an array:

index=Index(np.arange(5))
print index

Output:

Int64Index([0, 1, 2, 3, 4], dtype='int64')

The resulting Index can be used as the index of a Series.
The expression ser.index is index tells whether the two are the same Index object.

index=Index(np.arange(5))
ser=Series(['one','two','three','four','five'],index=index)
print ser
print ser.index is index

Output:

0      one
1 two
2 three
3 four
4 five
dtype: object
True

② Getting the Index

ser=Series(range(5),index=['one','two','three','four','five'])
index=ser.index
print index
print index[2:5]

Output:

Index([u'one', u'two', u'three', u'four', u'five'], dtype='object')

Index([u'three', u'four', u'five'], dtype='object')

③ Checking whether a label exists in an index

data={'ShenYang':{'AreaCode':24,'GDP':2412.2},
'NanJing':{'AreaCode':25,'GDP':5488.73},
'GuangZhou':{'AreaCode':20,'GDP':9891.48},
'WuHan':{'AreaCode':27,'GDP':6019.08}}
dfame=DataFrame(data)

print dfame
print 'WuHan' in dfame.columns
print 'GDP' in dfame.index

Output:

          GuangZhou  NanJing  ShenYang    WuHan
AreaCode 20.00 25.00 24.0 27.00
GDP 9891.48 5488.73 2412.2 6019.08

True
True

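④ Index methods and properties:

1) append: concatenates another Index object, producing a new Index;
2) diff: computes the set difference (later pandas versions name this set operation difference);
3) union: computes the set union (the intersection is computed by intersection);
4) isin: computes a boolean array indicating whether each value is contained in the passed collection;
5) delete: returns a new Index with the element at the given position removed;
6) drop: returns a new Index with the passed values removed;
7) insert: returns a new Index with an element inserted at the given position;
8) unique: computes the array of unique values in the Index;
9) is_monotonic: returns True if each element is greater than or equal to the previous element;
10) is_unique: returns True if the Index has no duplicate values.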
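Several of these can be tried on two small, made-up Index objects. The sketch assumes Python 3 and a recent pandas, so it uses difference and is_monotonic_increasing, the names current versions provide for the set difference and the monotonicity check:

```python
import pandas as pd

a = pd.Index(['a', 'b', 'c'])
b = pd.Index(['b', 'c', 'd'])

print(a.append(b))         # all labels of a followed by all labels of b
print(a.union(b))          # labels found in either Index
print(a.intersection(b))   # labels found in both
print(a.difference(b))     # labels in a that are not in b
print(a.isin(['a', 'c']))  # boolean array, one entry per label of a
print(a.delete(0))         # new Index without the element at position 0
print(a.drop('b'))         # new Index without the label 'b'
print(a.insert(1, 'x'))    # new Index with 'x' inserted at position 1
print(a.is_unique, a.is_monotonic_increasing)
```

Every call returns a new object; a and b themselves are never modified.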

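The main Index objects in pandas:

1) Index: the most general Index object, representing the axis labels as a NumPy array of Python objects;
2) Int64Index: a specialized Index for integer values;
3) MultiIndex: a hierarchical index object representing multiple levels of indexing on a single axis; it can be thought of as an array of tuples;
4) DatetimeIndex: stores nanosecond-resolution timestamps;
5) PeriodIndex: a specialized Index for Period data.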
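Three of these specialized types can be constructed as follows. This is only a sketch with made-up labels and dates, assuming Python 3 and a recent pandas:

```python
import pandas as pd

# A two-level hierarchical index built from tuples
multi = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])

# Three consecutive daily timestamps
dates = pd.date_range('2022-03-20', periods=3, freq='D')

# Three consecutive monthly periods
periods = pd.period_range('2022-01', periods=3, freq='M')

print(type(multi).__name__, type(dates).__name__, type(periods).__name__)
```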

