To use pandas, you first need to understand its two main data structures: Series and DataFrame. This post gives a brief introduction to DataFrame.
- Import the required libraries
>>> import numpy as np
>>> from pandas import Series, DataFrame
>>> import pandas as pd
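- A DataFrame is a tabular data structure. It holds an ordered collection of columns, each of which can have a different value type, much like a spreadsheet in Excel. A DataFrame has both a row index and a column index
- Create one directly from a dictionary of equal-length lists or NumPy arrays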
>>> data={'jun':['m',51,'worker',7000],'dan':['f',28,'doctor',10000.0],'hao':['M',21,'student',0.0]}
>>> frame=DataFrame(data)
>>> frame
      dan      hao     jun
0       f        M       m
1      28       21      51
2  doctor  student  worker
3   10000        0    7000
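DataFrame adds an index automatically (just as Series does). The columns in this output are also sorted by name; that reflects the pandas version used here, and recent releases keep the dictionary's insertion order instead.
- You can specify the row index and the column order explicitly. If a requested column is not found in the data, it is filled with NA values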
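The example above stores one person per column, so every column mixes strings and numbers and ends up with dtype object. As a rough sketch of the more common layout (one record per row, one field per column; the field names `name`, `sex`, `age`, `job` and `salary` are made up here for illustration), each column can keep its own type:

```python
import pandas as pd

# Hypothetical re-arrangement of the same records: one person per row.
people = {
    "name": ["jun", "dan", "hao"],
    "sex": ["m", "f", "M"],
    "age": [51, 28, 21],
    "job": ["worker", "doctor", "student"],
    "salary": [7000.0, 10000.0, 0.0],
}
people_frame = pd.DataFrame(people)

# A default integer index 0..n-1 is added automatically.
print(people_frame.index.tolist())
# Each column keeps its own dtype instead of falling back to object.
print(people_frame.dtypes)
```

With this layout, numeric operations such as `people_frame["salary"].mean()` work directly on a column.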
>>> frame=DataFrame(data,index=['a','b','c','d'])
>>> frame
      dan      hao     jun
a       f        M       m
b      28       21      51
c  doctor  student  worker
d   10000        0    7000
>>> frame=DataFrame(data,columns=['dan','hao','jun','aiqing','lianying'],index=['a','b','c','d'])
>>> frame
      dan      hao     jun  aiqing  lianying
a       f        M       m     NaN       NaN
b      28       21      51     NaN       NaN
c  doctor  student  worker     NaN       NaN
d   10000        0    7000     NaN       NaN
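A minimal sketch of the same idea on a smaller dictionary (the trimmed-down `data` here is an assumption for brevity): `columns` both orders the columns and creates any requested column that the data does not contain, filling it with NaN.

```python
import pandas as pd

data = {"jun": ["m", 51], "dan": ["f", 28]}

# "aiqing" is not a key of data, so that column is created and filled with NaN.
frame = pd.DataFrame(data, columns=["dan", "jun", "aiqing"], index=["a", "b"])

print(list(frame.columns))            # column order follows the columns argument
print(frame["aiqing"].isna().all())   # every value in the missing column is NaN
```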
- Get the row and column information of a DataFrame, and select subsets of rows and columns
>>> frame.columns
Index([u'dan', u'hao', u'jun', u'aiqing', u'lianying'], dtype='object')
>>> frame.index
Index([u'a', u'b', u'c', u'd'], dtype='object')
>>> frame.values
array([['f', 'M', 'm', nan, nan],
       [28, 21, 51, nan, nan],
       ['doctor', 'student', 'worker', nan, nan],
       [10000.0, 0.0, 7000, nan, nan]], dtype=object)
>>> frame['jun']
a m
b 51
c worker
d 7000
Name: jun, dtype: object
>>> frame.hao
a M
b 21
c student
d 0
Name: hao, dtype: object
>>> frame.loc['a']  # .ix was removed in pandas 1.0; use .loc for label-based selection
dan f
hao M
jun m
aiqing NaN
lianying NaN
Name: a, dtype: object
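The selection forms above can be summarized in a short sketch. The two-row frame is a cut-down stand-in for the one in the post; `.loc` selects by label and `.iloc` by integer position.

```python
import pandas as pd

frame = pd.DataFrame(
    {"dan": ["f", 28], "hao": ["M", 21]},
    index=["a", "b"],
)

row_by_label = frame.loc["a"]      # row whose index label is "a"
row_by_position = frame.iloc[0]    # first row, whatever its label
column_subset = frame[["dan"]]     # a list of names gives a DataFrame, not a Series
cell = frame.loc["b", "hao"]       # single value at row "b", column "hao"

print(row_by_label)
print(column_subset)
print(cell)
```

Attribute access such as `frame.hao` only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so `frame['hao']` is the safer general form.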
- Assign values to a specific column
>>> frame['aiqing']=np.arange(4)
>>> frame
      dan      hao     jun  aiqing  lianying
a       f        M       m       0       NaN
b      28       21      51       1       NaN
c  doctor  student  worker       2       NaN
d   10000        0    7000       3       NaN
>>> lianying=['a','b','d']
>>> lianying=Series(['f',48,2000],index=['a','b','d'])
>>> frame['lianying']=lianying
>>> frame
      dan      hao     jun  aiqing  lianying
a       f        M       m       0         f
b      28       21      51       1        48
c  doctor  student  worker       2       NaN
d   10000        0    7000       3      2000
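The two assignments above behave differently, which a small sketch makes explicit (the column `x` is a hypothetical placeholder so the frame has four rows): an array or list must match the number of rows, while a Series is aligned on index labels and leaves NaN where no label matches.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# An array or list is assigned by position and must have exactly one value per row.
frame["aiqing"] = np.arange(4)

# A Series is aligned on index labels; "c" has no match and becomes NaN.
frame["lianying"] = pd.Series([10.0, 20.0, 40.0], index=["a", "b", "d"])

print(frame)
```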
- Create new columns and delete columns
>>> frame['zebing']=frame.aiqing.isnull()
>>> frame
      dan      hao     jun  aiqing  lianying  zebing
a       f        M       m       0         f   False
b      28       21      51       1        48   False
c  doctor  student  worker       2       NaN   False
d   10000        0    7000       3      2000   False
>>> frame['zebing']=pd.isnull(frame.lianying)
>>> frame
      dan      hao     jun  aiqing  lianying  zebing
a       f        M       m       0         f   False
b      28       21      51       1        48   False
c  doctor  student  worker       2       NaN    True
d   10000        0    7000       3      2000   False
>>> del frame['zebing']
>>> frame.columns
Index([u'dan', u'hao', u'jun', u'aiqing', u'lianying'], dtype='object')
>>> frame.T
          a   b        c      d
dan       f  28   doctor  10000
hao       M  21  student      0
jun       m  51   worker   7000
aiqing    0   1        2      3
lianying  f  48      NaN   2000
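As a complement to `del`, here is a small sketch of `drop`, which returns a new DataFrame instead of modifying the original, together with the transpose. The two-column frame is a simplified stand-in for the one above.

```python
import pandas as pd

frame = pd.DataFrame({"dan": [1, 2], "zebing": [False, True]}, index=["a", "b"])

# drop returns a new DataFrame and leaves frame untouched,
# whereas del frame["zebing"] would modify frame in place.
trimmed = frame.drop(columns=["zebing"])

# T swaps the row index and the column index.
flipped = frame.T

print(list(frame.columns))    # still contains "zebing"
print(list(trimmed.columns))
print(flipped)
```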
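- Create a DataFrame from a nested dictionary (a dict of dicts). pandas interprets the outer keys as the column index and the inner keys as the row index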
>>> wang={'a':{'nanchang':'jiangxi','wuhan':'hubei'},'b':{'nanchang':'tengwangge','wuhan':'hunaghelou'}}
>>> house=DataFrame(wang)
>>> house
                a           b
nanchang  jiangxi  tengwangge
wuhan       hubei  hunaghelou
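One detail worth sketching: the inner dictionaries do not need identical keys. In this variation (the missing `wuhan` entry under `b` is introduced here on purpose), the row index is the union of all inner keys and the gap is filled with NaN.

```python
import pandas as pd

nested = {
    "a": {"nanchang": "jiangxi", "wuhan": "hubei"},
    "b": {"nanchang": "tengwangge"},   # no "wuhan" entry here
}
house = pd.DataFrame(nested)

print(house)
print(house.loc["wuhan", "b"])   # NaN, because "b" has no "wuhan" key
```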
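- A DataFrame's row index and column index each have a name attribute. The example also assigns a name to the DataFrame object itself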
>>> house.name='Test'
>>> house.index.name='city'
>>> house.columns.name='info'
>>> house
info            a           b
city
nanchang  jiangxi  tengwangge
wuhan       hubei  hunaghelou
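The name assigned to the object itself does not show up. This is expected: unlike Series and Index, DataFrame has no built-in name attribute, so house.name='Test' only attaches an ordinary Python attribute that pandas ignores when printing.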
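A short sketch of where pandas does keep names. It assumes pandas 1.0 or later for the `attrs` dictionary, which the documentation marks as experimental; using it to hold a table-level label is one option, not something the original post does.

```python
import pandas as pd

house = pd.DataFrame({"a": ["jiangxi", "hubei"]}, index=["nanchang", "wuhan"])

# Index objects carry a name, and it appears in the printed output.
house.index.name = "city"
house.columns.name = "info"

# A single column is a Series, and a Series has a name (the column label).
column = house["a"]
print(column.name)

# Assumption: pandas >= 1.0. attrs is a plain dict for frame-level metadata;
# it is not shown when the frame is printed.
house.attrs["name"] = "Test"
print(house.attrs)
print(house)
```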