3.1 pandas: Basic Functionality

Time: 2024-09-12 18:02:56

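I: Reindexing

  The reindex method conforms an object to a new index. On a Series it reorders or extends the index; on a DataFrame it can change the row index, the column index, or both at once.

  1) For a Series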

 In [1]: import numpy as np; import pandas as pd

 In [2]: seri = pd.Series([4.5,7.2,-5.3,3.6],index = ['d','b','a','c'])

 In [3]: seri
Out[3]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64

In [4]: seri1 = seri.reindex(['a','b','c','d','e'])

In [5]: seri1
Out[5]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN #labels absent from the original index become NaN
dtype: float64

In [6]: seri.reindex(['a','b','c','d','e'], fill_value=0)
Out[6]:
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0 #missing labels are filled with 0
dtype: float64

In [7]: seri
Out[7]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64

In [8]: seri_2 = pd.Series(['blue','purple','yellow'], index=[0,2,4])

In [9]: seri_2
Out[9]:
0 blue
2 purple
4 yellow
dtype: object

#Fill methods accepted by reindex: ffill fills forward, bfill fills backward
In [10]: seri_2.reindex(range(6),method='ffill')
Out[10]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object

In [11]: seri_2.reindex(range(6),method='bfill')
Out[11]:
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object

Reindexing a Series

  2) For a DataFrame

    Its reindex accepts, among other arguments, method='ffill' or method='bfill' and fill_value=... (the value used instead of NaN for newly introduced labels).

 In [4]: dframe_1 = pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','b','c'],columns=['Ohio','Texas','Cal'])

In [5]: dframe_1
Out[5]:
Ohio Texas Cal
a 0 1 2
b 3 4 5
c 6 7 8

In [6]: dframe_2 = dframe_1.reindex(['a','b','c','d'])

In [7]: dframe_2
Out[7]:
Ohio Texas Cal
a 0 1 2
b 3 4 5
c 6 7 8
d NaN NaN NaN

In [16]: dframe_1.reindex(index=['a','b','c','d'],method='ffill',columns=['Ohio','Beijin','Cal'])
Out[16]:
Ohio Beijin Cal
a 0 NaN 2
b 3 NaN 5
c 6 NaN 8
d 6 NaN 8

In [17]: dframe_1.reindex(index=['a','b','c','d'],fill_value='Z',columns=['Ohio','Beijin','Cal'])
Out[17]:
Ohio Beijin Cal
a 0 Z 2
b 3 Z 5
c 6 Z 8
d Z Z Z

In [8]: dframe_1.reindex(columns=['Chengdu','Beijin','Shanghai','Guangdong'])
Out[8]:
Chengdu Beijin Shanghai Guangdong
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN

In [9]: dframe_1
Out[9]:
Ohio Texas Cal
a 0 1 2
b 3 4 5
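c 6 7 8

#Changing the row and column index together. The original used the .ix indexer, which was removed in pandas 1.0; reindex produces the same result
In [10]: dframe_1.reindex(index=['a','b','c','d'],columns=['Ohio','Beijing','Guangdong'])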
Out[10]:
Ohio Beijing Guangdong
a 0 NaN NaN
b 3 NaN NaN
c 6 NaN NaN
d NaN NaN NaN

Reindexing a DataFrame
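As a compact recap of the DataFrame case, here is a minimal sketch that reindexes both axes in a single call and then uses fill_value. The variable names (frame, both, filled) are illustrative only and do not come from the transcript above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "b", "c"],
    columns=["Ohio", "Texas", "Cal"],
)

# Reindex rows and columns in one call; labels that did not exist become NaN.
both = frame.reindex(index=["a", "b", "c", "d"], columns=["Ohio", "Beijing"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
filled = frame.reindex(index=["a", "b", "c", "d"], fill_value=0)

print(both)
print(filled)
```

reindex always returns a new object, so frame itself is left unchanged.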

II: Dropping entries from an axis

  The drop method removes entries by index label and returns a new object.

  1) For a Series

 In [21]: seri = pd.Series(np.arange(5),index=['a','b','c','d','e'])

 In [22]: seri
Out[22]:
a 0
b 1
c 2
d 3
e 4
dtype: int32

In [23]: seri.drop('b')
Out[23]:
a 0
c 2
d 3
e 4
dtype: int32

In [24]: seri.drop(['d','e'])
Out[24]:
a 0
b 1
c 2
dtype: int32

Dropping entries from a Series

  2) For a DataFrame

 In [29]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Chen','Bei','Shang','Guang'],columns=['one','two','three','four'])

In [30]: dframe
Out[30]:
one two three four
Chen 0 1 2 3
Bei 4 5 6 7
Shang 8 9 10 11
Guang 12 13 14 15

#Drop rows
In [31]: dframe.drop(['Bei','Shang'])
Out[31]:
one two three four
Chen 0 1 2 3
Guang 12 13 14 15

#Drop columns
In [33]: dframe.drop(['two','three'],axis=1)
Out[33]:
one four
Chen 0 3
Bei 4 7
Shang 8 11
Guang 12 15

#When only a single label is dropped, the list brackets [] can be omitted

Dropping entries from a DataFrame
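The same drops can be written with the index= and columns= keywords, which some readers find clearer than axis=0/1. This is a small sketch with illustrative names (frame, rows_dropped, cols_dropped), not part of the original session.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Chen", "Bei", "Shang", "Guang"],
    columns=["one", "two", "three", "four"],
)

# Equivalent to frame.drop(["Bei", "Shang"]) (axis=0 is the default).
rows_dropped = frame.drop(index=["Bei", "Shang"])

# Equivalent to frame.drop("two", axis=1); a single label needs no list.
cols_dropped = frame.drop(columns="two")

print(rows_dropped)
print(cols_dropped)
```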

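III: Indexing, selection, and filtering

  1) Series

    A Series can still be accessed by integer position the way a list is, but I don't think that is a good idea; it is better to access values by index label. Labels can also be used for slicing, and a label slice includes its end point.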

In [4]: seri = pd.Series(np.arange(4),index=['a','b','c','d'])

In [5]: seri
Out[5]:
a 0
b 1
c 2
d 3
dtype: int32

In [6]: seri['a']
Out[6]: 0

In [7]: seri[['b','a']] #the result follows the requested order
Out[7]:
b 1
a 0
dtype: int32

In [18]: seri[seri<2] #the comparison is element-wise and yields a boolean mask
Out[18]:
a 0
b 1
dtype: int32

In [11]: seri['a':'c'] #slicing by label
Out[11]:
a 0
b 1
c 2
dtype: int32

In [12]: seri['a':'c']='z'

In [13]: seri
Out[13]:
a z
b z
c z
d 3
dtype: object

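Selecting from a Series

  2) DataFrame

    Selecting from a DataFrame mostly means getting one or more columns. A DataFrame can be viewed as several Series that share the same index, so for plain [] indexing the labels across the top (the column names) act as the keys: dframe[[list of column names]] selects several columns, and every name must be a column label, otherwise a KeyError is raised. Rows, on the other hand, can be taken with a slice; for example dframe[:2] selects rows 0 and 1. To select along both directions at once, index with [row selector, column selector]. Older pandas used the .ix indexer for this; it was removed in pandas 1.0 in favour of .loc (labels) and .iloc (positions).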

 In [19]: dframe = pd.DataFrame(np.arange(16).reshape((4,4)),index=['one','two','three','four'],columns=['Bei','Shang','Guang','Sheng'])

In [21]: dframe
Out[21]:
Bei Shang Guang Sheng
one 0 1 2 3
two 4 5 6 7
three 8 9 10 11
four 12 13 14 15

In [22]: dframe[['one']] #raises KeyError: 'one' is a row label, not a column name
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-22-c2522043b676> in <module>()
----> 1 dframe[['one']]

In [25]: dframe[['Bei']]
Out[25]:
Bei
one 0
two 4
three 8
four 12

In [26]: dframe[['Bei','Sheng']]
Out[26]:
Bei Sheng
one 0 3
two 4 7
three 8 11
four 12 15

In [27]: dframe[:2] #select rows
Out[27]:
Bei Shang Guang Sheng
one 0 1 2 3
two 4 5 6 7

#For label-based selection on both axes use .loc (the original used .ix, removed in pandas 1.0); the first argument selects rows, the second selects columns
In [33]: dframe.loc[['one','two'],['Bei','Shang']]
Out[33]:
Bei Shang
one 0 1
two 4 5

#This shows that each column name is also available as an attribute of the DataFrame
In [35]: dframe.Bei
Out[35]:
one 0
two 4
three 8
four 12
Name: Bei, dtype: int32

In [36]: dframe[dframe.Bei<5]
Out[36]:
Bei Shang Guang Sheng
one 0 1 2 3
two 4 5 6 7

In [38]: dframe.loc[dframe.Bei<5,dframe.columns[:2]] #originally dframe.ix[dframe.Bei<5,:2]; .ix was removed in pandas 1.0
Out[38]:
Bei Shang
one 0 1
two 4 5

In [43]: dframe.loc[:'two',['Shang','Bei']] #originally written with .ix
Out[43]:
Shang Bei
one 1 0
two 5 4

Selecting from a DataFrame
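The selections above can be sketched with the two indexers available in current pandas: .loc for labels and .iloc for integer positions. The names used here (frame, by_label, by_position, masked, label_slice) are illustrative only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["one", "two", "three", "four"],
    columns=["Bei", "Shang", "Guang", "Sheng"],
)

# Label-based selection: rows first, then columns.
by_label = frame.loc[["one", "two"], ["Bei", "Shang"]]

# Position-based selection of the same block.
by_position = frame.iloc[:2, :2]

# A boolean mask for the rows combined with column labels.
masked = frame.loc[frame["Bei"] < 5, ["Shang", "Bei"]]

# A label slice includes its end point ("two" is kept).
label_slice = frame.loc[:"two", "Bei"]

print(by_label)
print(masked)
print(label_slice)
```

Keeping labels in .loc and positions in .iloc avoids the ambiguity that led to .ix being retired.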

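IV: Arithmetic

  1) Series

    The operands are aligned on their index labels before the operation is applied, and any label that appears in only one operand produces NaN in the result. The arithmetic methods (add, sub, and so on) avoid this when given a fill_value.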

 In [4]: seri_1 = pd.Series([1,2,3,4],index = ['a','b','c','d'])

 In [5]: seri_2 = pd.Series([5,6,7,8,9],index = ['a','c','e','g','f'])

 In [6]: seri_1 + seri_2
Out[6]:
a 6
b NaN
c 9
d NaN
e NaN
f NaN
g NaN
dtype: float64

In [8]: seri_1.add(seri_2)
Out[8]:
a 6
b NaN
c 9
d NaN
e NaN
f NaN
g NaN
dtype: float64

In [7]: seri_1.add(seri_2,fill_value = 0)
Out[7]:
a 6
b 2
c 9
d 4
e 7
f 9
g 8
dtype: float64

#With fill_value, the non-overlapping labels above keep a value instead of NaN
#Method equivalents of the operators: add is +, sub is -, mul is *, div is /

Series arithmetic
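To make the alignment rule concrete, here is a minimal sketch that compares the + operator with add and mul using fill_value. The names (s1, s2, plain, filled, product) are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
s2 = pd.Series([5, 6, 7, 8, 9], index=["a", "c", "e", "g", "f"])

# Only "a" and "c" exist in both operands; every other label becomes NaN.
plain = s1 + s2

# fill_value stands in for the operand that is missing on one side.
filled = s1.add(s2, fill_value=0)

# For multiplication, 1 is the neutral fill value.
product = s1.mul(s2, fill_value=1)

print(plain)
print(filled)
print(product)
```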

  2) DataFrame

 In [10]: df_1 = pd.DataFrame(np.arange(12).reshape((3,4)),columns = list('abcd'))

In [11]: df_2 = pd.DataFrame(np.arange(20).reshape((4,5)),columns = list('abcde'))

In [12]: df_1 + df_2
Out[12]:
a b c d e
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN

In [13]: df_1.add(df_2)
Out[13]:
a b c d e
0 0 2 4 6 NaN
1 9 11 13 15 NaN
2 18 20 22 24 NaN
3 NaN NaN NaN NaN NaN

In [14]: df_1.add(df_2, fill_value = 0)
Out[14]:
a b c d e
0 0 2 4 6 4
1 9 11 13 15 9
2 18 20 22 24 14
3 15 16 17 18 19

DataFrame arithmetic

  3) Operations between a DataFrame and a Series

  These behave like broadcasting with np.array:

 In [15]: arr_1 = np.arange(12).reshape((3,4))

 In [16]: arr_1 - arr_1[0]
Out[16]:
array([[0, 0, 0, 0],
[4, 4, 4, 4],
[8, 8, 8, 8]])

In [17]: arr_1
Out[17]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])

The ndarray case

 In [18]: dframe_1 = pd.DataFrame(np.arange(12).reshape((4,3)),columns=list('bde'),index = ['Chen','Bei','Shang','Sheng'])

In [19]: dframe_1
Out[19]:
b d e
Chen 0 1 2
Bei 3 4 5
Shang 6 7 8
Sheng 9 10 11

In [20]: seri = dframe_1.iloc[0] #originally dframe_1.ix[0]; .ix was removed in pandas 1.0

In [21]: seri
Out[21]:
b 0
d 1
e 2
Name: Chen, dtype: int32

In [22]: dframe_1 - seri #the Series index is matched against the columns and the subtraction is applied to every row
Out[22]:
b d e
Chen 0 0 0
Bei 3 3 3
Shang 6 6 6
Sheng 9 9 9

In [23]: seri_2 = pd.Series(range(3),index=['b','e','f'])

In [24]: dframe_1 - seri_2
Out[24]:
b d e f
Chen 0 NaN 1 NaN
Bei 3 NaN 4 NaN
Shang 6 NaN 7 NaN
Sheng 9 NaN 10 NaN

In [27]: seri_3 = dframe_1['d']

In [28]: seri_3 #Note: the index of seri_3 is the row index of dframe_1, not its columns, unlike the operands above
Out[28]:
Chen 1
Bei 4
Shang 7
Sheng 10
Name: d, dtype: int32

In [29]: dframe_1 - seri_3
Out[29]:
Bei Chen Shang Sheng b d e
Chen NaN NaN NaN NaN NaN NaN NaN
Bei NaN NaN NaN NaN NaN NaN NaN
Shang NaN NaN NaN NaN NaN NaN NaN
Sheng NaN NaN NaN NaN NaN NaN NaN
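#Note that the result's columns are the union of the Series index and the DataFrame's own columns
#The axis argument of the arithmetic methods changes which axis is matched, which avoids this
#axis=0 matches the Series index against the row index; axis=1 (the default) matches it against the columns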
In [31]: dframe_1.sub(seri_3,axis=0)
Out[31]:
b d e
Chen -1 0 1
Bei -1 0 1
Shang -1 0 1
Sheng -1 0 1

In [33]: dframe_1.sub(seri_3,axis=1)
Out[33]:
Bei Chen Shang Sheng b d e
Chen NaN NaN NaN NaN NaN NaN NaN
Bei NaN NaN NaN NaN NaN NaN NaN
Shang NaN NaN NaN NaN NaN NaN NaN
Sheng NaN NaN NaN NaN NaN NaN NaN

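Operations between a DataFrame and a Series

    Note: for the axis argument, 0 treats the Series as indexed by the row index (the vertical axis), and 1 treats it as indexed by the columns (the horizontal axis).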
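The two matching directions can be put side by side in a short sketch. The names (frame, first_row, col_d, by_columns, by_rows) are illustrative and not taken from the session above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    columns=list("bde"),
    index=["Chen", "Bei", "Shang", "Sheng"],
)

first_row = frame.iloc[0]   # indexed by the column labels b, d, e
col_d = frame["d"]          # indexed by the row labels

# Default: the Series index is matched against the columns,
# and the operation is repeated for every row.
by_columns = frame - first_row

# axis=0: the Series index is matched against the row index,
# and the operation is repeated for every column.
by_rows = frame.sub(col_d, axis=0)

print(by_columns)
print(by_rows)
```

A quick way to choose the axis is to look at the index of the Series: if it holds row labels, pass axis=0.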

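V: Applying functions

  NumPy element-wise functions (ufuncs) work directly on pandas objects, and apply calls a function on each column (the default) or on each row (axis=1).

Applying functions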
 In [6]: dframe=pd.DataFrame(np.random.randn(4,3),columns=list('bde'),index=['Chen','Bei','Shang','Sheng'])

In [7]: dframe
Out[7]:
b d e
Chen 1.838620 1.023421 0.641420
Bei 0.920563 -2.037778 -0.853871
Shang -0.587332 0.576442 0.596269
Sheng 0.366174 -0.689582 -1.064030

In [8]: np.abs(dframe) #absolute value, applied element-wise
Out[8]:
b d e
Chen 1.838620 1.023421 0.641420
Bei 0.920563 2.037778 0.853871
Shang 0.587332 0.576442 0.596269
Sheng 0.366174 0.689582 1.064030

In [9]: func = lambda x: x.max() - x.min()

In [10]: dframe.apply(func)
Out[10]:
b 2.425952
d 3.061200
e 1.705449
dtype: float64

In [11]: dframe.apply(func,axis=1)
Out[11]:
Chen 1.197200
Bei 2.958341
Shang 1.183602
Sheng 1.430204
dtype: float64

In [12]: dframe.max() #same as dframe.max(axis=0)
Out[12]:
b 1.838620
d 1.023421
e 0.641420
dtype: float64

In [15]: dframe.max(axis=1)
Out[15]:
Chen 1.838620
Bei 0.920563
Shang 0.596269
Sheng 0.366174
dtype: float64
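Because the session above uses random numbers, its output cannot be reproduced exactly. Here is a small deterministic sketch of the same idea; the data and the helper name value_range are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [1.0, -2.0, 3.0], "d": [4.0, 0.5, -1.5]},
    index=["Chen", "Bei", "Shang"],
)

def value_range(values):
    """Return the spread (max minus min) of a Series."""
    return values.max() - values.min()

# apply passes each column to the function by default...
per_column = frame.apply(value_range)

# ...and each row when axis=1.
per_row = frame.apply(value_range, axis=1)

# Element-wise operations do not need apply at all.
absolute = frame.abs()

print(per_column)
print(per_row)
print(absolute)
```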

VI: Sorting

  1) Sorting by index: sort_index(axis=0 or 1, ascending=True or False). The defaults are axis=0 (sort by the row index) and ascending=True.

 In [16]: seri = pd.Series(range(4),index=['d','a','d','c'])

 In [17]: seri
Out[17]:
d 0
a 1
d 2
c 3
dtype: int64

In [18]: seri.sort_index()
Out[18]:
a 1
c 3
d 2
d 0
dtype: int64

Sorting a Series by index

 In [22]: dframe
Out[22]:
c a b
Chen 1.838620 1.023421 0.641420
Bei 0.920563 -2.037778 -0.853871
Shang -0.587332 0.576442 0.596269
Sheng 0.366174 -0.689582 -1.064030

In [23]: dframe.sort_index()
Out[23]:
c a b
Bei 0.920563 -2.037778 -0.853871
Chen 1.838620 1.023421 0.641420
Shang -0.587332 0.576442 0.596269
Sheng 0.366174 -0.689582 -1.064030

In [24]: dframe.sort_index(axis=1)
Out[24]:
a b c
Chen 1.023421 0.641420 1.838620
Bei -2.037778 -0.853871 0.920563
Shang 0.576442 0.596269 -0.587332
Sheng -0.689582 -1.064030 0.366174

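Sorting a DataFrame by index; axis selects whether the row index (axis=0, the default) or the columns (axis=1) are sorted

  2) Sorting by value: the sort_values method (note: the older order method is deprecated)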

 In [32]: seri =pd.Series([4,7,np.nan,-1,2,np.nan])

 In [33]: seri
Out[33]:
0 4
1 7
2 NaN
3 -1
4 2
5 NaN
dtype: float64

In [34]: seri.sort_values()
Out[34]:
3 -1
4 2
0 4
1 7
2 NaN
5 NaN
dtype: float64

#NaN values are placed at the end by default

Sorting a Series by value

 In [38]: dframe = pd.DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})

 In [39]: dframe
Out[39]:
a b
0 0 4
1 1 7
2 0 -3
3 1 2

In [54]: dframe.sort_values('a')
Out[54]:
a b
0 0 4
2 0 -3
1 1 7
3 1 2

In [55]: dframe.sort_values('b')
Out[55]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7

In [57]: dframe.sort_values(['a','b'])
Out[57]:
a b
2 0 -3
0 0 4
3 1 2
1 1 7

In [58]: dframe.sort_values(['b','a'])
Out[58]:
a b
2 0 -3
3 1 2
0 0 4
1 1 7

Sorting a DataFrame by value
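sort_values also accepts one ascending flag per key, which the session above does not show. The sketch below uses the same small frame; the names by_a_then_b, mixed and cols_sorted are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" first and break ties with "b", both ascending.
by_a_then_b = frame.sort_values(["a", "b"])

# One flag per key: "a" ascending, "b" descending.
mixed = frame.sort_values(["a", "b"], ascending=[True, False])

# sort_index(axis=1) orders the columns by their labels.
cols_sorted = frame.sort_index(axis=1)

print(by_a_then_b)
print(mixed)
print(cols_sorted)
```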

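VII: Ranking

  The rank method assigns each value its rank; by default, tied values receive the average of the ranks they span.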
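The original gives no example for rank, so here is a minimal sketch on made-up data showing how the method argument changes the treatment of ties. The Series values and variable names are assumptions for illustration.

```python
import pandas as pd

values = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Default method="average": tied values share the mean of their ranks.
average_rank = values.rank()

# method="first": ties are ranked in the order they appear.
first_rank = values.rank(method="first")

# Rank from largest to smallest; ties take the highest rank of the group.
descending = values.rank(ascending=False, method="max")

print(average_rank)
print(first_rank)
print(descending)
```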

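VIII: Summary statistics

  count: number of non-NaN values  describe: summary statistics for a Series or for each DataFrame column  min, max  argmin, argmax: integer positions of the minimum and maximum  idxmin, idxmax: index labels of the minimum and maximum

  sum  mean: average  var: sample variance  std: sample standard deviation  kurt: kurtosis  cumsum: cumulative sum  cummin/cummax: cumulative minimum and maximum  pct_change: percent change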

 In [63]: df = pd.DataFrame([[1.4,np.nan],[7.1,-4.5],[np.nan,np.nan],[0.75,-1.3]],index=['a','b','c','d'],columns=['one','two'])

In [64]: df
Out[64]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

In [66]: df.sum()
Out[66]:
one 9.25
two -5.80
dtype: float64

In [67]: df.sum(axis=1)
Out[67]:
a 1.40
b 2.60
c NaN #pandas 0.22 and later return 0.00 for an all-NaN row unless min_count=1 is passed
d -0.55
dtype: float64

#Mean along each row; skipna controls whether NaN is skipped
In [68]: df.mean(axis=1,skipna=False)
Out[68]:
a NaN
b 1.300
c NaN
d -0.275
dtype: float64

In [70]: df.idxmax()
Out[70]:
one b
two d
dtype: object

In [71]: df.cumsum()
Out[71]:
one two
a 1.40 NaN
b 8.50 -4.5
c NaN NaN
d 9.25 -5.8

In [72]: df.describe()
Out[72]:
one two
count 3.000000 2.000000
mean 3.083333 -2.900000
std 3.493685 2.262742
min 0.750000 -4.500000
25% 1.075000 -3.700000
50% 1.400000 -2.900000
75% 4.250000 -2.100000
max 7.100000 -1.300000

Some summary statistics
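A few of the reductions listed above, gathered into one runnable sketch on the same data. The result names (column_sums, row_means_strict, max_labels, non_missing) are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Reductions skip NaN by default and work column by column.
column_sums = df.sum()

# With skipna=False a single NaN makes the whole row's mean NaN.
row_means_strict = df.mean(axis=1, skipna=False)

# idxmax returns the index label of each column's maximum.
max_labels = df.idxmax()

# count reports the number of non-NaN values per column.
non_missing = df.count()

print(column_sums)
print(row_means_strict)
print(max_labels)
print(non_missing)
```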

IX: Unique values, value counts, and membership

  unique: the distinct values  value_counts: how often each value occurs  isin: membership test

 In [74]: seri = pd.Series(['c','a','d','a','a','b','b','c','c'])

 In [75]: seri
Out[75]:
0 c
1 a
2 d
3 a
4 a
5 b
6 b
7 c
8 c
dtype: object

In [76]: seri.unique()
Out[76]: array(['c', 'a', 'd', 'b'], dtype=object)

In [77]: seri.value_counts()
Out[77]:
c 3
a 3
b 2
d 1
dtype: int64

In [78]: pd.value_counts(seri.values,sort=False)
Out[78]:
a 3
c 3
b 2
d 1
dtype: int64

In [81]: seri.isin(['b','c'])
Out[81]:
0 True
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool

Unique values, value counts, and membership
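The mask returned by isin is most useful as a filter. The sketch below combines the three methods on the same letters; the variable names are illustrative.

```python
import pandas as pd

letters = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Distinct values in order of first appearance.
distinct = letters.unique()

# Frequency of each value, indexed by the value itself.
counts = letters.value_counts()

# isin builds a boolean mask that can filter the Series.
mask = letters.isin(["b", "c"])
selected = letters[mask]

print(distinct)
print(counts)
print(selected)
```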

X: Handling missing data

  A) Removing NaN: the dropna method

    1) Series

      pandas treats Python's None as a missing value, just like NumPy's NaN.

 In [3]: seri = pd.Series(['aaa','bbb',np.nan,'ccc'])

 In [4]: seri[0]=None

 In [5]: seri
Out[5]:
0 None
1 bbb
2 NaN
3 ccc
dtype: object

In [7]: seri.isnull()
Out[7]:
0 True
1 False
2 True
3 False
dtype: bool

In [8]: seri.dropna() #returns only the non-missing values
Out[8]:
1 bbb
3 ccc
dtype: object

In [9]: seri
Out[9]:
0 None
1 bbb
2 NaN
3 ccc
dtype: object

In [10]: seri[seri.notnull()] #same result using a boolean mask
Out[10]:
1 bbb
3 ccc
dtype: object

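Handling missing data in a Series

    2) DataFrame

      For a DataFrame things are a little more involved: sometimes you want to drop only the rows or columns that are entirely NaN, and sometimes every row or column that contains any NaN.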

 In [15]: df = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])

In [16]: df
Out[16]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3

In [17]: df.dropna() #works on rows by default (axis=0) and drops every row that contains any NaN
Out[17]:
0 1 2
0 1 6.5 3

In [19]: df.dropna(how='all') #drops only the rows that are entirely NaN
Out[19]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
3 NaN 6.5 3

In [21]: df.dropna(axis=1,how='all') #applies the same rule to columns; no column is entirely NaN, so nothing is dropped
Out[21]:
0 1 2
0 1 6.5 3
1 1 NaN NaN
2 NaN NaN NaN
3 NaN 6.5 3

In [22]: df.dropna(axis=1)
Out[22]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]

Handling missing data in a DataFrame
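Between "any NaN" and "all NaN" there is a middle ground: the thresh argument keeps rows that have at least a given number of non-missing values. A minimal sketch on the same frame, with illustrative result names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 6.5, 3], [1, np.nan, np.nan], [np.nan, np.nan, np.nan], [np.nan, 6.5, 3]]
)

# Default how="any": drop every row that contains a NaN.
any_missing = df.dropna()

# how="all": drop only the rows that are entirely NaN.
all_missing = df.dropna(how="all")

# thresh=2: keep rows with at least two non-missing values.
at_least_two = df.dropna(thresh=2)

print(any_missing)
print(all_missing)
print(at_least_two)
```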

  

  B) Filling NaN: the fillna method

 In [88]: df
Out[88]:
one two
a 1.40 NaN
b 7.10 -4.5
c NaN NaN
d 0.75 -1.3

In [90]: df.fillna(0)
Out[90]:
one two
a 1.40 0.0
b 7.10 -4.5
c 0.00 0.0
d 0.75 -1.3

Filling NaN
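A single constant is not the only option. The sketch below shows three other common fills: a different value per column, forward filling, and the column mean. The fill values are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# A dict gives each column its own fill value.
per_column = df.fillna({"one": 0.0, "two": -1.0})

# ffill carries the last valid value forward; a leading NaN stays NaN.
forward = df.ffill()

# Passing a Series fills each column with the matching entry, here its mean.
with_mean = df.fillna(df.mean())

print(per_column)
print(forward)
print(with_mean)
```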

XI: Hierarchical indexing

 In [30]: seri = pd.Series(np.random.randn(10),index=[['a','a','a','b','b','b','c','c','d','d'],[1,2,3,1,2,3,1,2,2,3]])

In [31]: seri
Out[31]:
a 1 0.528387
2 -0.152286
3 -0.776540
b 1 0.025425
2 -1.412776
3 0.969498
c 1 0.478260
2 0.116301
d 2 1.464144
3 2.266069
dtype: float64

In [32]: seri['a']
Out[32]:
1 0.528387
2 -0.152286
3 -0.776540
dtype: float64

In [33]: seri.index
Out[33]:
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])

In [35]: seri['a':'c']
Out[35]:
a 1 0.528387
2 -0.152286
3 -0.776540
b 1 0.025425
2 -1.412776
3 0.969498
c 1 0.478260
2 0.116301
dtype: float64

In [45]: seri.unstack()
Out[45]:
1 2 3
a 0.528387 -0.152286 -0.776540
b 0.025425 -1.412776 0.969498
c 0.478260 0.116301 NaN
d NaN 1.464144 2.266069

In [46]: seri.unstack().stack()
Out[46]:
a 1 0.528387
2 -0.152286
3 -0.776540
b 1 0.025425
2 -1.412776
3 0.969498
c 1 0.478260
2 0.116301
d 2 1.464144
3 2.266069
dtype: float64

Hierarchical indexing on a Series; the unstack method turns it into a DataFrame

 In [48]: df = pd.DataFrame(np.arange(12).reshape((4,3)),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])

In [49]: df
Out[49]:
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

In [50]: df.index
Out[50]:
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [51]: df.columns
Out[51]:
MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Green', u'Red']],
labels=[[1, 1, 0], [0, 1, 0]])

In [53]: df['Ohio']
Out[53]:
Green Red
a 1 0 1
2 3 4
b 1 6 7
2 9 10

In [57]: df.loc['a','Ohio'] #originally df.ix['a','Ohio']; .ix was removed in pandas 1.0
Out[57]:
Green Red
1 0 1
2 3 4

In [61]: df.loc['a','Ohio'].loc[1,'Red'] #originally chained .ix calls
Out[61]: 1

Hierarchical indexing on a DataFrame
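With .loc, a full key on each axis can be given as a tuple, which avoids chaining two lookups. The sketch below also unstacks a small Series with fixed values so the result is reproducible; the Series data and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

# Selecting an outer column label returns the sub-frame below it.
ohio = df["Ohio"]

# One tuple per axis addresses a single cell directly.
cell = df.loc[("a", 1), ("Ohio", "Red")]

# unstack moves the inner index level into the columns;
# combinations that do not exist become NaN.
series = pd.Series([10, 20, 30], index=[["a", "a", "b"], [1, 2, 1]])
wide = series.unstack()

print(ohio)
print(cell)
print(wide)
```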