在处理数据时,经常需要对数据分组计算均值或者计数,在Microsoft Excel中,可以通过透视表轻易实现简单的分组运算。而对于更加复杂的分组运算,Python中pandas包可以帮助我们实现。
1 数据
首先引入几个重要的包:
1
2
3
|
import pandas as pd
import numpy as np
from pandas import DataFrame,Series
|
通过代码构造数据集:
1
|
data = DataFrame({ 'key1' :[ 'a' , 'b' , 'c' , 'a' , 'c' , 'a' , 'b' , 'a' , 'c' , 'a' , 'b' , 'c' ], 'key2' :[ 'one' , 'two' , 'three' , 'two' , 'one' , 'one' , 'three' , 'one' , 'two' , 'three' , 'one' , 'two' ], 'num1' :np.random.rand( 12 ), 'num2' :np.random.randn( 12 )})
|
得到数据集如下:
1
2
3
4
5
6
7
8
9
10
11
12
|
data
key1 key2 num1 num2
0 a one 0.268705 0.084091
1 b two 0.876707 0.217794
2 c three 0.229999 0.574402
3 a two 0.707990 - 1.444415
4 c one 0.786064 0.343244
5 a one 0.587273 1.212391
6 b three 0.927396 1.505372
7 a one 0.295271 - 0.497633
8 c two 0.292721 0.098814
9 a three 0.369788 - 1.157426
|
2 交叉表—分类计数
按照不同类进行计数统计是最常见透视功能,可以通
(1)crosstab
1
2
|
#函数:
crosstab(index, columns, values = None , rownames = None , colnames = None , aggfunc = None , margins = False , dropna = True , normalize = False )
|
crosstab的index和columns是必须要指定复制的参数:
1
|
pd.crosstab(data.key1,data.key2)
|
结果如下:
1
2
3
4
5
|
key2 one three two
key1
a 3 1 1
b 0 1 1
c 1 1 1
|
想要在边框处增加汇总项可以指定margin的值为True:
1
|
pd.crosstab(data.key1,data.key2,margins = True )
|
结果:
1
2
3
4
5
6
|
key2 one three two All
key1
a 3 1 1 5
b 1 1 1 3
c 1 1 2 4
All 5 3 4 12
|
(2)pivot_table
函数:
1
|
pivot_table(data, values = None , index = None , columns = None , aggfunc = 'mean' , fill_value = None , margins = False , dropna = True , margins_name = 'All' )
|
使用pivot_table函数同样可以实现,运算函数默认值aggfunc='mean',指定为aggfunc='count'即可:
1
|
data.pivot_table( 'num1' ,index = 'key1' ,columns = 'key2' ,aggfunc = 'count' )
|
结果相同:
1
2
3
4
5
|
key2 one three two
key1
a 3 1 1
b 1 1 1
c 1 1 2
|
(3)groupby
通过groupby相对来说会更加复杂,首先需要对data按照key1和key2进行聚类,然后进行count运算,再将key2的index重塑为columns:
1
|
data.groupby([ 'key1' , 'key2' ])[ 'num1' ].count().unstack()
|
结果:
1
2
3
4
5
|
key2 one three two
key1
a 3 1 1
b 1 1 1
c 1 1 2
|
3 其它透视表运算
(1)pivot_table
1
|
pivot_table(data, values = None , index = None , columns = None , aggfunc = 'mean' , fill_value = None , margins = False , dropna = True , margins_name = 'All' )
|
要进行何种运算,只需要指定aggfunc即可。
默认计算均值:
1
|
data.pivot_table(index = 'key1' ,columns = 'key2' )
|
out:
1
2
3
4
5
6
|
num1 num2
key2 one three two one three two
key1
a 0.193332 0.705657 0.203155 - 0.165749 2.398164 - 1.293595
b 0.167947 0.204545 0.661460 0.555850 - 0.522528 0.143530
c 0.496993 0.033673 0.206028 - 0.115093 0.024650 0.077726
|
分类汇总呢并求和:
1
|
data.pivot_table(index = 'key1' ,columns = 'key2' ,aggfunc = 'sum' )
|
结果:
1
2
3
4
5
6
|
num1 num2
key2 one three two one three two
key1
a 0.579996 0.705657 0.203155 - 0.497246 2.398164 - 1.293595
b 0.167947 0.204545 0.661460 0.555850 - 0.522528 0.143530
c 0.496993 0.033673 0.412055 - 0.115093 0.024650 0.155452
|
也可以使用其它自定义函数:
1
2
3
|
#定义一个最大值减最小值的函数
def max_min (group):
return group. max () - group. min ()
|
1
|
data.pivot_table(index = 'key1' ,columns = 'key2' ,aggfunc = max_min)
|
结果:
1
2
3
4
5
6
|
num1 num2
key2 one three two one three two
key1
a 0.179266 0.0 0.000 3.109405 0.0 0.000000
b 0.000000 0.0 0.000 0.000000 0.0 0.000000
c 0.000000 0.0 0.177 0.000000 0.0 1.609466
|
(2)通过groupby
普通的函数如mean,sum可以直接应用:
data.groupby(['key1','key2']).mean().unstack()
返回结果:
1
2
3
4
5
6
|
num1 num2
key2 one three two one three two
key1
a 0.193332 0.705657 0.203155 - 0.165749 2.398164 - 1.293595
b 0.167947 0.204545 0.661460 0.555850 - 0.522528 0.143530
c 0.496993 0.033673 0.206028 - 0.115093 0.024650 0.077726
|
以上这篇用Python实现数据的透视表的方法就是小编分享给大家的全部内容了,希望能给大家一个参考,也希望大家多多支持服务器之家。
原文链接:https://blog.csdn.net/moxigandashu/article/details/69486269#3-其它透视表运算