df:
name score
A 1
A 2
A 3
A 4
A 5
B 2
B 4
B 6
B 8
I want to get a new dataframe in the following form:
name count mean std min 25% 50% 75% max
A 5 3 .. .. .. .. .. ..
B 4 5 .. .. .. .. .. ..
How can I extract the information from df.describe() and reformat it? Thanks.
6 solutions
#1
0
Define some data
In[1]:
import pandas as pd
import io
data = """
name score
A 1
A 2
A 3
A 4
A 5
B 2
B 4
B 6
B 8
"""
df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
print(df)
Out[1]:
name score
0 A 1
1 A 2
2 A 3
3 A 4
4 A 5
5 B 2
6 B 4
7 B 6
8 B 8
Solution
A nice approach to this problem uses a generator expression (see footnote) to let pd.DataFrame() iterate over the results of groupby and construct the summary-stats dataframe on the fly:
In[2]:
df2 = pd.DataFrame(group.describe().rename(columns={'score':name}).squeeze()
for name, group in df.groupby('name'))
print(df2)
Out[2]:
count mean std min 25% 50% 75% max
A 5 3 1.581139 1 2.0 3 4.0 5
B 4 5 2.581989 2 3.5 5 6.5 8
Here the squeeze function squeezes out a dimension, converting the one-column DataFrame of group summary stats into a Series.
Footnote: A generator expression has the form my_function(a) for a in iterator or, if iterator yields two-element tuples as groupby does, my_function(a, b) for a, b in iterator.
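To make the footnote concrete, here is a minimal sketch of both ideas on the question's data: a generator expression that unpacks the (key, group) pairs produced by groupby, and squeeze() turning a one-column DataFrame into a Series. The names sizes, one_col and as_series are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# groupby yields (key, sub-frame) tuples, so the generator unpacks two names.
sizes = dict((name, len(group)) for name, group in df.groupby("name"))
print(sizes)  # {'A': 5, 'B': 4}

# describe() on a one-column frame returns a one-column frame;
# squeeze() drops the length-1 axis and returns a Series.
one_col = df[["score"]].describe()
as_series = one_col.squeeze()
print(type(one_col).__name__, "->", type(as_series).__name__)  # DataFrame -> Series
```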
#2
9
Nothing beats a one-liner:
In [145]:
# Written for pandas < 0.20, where groupby().describe() returns long-format output
print(df.groupby('name').describe().reset_index().pivot(index='name', values='score', columns='level_1'))
level_1 25% 50% 75% count max mean min std
name
A 2.0 3 4.0 5 5 3 1 1.581139
B 3.5 5 6.5 4 8 5 2 2.581989
#3
2
import pandas as pd
import io
import numpy as np
data = """
name score
A 1
A 2
A 3
A 4
A 5
B 2
B 4
B 6
B 8
"""
df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
df2 = df.groupby('name').describe().reset_index().T.drop('name')
arr = np.array(df2).reshape((4,8))
df2 = pd.DataFrame(arr[1:], index=['name','A','B'])
print(df2)
That will give you df2 as:
0 1 2 3 4 5 6 7
name count mean std min 25% 50% 75% max
A 5 3 1.58114 1 2 3 4 5
B 4 5 2.58199 2 3.5 5 6.5 8
#4
2
There is an even shorter one :)
print(df.groupby('name').describe().unstack(1))
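For readers unsure what unstack does here, the following sketch builds a long-format Series indexed by (name, statistic) and then pivots the inner level into columns. It builds the long format explicitly with pd.concat rather than relying on the output shape of groupby().describe(). That choice is an assumption made so the snippet behaves the same across pandas versions. The names long_stats and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# One describe() Series per group; the dict keys become the outer index level,
# giving a Series indexed by (name, statistic).
long_stats = pd.concat(
    {name: group["score"].describe() for name, group in df.groupby("name")}
)

# unstack() moves the innermost index level (the statistic) into columns,
# leaving one row per name.
wide = long_stats.unstack()
print(wide)
```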
#5
1
Well, I managed to get what you wanted, but it doesn't scale very well.
import pandas as pd
# nine names to match the nine scores: five 'a' rows and four 'b' rows
name = ['a','a','a','a','a','b','b','b','b']
score = [1,2,3,4,5,2,4,6,8]
d = pd.DataFrame(list(zip(name,score)), columns=['Name','Score'])
d = d.groupby('Name').describe()
d = d.reset_index()
df2 = pd.DataFrame(list(zip(d.level_1[8:], list(d.Score)[:8], list(d.Score)[8:])), columns = ['Name','A','B']).T
print(df2)
0 1 2 3 4 5 6 7
Name count mean std min 25% 50% 75% max
A 5 3 1.581139 1 2 3 4 5
B 4 5 2.581989 2 3.5 5 6.5 8
#6
0
The table is stored in a dataframe named df:
df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
Just select the column and call describe() to get the required output. The same pattern works for any column:
df.groupby('name')['score'].describe()
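As a sketch of how this plays out on recent pandas (0.20 or later is assumed here, where groupby().describe() returns one row per group), the call can be followed by reset_index() to turn name into an ordinary column, matching the layout asked for in the question. The data is rebuilt inline so the snippet runs on its own. The names summary and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"name": list("AAAAABBBB"),
                   "score": [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# One row per group, one column per summary statistic.
summary = df.groupby("name")["score"].describe()

# Move the group label out of the index so 'name' is a regular column.
result = summary.reset_index()
print(result)
```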