Pandas dataframe: how to apply describe() to each group and add to new columns?

df:

I want to get a new dataframe in the following form:

   name count mean std min 25% 50% 75% max
    A     5    3    .. ..  ..  ..  ..  ..
    B     4    5    .. ..  ..  ..  ..  ..

How can I extract the information from df.describe() and reformat it? Thanks

6 solutions

#1

Define some data

In[1]:
import pandas as pd
import io

data = """
name score
A      1
A      2
A      3
A      4
A      5
B      2
B      4
B      6
B      8
    """

# Raw string so that \s is passed to the regex engine, not treated as a string escape
df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')
print(df)

Out[1]:
  name  score
0    A      1
1    A      2
2    A      3
3    A      4
4    A      5
5    B      2
6    B      4
7    B      6
8    B      8

Solution

A nice approach to this problem uses a generator expression (see footnote) to allow pd.DataFrame() to iterate over the results of groupby, and construct the summary stats dataframe on the fly:

In[2]:
df2 = pd.DataFrame(group.describe().rename(columns={'score':name}).squeeze()
                         for name, group in df.groupby('name'))

print(df2)

Out[2]:
   count  mean       std  min  25%  50%  75%  max
A      5     3  1.581139    1  2.0    3  4.0    5
B      4     5  2.581989    2  3.5    5  6.5    8

Here squeeze() removes a dimension: it converts each group's one-column summary-stats DataFrame into a Series.
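As a small illustration of what squeeze() does, here is a sketch on a hand-built one-column DataFrame. The frame and its values are made up for the example; they only mimic the shape of a single group's describe() output.

```python
import pandas as pd

# A one-column DataFrame, similar in shape to one group's describe() result
# (hypothetical values, for illustration only)
one_col = pd.DataFrame({'A': [5.0, 3.0]}, index=['count', 'mean'])

# squeeze() drops the length-1 column axis and returns a Series;
# the column label becomes the Series name
s = one_col.squeeze()

print(type(s).__name__)
print(s.name)
```

Because each Series keeps its group label as its name, pd.DataFrame() can use those names as the row index when it stacks the Series together.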

Footnote: A generator expression has the form my_function(a) for a in iterator. If the iterator yields two-element tuples, as groupby does, the form is my_function(a, b) for a, b in iterator.
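Both forms from the footnote can be sketched on toy data. The small dataframe and the variable names squares and means are assumptions made for this example, not part of the answer above.

```python
import pandas as pd

# Single-variable form: my_function(a) for a in iterator
squares = list(x * x for x in range(4))

# Toy data, made up for this example
df = pd.DataFrame({'name': ['A', 'A', 'B'], 'score': [1, 3, 4]})

# Two-variable form: groupby yields (key, sub-frame) tuples,
# so the expression unpacks them into name and group
means = dict((name, group['score'].mean()) for name, group in df.groupby('name'))

print(squares)
print(means)
```

The generator is consumed lazily by list() or dict() here, in the same way pd.DataFrame() consumes it in the solution.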

#2

Nothing beats a one-liner:

In [145]:

# Note: relies on the long-format groupby().describe() output of older pandas (before 0.20)
print(df.groupby('name').describe().reset_index().pivot(index='name', values='score', columns='level_1'))

level_1  25%  50%  75%  count  max  mean  min       std
name                                                   
A        2.0    3  4.0      5    5     3    1  1.581139
B        3.5    5  6.5      4    8     5    2  2.581989

#3

import pandas as pd
import io
import numpy as np

data = """
name score
A      1
A      2
A      3
A      4
A      5
B      2
B      4
B      6
B      8
    """

df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')

df2 = df.groupby('name').describe().reset_index().T.drop('name')
arr = np.array(df2).reshape((4,8))

df2 = pd.DataFrame(arr[1:], index=['name','A','B'])

print(df2)

That will give you df2 as:

              0     1        2    3    4    5    6    7
    name  count  mean      std  min  25%  50%  75%  max
    A         5     3  1.58114    1    2    3    4    5
    B         4     5  2.58199    2  3.5    5  6.5    8

#4

There is an even shorter one :)

print(df.groupby('name').describe().unstack(1))
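The unstack(1) call makes sense when the describe() result is in long format, with one row per (group, statistic) pair, which as far as I know is what pandas returned before version 0.20. The sketch below assumes a recent pandas, where the per-group result is already wide, and rebuilds the long format with stack() purely to show what unstack(1) does. The names wide, long_form and back are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({'name': list('AAAAABBBB'),
                   'score': [1, 2, 3, 4, 5, 2, 4, 6, 8]})

# Recent pandas: one row per group, one column per statistic
wide = df.groupby('name')['score'].describe()

# Long format: a Series indexed by (name, statistic)
long_form = wide.stack()

# unstack(level=1) moves the statistic level back into the columns
back = long_form.unstack(level=1)

print(back)
```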

#5

Well I managed to get what you wanted but it doesn't scale very well.

import pandas as pd

name = ['a','a','a','a','a','b','b','b','b']  # same length as score (zip silently dropped the extra 'b')
score = [1,2,3,4,5,2,4,6,8]

d = pd.DataFrame(zip(name,score), columns=['Name','Score'])
d = d.groupby('Name').describe()
d = d.reset_index()
df2 = pd.DataFrame(zip(d.level_1[8:], list(d.Score)[:8], list(d.Score)[8:]), columns = ['Name','A','B']).T

print(df2)

          0     1         2    3    4    5    6    7
Name  count  mean       std  min  25%  50%  75%  max
A         5     3  1.581139    1    2    3    4    5
B         4     5  2.581989    2  3.5    5  6.5    8

#6

The table is stored in a dataframe named df:

df = pd.read_csv(io.StringIO(data), delimiter=r'\s+')

Just select the column by name, and describe() gives you the required output. This way you can compute the statistics for any column:

df.groupby('name')['score'].describe()
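To get name as an ordinary column, as in the layout the question asks for, one option is to follow the call with reset_index(). This is a minimal self-contained sketch, assuming a recent pandas where the per-group describe() result has one row per group. The variable name summary is made up for the example.

```python
import io
import pandas as pd

data = """
name score
A      1
A      2
A      3
A      4
A      5
B      2
B      4
B      6
B      8
"""

df = pd.read_csv(io.StringIO(data), sep=r'\s+')

# One row per group; reset_index() turns the group key into a regular 'name' column
summary = df.groupby('name')['score'].describe().reset_index()

print(summary)
```

The resulting columns are name, count, mean, std, min, 25%, 50%, 75% and max, in that order.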
