How do I get the row count (and other statistics) for each group?

Time: 2021-03-02 15:22:05

I have a data frame df. I select several of its columns and group by two of them:

df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()

This almost gives me the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second has 10, and so on.

3 answers

#1


198  

On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:

df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
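
As a minimal sketch of what this call returns, the snippet below runs it on a small made-up frame (the values are illustrative and not taken from the question). The result has two-level column labels, so a single statistic is selected with a tuple such as ('col3', 'count').

```python
import pandas as pd

# Illustrative data only; the column names follow the question.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

stats = (
    df[['col1', 'col2', 'col3', 'col4']]
    .groupby(['col1', 'col2'])
    .agg(['mean', 'count'])
)

# Columns are a MultiIndex of (original column, statistic) pairs.
print(stats.columns.tolist())

# Pick one statistic with a tuple key; this is a Series indexed by group.
col3_counts = stats[('col3', 'count')]
print(col3_counts)
```

Note that 'count' here is computed per column and skips missing values, so it equals the group's row count only when the column has no NaN entries.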

#2


418  

Quick Answer:

The simplest way to get row counts per group is by calling .size(), which returns a Series:

df.groupby(['col1','col2']).size()


Usually you want this result as a DataFrame (instead of a Series) so you can do:

df.groupby(['col1', 'col2']).size().reset_index(name='counts')


If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.


Detailed example:

Consider the following example dataframe:

In [2]: df
Out[2]: 
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61 -0.49  1.49
1    A    B -1.53 -1.01 -0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D  0.12  0.59  0.81  0.66
5    C    D -0.13 -1.65 -1.64  0.50
6    C    D -1.42 -0.11 -0.18 -0.44
7    E    F -0.00  1.42 -0.26  1.17
8    E    F  0.91 -0.47  1.35 -0.34
9    G    H  1.48 -0.63 -1.14  0.17

First let's use .size() to get the row counts:

In [3]: df.groupby(['col1', 'col2']).size()
Out[3]: 
col1  col2
A     B       4
C     D       3
E     F       2
G     H       1
dtype: int64

Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:

In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]: 
  col1 col2  counts
0    A    B       4
1    C    D       3
2    E    F       2
3    G    H       1


Including results for more statistics

When you want to calculate statistics on grouped data, it usually looks like this:

In [5]: (df
   ...: .groupby(['col1', 'col2'])
   ...: .agg({
   ...:     'col3': ['mean', 'count'], 
   ...:     'col4': ['median', 'min', 'count']
   ...: }))
Out[5]: 
            col4                  col3      
          median   min count      mean count
col1 col2                                   
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3 -0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1

The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
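
If the nested labels are the only problem, one possible workaround is to flatten them after aggregating. The sketch below is an assumption about how you might do that, not part of the original answer: it uses a small made-up frame and joins each (column, statistic) pair with an underscore.

```python
import pandas as pd

# Illustrative data only; the column names follow the example above.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

result = df.groupby(['col1', 'col2']).agg({
    'col3': ['mean', 'count'],
    'col4': ['median', 'min', 'count'],
})

# Join each (column, statistic) pair into one flat label, e.g. 'col3_mean'.
result.columns = ['_'.join(pair) for pair in result.columns]

# Move the group keys out of the index and back into regular columns.
flat = result.reset_index()
print(flat)
```

This fixes the labels but still leaves one count per aggregated column.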

To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:

In [6]: gb = df.groupby(['col1', 'col2'])
   ...: counts = gb.size().to_frame(name='counts')
   ...: (counts
   ...:  .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
   ...:  .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
   ...:  .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
   ...:  .reset_index()
   ...: )
   ...: 
Out[6]: 
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3  -0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
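
A more compact way to get a similar flat table is named aggregation, which assumes pandas 0.25 or later. The sketch below uses a small made-up frame; the output column names (counts, col3_mean, and so on) are chosen here to mirror the join example and are otherwise arbitrary.

```python
import pandas as pd

# Illustrative data only; the column names follow the example above.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

summary = (
    df.groupby(['col1', 'col2'])
      .agg(
          # 'size' counts every row in the group, NaN or not.
          counts=('col3', 'size'),
          col3_mean=('col3', 'mean'),
          col4_median=('col4', 'median'),
          col4_min=('col4', 'min'),
      )
      .reset_index()
)
print(summary)
```

Each keyword names one output column and maps it to a (source column, function) pair, so no renaming or joining is needed afterwards.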



Footnotes

The code used to generate the test data is shown below:

In [1]: import numpy as np
   ...: import pandas as pd 
   ...: 
   ...: keys = np.array([
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['A', 'B'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['C', 'D'],
   ...:         ['E', 'F'],
   ...:         ['E', 'F'],
   ...:         ['G', 'H'] 
   ...:         ])
   ...: 
   ...: df = pd.DataFrame(
   ...:     np.hstack([keys,np.random.randn(10,4).round(2)]), 
   ...:     columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
   ...: )
   ...: 
   ...: df[['col3', 'col4', 'col5', 'col6']] = \
   ...:     df[['col3', 'col4', 'col5', 'col6']].astype(float)
   ...: 


Disclaimer:

If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN entries in the mean calculation without telling you about it.
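
To make the difference concrete, here is a minimal sketch with one missing value (the data is made up for illustration). size() reports every row in the group, while count() and mean() only see the non-null entries.

```python
import numpy as np
import pandas as pd

# Illustrative data: group 'A' has three rows, one of them NaN in col3.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C'],
    'col3': [1.0, np.nan, 3.0, 5.0],
})

gb = df.groupby('col1')
report = pd.DataFrame({
    'rows': gb.size(),                  # all rows in the group, NaN included
    'col3_count': gb['col3'].count(),   # non-null values only
    'col3_mean': gb['col3'].mean(),     # NaN is skipped by default
})
print(report)
```

For group 'A' the mean is computed from two values even though the group has three rows, which is why a per-column count is the safer number to report next to each mean.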

#3


3  

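We can easily do it by using groupby and count, followed by reset_index() to turn the group keys back into regular columns. Note that count() returns the number of non-null values in each remaining column rather than a single row count.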

df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()
