I have a data frame df and I use several columns from it to groupby:
df[['col1','col2','col3','col4']].groupby(['col1','col2']).mean()
This almost gives me the table (data frame) that I need. What is missing is an additional column containing the number of rows in each group. In other words, I have the means, but I would also like to know how many values were used to compute them. For example, the first group has 8 values, the second has 10, and so on.
3 Answers
#1
198
On a groupby object, the agg function can take a list to apply several aggregation methods at once. This should give you the result you need:
df[['col1', 'col2', 'col3', 'col4']].groupby(['col1', 'col2']).agg(['mean', 'count'])
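As a minimal sketch of what this call returns, assuming a small made-up frame (the values below are illustrative and not from the question): the result has two-level column labels, with one mean and one count per value column.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    'col1': ['A', 'A', 'C'],
    'col2': ['B', 'B', 'D'],
    'col3': [1.0, 3.0, 5.0],
    'col4': [2.0, 4.0, 6.0],
})

result = (df[['col1', 'col2', 'col3', 'col4']]
          .groupby(['col1', 'col2'])
          .agg(['mean', 'count']))

# Columns are a MultiIndex:
# (col3, mean), (col3, count), (col4, mean), (col4, count)
print(result)

# A single statistic is selected with a tuple key.
print(result[('col3', 'count')])
```

Note that count here is computed separately for each column and ignores null values, so it equals the group's row count only when the column has no missing entries.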
#2
418
Quick Answer:
The simplest way to get row counts per group is by calling .size(), which returns a Series:
df.groupby(['col1','col2']).size()
Usually you want this result as a DataFrame (instead of a Series), so you can do:
df.groupby(['col1', 'col2']).size().reset_index(name='counts')
If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
Detailed example:
Consider the following example dataframe:
In [2]: df
Out[2]:
col1 col2 col3 col4 col5 col6
0 A B 0.20 -0.61 -0.49 1.49
1 A B -1.53 -1.01 -0.39 1.82
2 A B -0.44 0.27 0.72 0.11
3 A B 0.28 -1.32 0.38 0.18
4 C D 0.12 0.59 0.81 0.66
5 C D -0.13 -1.65 -1.64 0.50
6 C D -1.42 -0.11 -0.18 -0.44
7 E F -0.00 1.42 -0.26 1.17
8 E F 0.91 -0.47 1.35 -0.34
9 G H 1.48 -0.63 -1.14 0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(['col1', 'col2']).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name='counts') to get the row counts as a DataFrame:
In [4]: df.groupby(['col1', 'col2']).size().reset_index(name='counts')
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
Including results for more statistics
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(['col1', 'col2'])
...: .agg({
...: 'col3': ['mean', 'count'],
...: 'col4': ['median', 'min', 'count']
...: }))
Out[5]:
col4 col3
median min count mean count
col1 col2
A B -0.810 -1.32 4 -0.372500 4
C D -0.110 -1.65 3 -0.476667 3
E F 0.475 -0.47 2 0.455000 2
G H -0.630 -0.63 1 1.480000 1
The result above is a little annoying to deal with because of the nested column labels, and also because row counts are on a per column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join. It looks like this:
In [6]: gb = df.groupby(['col1', 'col2'])
...: counts = gb.size().to_frame(name='counts')
...: (counts
...: .join(gb.agg({'col3': 'mean'}).rename(columns={'col3': 'col3_mean'}))
...: .join(gb.agg({'col4': 'median'}).rename(columns={'col4': 'col4_median'}))
...: .join(gb.agg({'col4': 'min'}).rename(columns={'col4': 'col4_min'}))
...: .reset_index()
...: )
...:
Out[6]:
col1 col2 counts col3_mean col4_median col4_min
0 A B 4 -0.372500 -0.810 -1.32
1 C D 3 -0.476667 -0.110 -1.65
2 E F 2 0.455000 0.475 -0.47
3 G H 1 1.480000 -0.630 -0.63
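As an alternative to the join approach, and not part of the original answer, pandas 0.25 and later supports named aggregation, which produces flat column names in a single agg call. The sketch below rebuilds col1 to col4 of the example frame by hand; the output column names are chosen here to match the table above.

```python
import pandas as pd

# col1..col4 of the example dataframe shown earlier, typed in by hand.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'A', 'C', 'C', 'C', 'E', 'E', 'G'],
    'col2': ['B', 'B', 'B', 'B', 'D', 'D', 'D', 'F', 'F', 'H'],
    'col3': [0.20, -1.53, -0.44, 0.28, 0.12, -0.13, -1.42, -0.00, 0.91, 1.48],
    'col4': [-0.61, -1.01, 0.27, -1.32, 0.59, -1.65, -0.11, 1.42, -0.47, -0.63],
})

# Each keyword is an output column name; each value is a
# (source column, aggregation) pair.
summary = (df
           .groupby(['col1', 'col2'])
           .agg(
               counts=('col3', 'size'),        # rows per group, nulls included
               col3_mean=('col3', 'mean'),
               col4_median=('col4', 'median'),
               col4_min=('col4', 'min'),
           )
           .reset_index())

print(summary)
```

Using 'size' rather than 'count' for the counts column keeps it a true row count even if col3 contains missing values.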
Footnotes
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['A', 'B'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['C', 'D'],
...: ['E', 'F'],
...: ['E', 'F'],
...: ['G', 'H']
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
...: )
...:
...: df[['col3', 'col4', 'col5', 'col6']] = \
...: df[['col3', 'col4', 'col5', 'col6']].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to look at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean, because pandas will drop NaN entries in the mean calculation without telling you about it.
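A small sketch of the difference, using a made-up frame with one missing value (the data is hypothetical): size() counts every row in the group, while count() counts only the non-null values that actually feed into the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in group 'A'.
df = pd.DataFrame({
    'col1': ['A', 'A', 'A', 'C', 'C'],
    'col3': [1.0, np.nan, 3.0, 4.0, 6.0],
})

gb = df.groupby('col1')

sizes = gb.size()            # rows per group, NaN rows included
counts = gb['col3'].count()  # non-null col3 values per group
means = gb['col3'].mean()    # NaN is skipped silently

print(pd.DataFrame({'size': sizes, 'count': counts, 'mean': means}))
```

For group A the size is 3 but the count is 2, and the mean of 2.0 is computed from those two values only.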
#3
3
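We can easily do it by using groupby and count, but we should remember to call reset_index() so the group keys become regular columns again. Note that count() reports the number of non-null values in each remaining column, not the number of rows.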
df[['col1','col2','col3','col4']].groupby(['col1','col2']).count().\
reset_index()