命名返回Pandas聚合函数中的列?

时间:2021-04-09 22:34:26

I'm having trouble with Pandas' groupby functionality. I've read the documentation, but I can't see to figure out how to apply aggregate functions to multiple columns and have custom names for those columns.

我在使用Pandas的groupby功能时遇到了麻烦。我已经阅读了文档,但是我无法弄清楚如何将聚合函数应用于多个列并为这些列提供自定义名称。

This comes very close, but the data structure returned has nested column headings:

这非常接近,但返回的数据结构具有嵌套的列标题:

data.groupby("Country").agg(
        {"column1": {"foo": sum()}, "column2": {"mean": np.mean, "std": np.std}})

(ie. I want to take the mean and std of column2, but return those columns as "mean" and "std")

(即。我想取column2的mean和std,但将这些列作为“mean”和“std”返回)

What am I missing?

我错过了什么?

5 个解决方案

#1


59  

This will drop the outermost level from the hierarchical column index:

这将从分层列索引中删除最外层:

df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

如果您想保持最外层,可以使用多级列上的ravel()函数来形成新标签:

df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

例如:

import pandas as pd
import pandas.rpy.common as com
import numpy as np

data = com.load_data('Loblolly')
print(data.head())
#     height  age Seed
# 1     4.51    3  301
# 15   10.89    5  301
# 29   28.72   10  301
# 43   41.74   15  301
# 57   52.70   20  301

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
print(df.head())
#       age     height           
#       sum        std       mean
# Seed                           
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667

df.columns = df.columns.droplevel(0)
print(df.head())

yields

产量

      sum        std       mean
Seed                           
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

或者,保持索引的第一级:

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

产量

      age_sum   height_std  height_mean
Seed                           
301        78    22.638417    33.246667
303        78    23.499706    34.106667
305        78    23.927090    35.115000
307        78    22.222266    31.328333
309        78    23.132574    33.781667

#2


28  

The currently accepted answer by unutbu describes are great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

unutbu描述的当前接受的答案是在pandas版本<= 0.20中这样做的好方法。但是,从pandas 0.20开始,使用此方法会发出警告,指示在将来的pandas版本中将无法使用该语法。

Series:

系列:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

FutureWarning:不推荐在系列上使用dict进行聚合,并且将在以后的版本中删除

DataFrames:

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

FutureWarning:不推荐使用带重命名的dict,将在以后的版本中删除

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

根据pandas 0.20 changelog,聚合时重命名列的推荐方法如下。

# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
    .agg({'B': 'sum', 'C': 'min'})
    .rename(columns={'B': 'foo', 'C': 'bar'})
)

Please see the 0.20 changelog for additional details.

有关其他详细信息,请参阅0.20更改日志。


Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

使用旧样式字典语法,可以将多个lambda函数传递给.agg,因为这些函数将使用传递的字典中的键重命名:

>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})

    B    
  max min
A        
1   2   0
2   4   3

Multiple functions can also be passed to a single column as a list:

多个函数也可以作为列表传递给单个列:

>>> df.groupby('A').agg({'B': [np.min, np.max]})

     B     
  amin amax
A          
1    0    2
2    3    4

However, this does not work with lambda functions, since they are anonymous and all return <lambda>, which causes a name collision:

但是,这不适用于lambda函数,因为它们是匿名的并且都返回 ,这会导致名称冲突:

>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max]})
SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

为了避免使用SpecificationError,可以先验地定义命名函数而不是使用lambda。合适的函数名称也避免在数据帧之后调用.rename。这些函数可以使用与上面相同的列表语法传递:

>>> def my_min(x):
>>>     return x.min()

>>> def my_max(x):
>>>     return x.max()

>>> df.groupby('A').agg({'B': [my_min, my_max]})

       B       
  my_min my_max
A              
1      0      2
2      3      4

#3


4  

If you want to have a behavior similar to JMP, creating column titles that keep all info from the multi index you can use:

如果您希望获得类似于JMP的行为,请创建列标题,以保留您可以使用的多索引中的所有信息:

newidx = []
for (n1,n2) in df.columns.ravel():
    newidx.append("%s-%s" % (n1,n2))
df.columns=newidx

It will change your dataframe from:

它将改变您的数据框架:

    I                       V
    mean        std         first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

to

    I-mean      I-std       V-first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

#4


0  

With the inspiration of @Joel Ostblom

借助@Joel Ostblom的灵感

For those who already have a workable dictionary for merely aggregation, you can use/modify the following code for the newer version aggregation, separating aggregation and renaming part. Please be aware of the nested dictionary if there are more than 1 item.

对于那些已经只有聚合的可行字典的人,可以使用/修改以下代码进行更新的版本聚合,分离聚合和重命名部分。如果有多个项目,请注意嵌套字典。

def agg_translate_agg_rename(input_agg_dict):
    agg_dict = {}
    rename_dict = {}
    for k, v in input_agg_dict.items():
        if len(v) == 1:
            agg_dict[k] = list(v.values())[0]
            rename_dict[k] = list(v.keys())[0]
        else:
            updated_index = 1
            for nested_dict_k, nested_dict_v in v.items():
                modified_key = k + "_" + str(updated_index)
                agg_dict[modified_key] = nested_dict_v
                rename_dict[modified_key] = nested_dict_k
                updated_index += 1
    return agg_dict, rename_dict

one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}}
agg, rename = agg_translator_aa(one_dict)

We get

我们明白了

agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'}
rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}

Please let me know if there is a smarter way to do it. Thanks.

如果有更聪明的方法,请告诉我。谢谢。

#5


0  

I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place (e.g. as is done with tidyverse's summarize in R), but a work-around in pandas for now is to create the new columns with desired names via assign before doing the aggregation:

我同意OP认为在同一个地方命名和定义输出列似乎更自然和一致(例如在R中用tidyverse总结),但现在在熊猫中解决方法是创建新列在进行聚合之前通过assign获得所需的名称:

data.assign(
    f=data['column1'],
    mean=data['column2'],
    std=data['column2']
).groupby('Country').agg(dict(f=sum, mean=np.mean, std=np.std)).reset_index()

(Using reset_index turns 'Country', 'f', 'mean', and 'std' all into regular columns with a separate integer index.)

(使用reset_index将'Country','f','mean'和'std'全部转换为带有单独整数索引的常规列。)

#1


59  

This will drop the outermost level from the hierarchical column index:

这将从分层列索引中删除最外层:

df = data.groupby(...).agg(...)
df.columns = df.columns.droplevel(0)

If you'd like to keep the outermost level, you can use the ravel() function on the multi-level column to form new labels:

如果您想保持最外层,可以使用多级列上的ravel()函数来形成新标签:

df.columns = ["_".join(x) for x in df.columns.ravel()]

For example:

例如:

import pandas as pd
import pandas.rpy.common as com
import numpy as np

data = com.load_data('Loblolly')
print(data.head())
#     height  age Seed
# 1     4.51    3  301
# 15   10.89    5  301
# 29   28.72   10  301
# 43   41.74   15  301
# 57   52.70   20  301

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
print(df.head())
#       age     height           
#       sum        std       mean
# Seed                           
# 301    78  22.638417  33.246667
# 303    78  23.499706  34.106667
# 305    78  23.927090  35.115000
# 307    78  22.222266  31.328333
# 309    78  23.132574  33.781667

df.columns = df.columns.droplevel(0)
print(df.head())

yields

产量

      sum        std       mean
Seed                           
301    78  22.638417  33.246667
303    78  23.499706  34.106667
305    78  23.927090  35.115000
307    78  22.222266  31.328333
309    78  23.132574  33.781667

Alternatively, to keep the first level of the index:

或者,保持索引的第一级:

df = data.groupby('Seed').agg(
    {'age':['sum'],
     'height':['mean', 'std']})
df.columns = ["_".join(x) for x in df.columns.ravel()]

yields

产量

      age_sum   height_std  height_mean
Seed                           
301        78    22.638417    33.246667
303        78    23.499706    34.106667
305        78    23.927090    35.115000
307        78    22.222266    31.328333
309        78    23.132574    33.781667

#2


28  

The currently accepted answer by unutbu describes are great way of doing this in pandas versions <= 0.20. However, as of pandas 0.20, using this method raises a warning indicating that the syntax will not be available in future versions of pandas.

unutbu描述的当前接受的答案是在pandas版本<= 0.20中这样做的好方法。但是,从pandas 0.20开始,使用此方法会发出警告,指示在将来的pandas版本中将无法使用该语法。

Series:

系列:

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version

FutureWarning:不推荐在系列上使用dict进行聚合,并且将在以后的版本中删除

DataFrames:

DataFrames:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version

FutureWarning:不推荐使用带重命名的dict,将在以后的版本中删除

According to the pandas 0.20 changelog, the recommended way of renaming columns while aggregating is as follows.

根据pandas 0.20 changelog,聚合时重命名列的推荐方法如下。

# Create a sample data frame
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# ==== SINGLE COLUMN (SERIES) ====
# Syntax soon to be deprecated
df.groupby('A').B.agg({'foo': 'count'})
# Recommended replacement syntax
df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

# ==== MULTI COLUMN ====
# Syntax soon to be deprecated
df.groupby('A').agg({'B': {'foo': 'sum'}, 'C': {'bar': 'min'}})
# Recommended replacement syntax
df.groupby('A').agg({'B': 'sum', 'C': 'min'}).rename(columns={'B': 'foo', 'C': 'bar'})
# As the recommended syntax is more verbose, parentheses can
# be used to introduce line breaks and increase readability
(df.groupby('A')
    .agg({'B': 'sum', 'C': 'min'})
    .rename(columns={'B': 'foo', 'C': 'bar'})
)

Please see the 0.20 changelog for additional details.

有关其他详细信息,请参阅0.20更改日志。


Update 2017-01-03 in response to @JunkMechanic's comment.

With the old style dictionary syntax, it was possible to pass multiple lambda functions to .agg, since these would be renamed with the key in the passed dictionary:

使用旧样式字典语法,可以将多个lambda函数传递给.agg,因为这些函数将使用传递的字典中的键重命名:

>>> df.groupby('A').agg({'B': {'min': lambda x: x.min(), 'max': lambda x: x.max()}})

    B    
  max min
A        
1   2   0
2   4   3

Multiple functions can also be passed to a single column as a list:

多个函数也可以作为列表传递给单个列:

>>> df.groupby('A').agg({'B': [np.min, np.max]})

     B     
  amin amax
A          
1    0    2
2    3    4

However, this does not work with lambda functions, since they are anonymous and all return <lambda>, which causes a name collision:

但是,这不适用于lambda函数,因为它们是匿名的并且都返回 ,这会导致名称冲突:

>>> df.groupby('A').agg({'B': [lambda x: x.min(), lambda x: x.max]})
SpecificationError: Function names must be unique, found multiple named <lambda>

To avoid the SpecificationError, named functions can be defined a priori instead of using lambda. Suitable function names also avoid calling .rename on the data frame afterwards. These functions can be passed with the same list syntax as above:

为了避免使用SpecificationError,可以先验地定义命名函数而不是使用lambda。合适的函数名称也避免在数据帧之后调用.rename。这些函数可以使用与上面相同的列表语法传递:

>>> def my_min(x):
>>>     return x.min()

>>> def my_max(x):
>>>     return x.max()

>>> df.groupby('A').agg({'B': [my_min, my_max]})

       B       
  my_min my_max
A              
1      0      2
2      3      4

#3


4  

If you want to have a behavior similar to JMP, creating column titles that keep all info from the multi index you can use:

如果您希望获得类似于JMP的行为,请创建列标题,以保留您可以使用的多索引中的所有信息:

newidx = []
for (n1,n2) in df.columns.ravel():
    newidx.append("%s-%s" % (n1,n2))
df.columns=newidx

It will change your dataframe from:

它将改变您的数据框架:

    I                       V
    mean        std         first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

to

    I-mean      I-std       V-first
V
4200.0  25.499536   31.557133   4200.0
4300.0  25.605662   31.678046   4300.0
4400.0  26.679005   32.919996   4400.0
4500.0  26.786458   32.811633   4500.0

#4


0  

With the inspiration of @Joel Ostblom

借助@Joel Ostblom的灵感

For those who already have a workable dictionary for merely aggregation, you can use/modify the following code for the newer version aggregation, separating aggregation and renaming part. Please be aware of the nested dictionary if there are more than 1 item.

对于那些已经只有聚合的可行字典的人,可以使用/修改以下代码进行更新的版本聚合,分离聚合和重命名部分。如果有多个项目,请注意嵌套字典。

def agg_translate_agg_rename(input_agg_dict):
    agg_dict = {}
    rename_dict = {}
    for k, v in input_agg_dict.items():
        if len(v) == 1:
            agg_dict[k] = list(v.values())[0]
            rename_dict[k] = list(v.keys())[0]
        else:
            updated_index = 1
            for nested_dict_k, nested_dict_v in v.items():
                modified_key = k + "_" + str(updated_index)
                agg_dict[modified_key] = nested_dict_v
                rename_dict[modified_key] = nested_dict_k
                updated_index += 1
    return agg_dict, rename_dict

one_dict = {"column1": {"foo": 'sum'}, "column2": {"mean": 'mean', "std": 'std'}}
agg, rename = agg_translator_aa(one_dict)

We get

我们明白了

agg = {'column1': 'sum', 'column2_1': 'mean', 'column2_2': 'std'}
rename = {'column1': 'foo', 'column2_1': 'mean', 'column2_2': 'std'}

Please let me know if there is a smarter way to do it. Thanks.

如果有更聪明的方法,请告诉我。谢谢。

#5


0  

I agree with the OP that it seems more natural and consistent to name and define the output columns in the same place (e.g. as is done with tidyverse's summarize in R), but a work-around in pandas for now is to create the new columns with desired names via assign before doing the aggregation:

我同意OP认为在同一个地方命名和定义输出列似乎更自然和一致(例如在R中用tidyverse总结),但现在在熊猫中解决方法是创建新列在进行聚合之前通过assign获得所需的名称:

data.assign(
    f=data['column1'],
    mean=data['column2'],
    std=data['column2']
).groupby('Country').agg(dict(f=sum, mean=np.mean, std=np.std)).reset_index()

(Using reset_index turns 'Country', 'f', 'mean', and 'std' all into regular columns with a separate integer index.)

(使用reset_index将'Country','f','mean'和'std'全部转换为带有单独整数索引的常规列。)