Combining groupby and apply on MultiIndex DataFrames

I am working with a multiIndex DataFrame and want to do some groupby / apply() operations. I am struggling with how to combine groupby and apply.

I would like to extract the values of two indices of my DataFrame and compare those values in an apply function.

For those occurrences where the apply function is true, I would like to do a groupby / sum over the values of my DataFrame.

Is there a good way to do this without using for loops?

import numpy as np
import pandas as pd

# Index specifier
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group']
)

df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()

# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
                   .apply(lambda x: x['ProjectionPeriod'] - \
                                    x['SimulationStart'] <= 1).sum()

2 Answers

#1

You could calculate the difference, ProjectionPeriod - SimulationStart, before performing the groupby/sum operation.

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()

For example,

import numpy as np
import pandas as pd
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018], 
     [2016, 2017, 2018, 2019, 2020], ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60,1), index= ix, columns=['Input'])

get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)

yields

                          Input
SimulationStart Group          
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105
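
Note that the example above builds the index from integer years, while the question builds it from strings, and strings cannot be subtracted this way. If you want to keep the string labels, one option is to cast the level values just for the comparison. This is a minimal sketch, assuming every label in both levels parses as an integer; the names `start` and `period` are only for illustration.

```python
import numpy as np
import pandas as pd

# Same index as in the question, with the years stored as strings
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60, 1), index=ix, columns=['Input'])

# Cast the two year levels to integers only for building the mask;
# the index of df itself keeps its string labels
start = df.index.get_level_values('SimulationStart').astype(int)
period = df.index.get_level_values('ProjectionPeriod').astype(int)
mask = (period - start) <= 1

one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
```

The result has one row per (SimulationStart, Group) pair, and the index of the result still holds the original string labels.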

#2

Here is one way to do it.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
    .groupby(['SimulationStart', 'Group']).Input.sum()

SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64

Because each (SimulationStart, Group) pair occurs only once after this filter, the following also works, but I don't believe it's what you want.

df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
    [['SimulationStart', 'Group', 'Input']]
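
The question asks for projection periods that are at most one year ahead, which reads as `<= 1` rather than `== 1`. Here is a sketch of the same `query` approach with that condition, checked against the boolean-mask approach. It assumes the integer-year index; `via_query` and `via_mask` are illustrative names.

```python
import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60, 1), index=ix, columns=['Input'])

# query() on the flattened frame, keeping periods at most one year ahead
via_query = (df.reset_index()
               .query('ProjectionPeriod - SimulationStart <= 1')
               .groupby(['SimulationStart', 'Group'])['Input'].sum())

# The same filter expressed as a boolean mask on the index levels
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
via_mask = df.loc[mask].groupby(level=['SimulationStart', 'Group'])['Input'].sum()

# Both routes should give the same sums
print(np.allclose(via_query.values, via_mask.values))
```

With `<= 1`, several rows can pass the filter for the same (SimulationStart, Group) pair, so the groupby/sum step is needed and plain column selection is no longer equivalent.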
