I am working with a MultiIndex DataFrame and want to do some groupby / apply() operations, but I am struggling with how to combine the two.
I would like to extract the values of two index levels of my DataFrame and compare them in an apply function.
For the rows where the comparison is true, I would like to do a groupby / sum over the values of my DataFrame.
Is there a good way to do this without using for loops?
import numpy as np
import pandas as pd

# Index specifier
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group']
)
df = pd.DataFrame(np.random.randn(60, 1), index=ix, columns=['Input'])
# Calculate sum over all projection periods for each simulation/group
all_periods = df.groupby(level=['SimulationStart', 'Group']).sum()
# This part of the code is not working yet
# is there a way to extract data from the indices of the DataFrame?
# Calculate sum over all projection periods for each simulation/group;
# where projection period is a maximum of one year in the future
one_year_ahead = df.groupby(level=['SimulationStart', 'Group']) \
.apply(lambda x: x['ProjectionPeriod'] - \
x['SimulationStart'] <= 1).sum()
2 solutions
#1
4
You could calculate the difference, ProjectionPeriod - SimulationStart, before performing the groupby/sum operation.
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
import numpy as np
import pandas as pd

# Integer year levels, so that the two levels can be subtracted
ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])
df = pd.DataFrame(np.random.randn(60, 1), index=ix, columns=['Input'])

# Boolean mask over the rows, computed from the index levels
get_values = df.index.get_level_values
mask = (get_values('ProjectionPeriod') - get_values('SimulationStart')) <= 1
one_year_ahead = df.loc[mask].groupby(level=['SimulationStart', 'Group']).sum()
print(one_year_ahead)
yields (the exact numbers differ from run to run because the input is random)
                          Input
SimulationStart Group
2015            A      0.821851
                B     -0.643342
                C     -0.140112
2016            A      0.384885
                B     -0.252186
                C     -1.057493
2017            A     -1.055933
                B      1.096221
                C     -4.150002
2018            A      0.584859
                B     -4.062078
                C      1.225105
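One thing to watch for: the question builds the year levels from strings ('2015', ...), while the example above uses integers. Subtracting string labels does not give a year difference, so the levels would need converting first. A minimal sketch, assuming every label is a plain year number; the all-ones data and the names df_str, start and period are made up for illustration:

```python
import numpy as np
import pandas as pd

# Index built from strings, as in the question (not as in the answer above).
ix = pd.MultiIndex.from_product(
    [['2015', '2016', '2017', '2018'],
     ['2016', '2017', '2018', '2019', '2020'],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])

# Hypothetical all-ones input, so each sum equals the number of rows kept.
df_str = pd.DataFrame(np.ones((60, 1)), index=ix, columns=['Input'])

# Convert the level values to integers before subtracting them.
start = df_str.index.get_level_values('SimulationStart').astype(int)
period = df_str.index.get_level_values('ProjectionPeriod').astype(int)
mask = (period - start) <= 1

# The DataFrame's own index is untouched, so the result keeps string labels.
one_year_ahead_str = (df_str.loc[mask]
                      .groupby(level=['SimulationStart', 'Group'])
                      .sum())
print(one_year_ahead_str)
```

Only the mask is computed from the converted values; if integer labels are wanted in the result as well, the index itself would have to be rebuilt.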
#2
3
Here is one way to do it: move the index levels into columns with reset_index(), filter with query(), then group and sum.
df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
.groupby(['SimulationStart', 'Group']).Input.sum()
SimulationStart  Group
2015             A        1.100246
                 B       -0.605710
                 C        1.366465
2016             A        0.359406
                 B       -2.077444
                 C       -0.004356
2017             A        0.604497
                 B       -0.362941
                 C        0.103945
2018             A       -0.861976
                 B       -0.737274
                 C        0.237512
Name: Input, dtype: float64
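Note that this filter keeps only rows exactly one year ahead (== 1), whereas a <= 1 mask also keeps projection periods that lie before the simulation start. Which of these matches "a maximum of one year in the future" is a judgment call. A small sketch to compare the conditions, using all-ones data (an assumption, not the question's data) so that each sum is just a row count; rows_per_group is a hypothetical helper:

```python
import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])

# Hypothetical all-ones input, so each sum equals the number of rows kept.
df_ones = pd.DataFrame(np.ones((60, 1)), index=ix, columns=['Input'])

# Year difference for every row, computed from the index levels.
diff = (df_ones.index.get_level_values('ProjectionPeriod')
        - df_ones.index.get_level_values('SimulationStart'))


def rows_per_group(mask):
    """Count the rows that pass `mask` for each (SimulationStart, Group)."""
    kept = df_ones.loc[mask]
    return kept.groupby(level=['SimulationStart', 'Group'])['Input'].sum()


at_most_one = rows_per_group(diff <= 1)                   # includes past periods
exactly_one = rows_per_group(diff == 1)                   # one year ahead only
zero_to_one = rows_per_group((diff >= 0) & (diff <= 1))   # same year or next year

comparison = pd.DataFrame({'<= 1': at_most_one,
                           '== 1': exactly_one,
                           '0 to 1': zero_to_one})
print(comparison)
```

With == 1 every (SimulationStart, Group) pair is left with a single row, so the sum returns that row's value unchanged.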
Because the Group values are unique within each SimulationStart once this filter is applied, the following also works, but I don't believe it's what you want.
df.reset_index().query('ProjectionPeriod - SimulationStart == 1') \
[['SimulationStart', 'Group', 'Input']]
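As a variation, query() can also refer to named index levels directly, so the reset_index() step may not be needed. A sketch under that assumption, with a fixed seed added only to make the comparison reproducible; via_levels and via_columns are illustrative names:

```python
import numpy as np
import pandas as pd

ix = pd.MultiIndex.from_product(
    [[2015, 2016, 2017, 2018],
     [2016, 2017, 2018, 2019, 2020],
     ['A', 'B', 'C']],
    names=['SimulationStart', 'ProjectionPeriod', 'Group'])

# Fixed seed (an assumption, not in the original) for reproducible numbers.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((60, 1)), index=ix, columns=['Input'])

# Filter on the index level names directly, then group on the levels.
via_levels = (df.query('ProjectionPeriod - SimulationStart == 1')
                .groupby(level=['SimulationStart', 'Group'])['Input']
                .sum())

# Same computation with the levels moved into columns first.
via_columns = (df.reset_index()
                 .query('ProjectionPeriod - SimulationStart == 1')
                 .groupby(['SimulationStart', 'Group'])['Input']
                 .sum())

# Both routes should give the same values for the same groups.
print(np.allclose(via_levels.to_numpy(), via_columns.to_numpy()))
```

Keeping the MultiIndex in place avoids rebuilding it afterwards if later steps still group by level.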