Pandas: aggregate a column based on the values in a different column

Time: 2021-03-21 22:44:17

Let's say I start with a dataframe that looks like this:

    Group   Val     date
0   home    first   2017-12-01
1   home    second  2017-12-02
2   away    first   2018-03-07
3   away    second  2018-03-01

The data types are [string, string, datetime]. I would like to get a dataframe that shows, for each group, the value that was entered most recently:

    Group   Most recent Val     Most recent date
0   home    second              12-02-2017
1   away    first               03-07-2018

(Data types are [string, string, datetime])

My initial thought is that I should be able to do this by grouping by 'Group' and then aggregating the dates and values. I know I can get the most recent datetime with the 'max' aggregation function, but I'm stuck on which function to use to get the corresponding Val:

df.groupby('Group').agg({'Val': lambda x: ____????____,
                         'date': 'max'})

Thanks,

2 Solutions

#1


If I understood you correctly, you can do this:

df.loc[df.groupby('Group')['date'].idxmax()]

Or as a complete example:

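import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so the sample data is reproducible

# Ten random (Group, Val, date) rows; the float is read as nanoseconds since the epoch
data = [(np.random.choice(['home', 'away'], size=1)[0],
         np.random.choice(['first', 'second'], size=1)[0],
         pd.Timestamp(np.random.rand()*1.9989e+18)) for i in range(10)]

df = pd.DataFrame.from_records(data)
df.columns = ['Group', 'Val', 'date']

# idxmax returns index labels, so select the rows with .loc
df.loc[df.groupby('Group')['date'].idxmax()]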

Which selects

  Group     Val                          date
5  away   first 2031-06-09 06:26:43.486610432
0  home  second 2030-03-22 04:07:07.082781440

from

  Group     Val                          date
0  home  second 2030-03-22 04:07:07.082781440
1  home  second 2007-12-03 05:07:24.061456384
2  home  second 1979-11-18 23:57:26.700035456
3  home   first 2024-11-12 08:18:17.789517824
4  away  second 2014-11-07 13:17:55.756515328
5  away   first 2031-06-09 06:26:43.486610432
6  away  second 1983-06-14 13:17:28.334806208
7  away  second 1981-08-14 03:21:14.746028864
8  away  second 2003-03-29 11:00:31.189680256
9  away   first 1988-06-12 16:58:48.341865984
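Applied to the four-row frame from the question, the same idea can be sketched as follows. This is only an illustration: the column renaming and the `sort=False` argument (used here to keep the groups in their original home/away order) are my additions, not part of the answer above.

```python
import pandas as pd

# The four-row frame from the question
df = pd.DataFrame({
    "Group": ["home", "home", "away", "away"],
    "Val": ["first", "second", "first", "second"],
    "date": pd.to_datetime(["2017-12-01", "2017-12-02",
                            "2018-03-07", "2018-03-01"]),
})

# Index label of the latest date within each group
# (sort=False keeps groups in order of first appearance)
latest_idx = df.groupby("Group", sort=False)["date"].idxmax()

# Pick those rows by label, then rename to the headers the question asks for
result = (
    df.loc[latest_idx]
      .rename(columns={"Val": "Most recent Val", "date": "Most recent date"})
      .reset_index(drop=True)
)
print(result)
```

The result keeps one row per group, with the Val that belongs to the latest date and the date column still typed as datetime.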

#2


First, get the index labels of the rows that hold the maximum date within each group:

max_indices = df.groupby(['Group'])['date'].idxmax()

and then select the corresponding rows in the original dataframe, optionally keeping only the column you are interested in:

df.loc[max_indices, 'Val']
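One detail worth noting: `idxmax` returns index labels rather than integer positions, so the two only coincide when the dataframe has the default 0..n-1 index. The sketch below uses the question's data with a made-up non-default index (the labels 10 to 40 are purely hypothetical) to show label-based selection with `.loc`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Group": ["home", "home", "away", "away"],
        "Val": ["first", "second", "first", "second"],
        "date": pd.to_datetime(["2017-12-01", "2017-12-02",
                                "2018-03-07", "2018-03-01"]),
    },
    index=[10, 20, 30, 40],  # hypothetical non-default labels, for illustration
)

# idxmax yields the labels of the rows with the latest date per group
max_labels = df.groupby("Group")["date"].idxmax()
print(max_labels.tolist())

# Label-based lookup finds the right rows regardless of the index type
latest_vals = df.loc[max_labels, "Val"]
print(latest_vals.tolist())
```

With this index, `.iloc` would interpret the labels as positions, which are out of bounds for a four-row frame, so `.loc` is the safer choice here.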
