I have a pandas DataFrame, call it df1, with many columns (col1, col2, ...).
I want to group the data on two particular columns - say col4 and col7
In each group, I want to find the top K values in col9.
Then, for the rows that have those top K values in col9, I want to find the mean of the values in col10.
I attempted to solve it as shown below:
consideredCols = ['col4', 'col7']
k_value = 3
grp_data = df1.groupby(consideredCols)
print(grp_data.apply(lambda grp: (grp.col9.nlargest(k_value)).mean('col10')))
Example (showing the data after the groupby ['col4', 'col7'] step):
           col9  col10
col4 col7
john doe      5     12
              4     15
             11      9
              4     14
jane doe     42    421
             50     42
            124     27
             15     25
If K=2 here, then I want the result to be (12+9)/2 for John and (42+27)/2 for Jane.
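For concreteness, a frame matching this example might be built roughly as below (a sketch: the names and numbers come from the display above, and the restriction to only these four columns is assumed):
import pandas as pd

# hypothetical df1 with only the columns shown in the example
df1 = pd.DataFrame({
    'col4': ['john'] * 4 + ['jane'] * 4,
    'col7': ['doe'] * 8,
    'col9': [5, 4, 11, 4, 42, 50, 124, 15],
    'col10': [12, 15, 9, 14, 421, 42, 27, 25],
})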
1 Answer
#1
You are close - you need DataFrame.nlargest so that you don't lose column col10, and then take the mean (output shown with k_value = 2 to match the example):
grp_data = df1.groupby(consideredCols)
print(grp_data.apply(lambda grp: (grp.nlargest(k_value, 'col9'))['col10'].mean()))
col4 col7
jane doe 34.5
john doe 10.5
dtype: float64
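To see why DataFrame.nlargest matters here, a small sketch on a single hypothetical group built from the example numbers (grp here is only an illustration, not one of the real groups):
import pandas as pd

grp = pd.DataFrame({'col9': [5, 4, 11, 4], 'col10': [12, 15, 9, 14]})
print (grp['col9'].nlargest(2))                 # Series.nlargest - col10 is lost
print (grp.nlargest(2, 'col9'))                 # DataFrame.nlargest - col10 is kept
print (grp.nlargest(2, 'col9')['col10'].mean()) # (9 + 12) / 2 = 10.5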
Another solution with sort_values and head:
out = (df1.sort_values(['col4','col7','col9'], ascending=[True, True, False])
          .groupby(consideredCols)
          .apply(lambda grp: grp.head(2)['col10'].mean()))
print (out)
col4 col7
jane doe 34.5
john doe 10.5
dtype: float64
If you also need the overall mean of these per-group means:
out = (df1.sort_values(['col4','col7','col9'], ascending=[True, True, False])
          .groupby(consideredCols)
          .apply(lambda grp: grp.head(2)['col10'].mean())).mean()
print (out)
22.5
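It is also possible to avoid apply by sorting once and keeping the first k_value rows of every group with GroupBy.head (a sketch, assuming the same consideredCols and k_value as above):
topk = (df1.sort_values('col9', ascending=False)
           .groupby(consideredCols)
           .head(k_value))
print (topk.groupby(consideredCols)['col10'].mean())
This skips the per-group Python function call, which can be noticeably faster on large frames.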
For a better understanding of functions used with apply, it is best to first write a custom function with prints for debugging; it can then be rewritten with a lambda function:
consideredCols = ['col4', 'col7']
k_value = 2
def f(grp):
    print (grp)
    print (grp.nlargest(k_value, 'col9'))
    print (grp.nlargest(k_value, 'col9')['col10'].mean())
    return grp.nlargest(k_value, 'col9')['col10'].mean()
grp_data = df1.groupby(consideredCols)
print(grp_data.apply(f))