Find the mean of the values in column B that lie in rows containing one of the K largest elements in column A: Pandas DataFrame GroupBy Object

Time: 2021-02-06 15:50:40

I have a pandas DataFrame, call it df1, with many columns (col1, col2, ...).

I want to group the data on two particular columns - say col4 and col7.

In each group, I want to find the top K values in col9.

Then, I want to find the mean of values in col10, which satisfy the condition of having the top K values in col9.

I attempted to solve it as shown below:

consideredCols = ['col4', 'col7']
k_value = 3
grp_data = df1.groupby(consideredCols)
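# note: grp.col9.nlargest(k_value) keeps only col9, so col10 is no longer available to average here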
print(grp_data.apply(lambda grp: (grp.col9.nlargest(k_value)).mean('col10')))

Example (showing the data after the groupby(['col4', 'col7']) step):

                col9  col10
col4  col7
john  doe          5     12
                   4     15
                  11      9
                   4     14
jane  doe         42    421
                  50     42
                 124     27
                  15     25

If K=2 here, then I want the result to be (12+9)/2 for John and (42+27)/2 for Jane.

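For reference, a minimal DataFrame that reproduces the example above could be built like this (a sketch; any other columns of df1 are omitted):

import pandas as pd

# sample data matching the example shown above (other columns omitted)
df1 = pd.DataFrame({
    'col4': ['john'] * 4 + ['jane'] * 4,
    'col7': ['doe'] * 8,
    'col9': [5, 4, 11, 4, 42, 50, 124, 15],
    'col10': [12, 15, 9, 14, 421, 42, 27, 25],
})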

1 solution

#1

You are close - you need DataFrame.nlargest so that column col10 is not lost, and then take the mean (output shown with k_value = 2):

grp_data = df1.groupby(consideredCols)
print(grp_data.apply(lambda grp: (grp.nlargest(k_value, 'col9'))['col10'].mean()))

col4  col7
jane  doe     34.5
john  doe     10.5
dtype: float64
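To check a single group by hand (a small sketch; the key ('john', 'doe') is taken from the example data above), GroupBy.get_group can be used:

john = grp_data.get_group(('john', 'doe'))
print(john.nlargest(k_value, 'col9'))                   # the k_value rows with the largest col9
print(john.nlargest(k_value, 'col9')['col10'].mean())   # 10.5 with k_value = 2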

Another solution with sort_values and head:

out = (df1.sort_values(['col4','col7','col9'], ascending=[True, True, False])
              .groupby(consideredCols)
              .apply(lambda grp: grp.head(2)['col10'].mean()))
print (out)

col4  col7
jane  doe     34.5
john  doe     10.5
dtype: float64

out = (df1.sort_values(['col4','col7','col9'], ascending=[True, True, False])
              .groupby(consideredCols)
              .apply(lambda grp: grp.head(2)['col10'].mean())).mean()
print (out)
22.5

To better understand a function used with apply, it is best to write it first as a named function with print calls; it can then be rewritten as a lambda:

consideredCols = ['col4', 'col7']
k_value = 2

def f(grp):
    print(grp)

    print(grp.nlargest(k_value, 'col9'))
    print(grp.nlargest(k_value, 'col9')['col10'].mean())

    return grp.nlargest(k_value, 'col9')['col10'].mean()

grp_data = df1.groupby(consideredCols)
print(grp_data.apply(f))
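As a further option (a sketch that is not part of the original answer), the same result can be computed without apply by sorting on col9, keeping the first k_value rows of each group with GroupBy.head, and then averaging col10:

# assumes the same df1, consideredCols and k_value as above
out = (df1.sort_values('col9', ascending=False)   # largest col9 values first
          .groupby(consideredCols)
          .head(k_value)                          # top-k rows per (col4, col7) group
          .groupby(consideredCols)['col10']
          .mean())
print(out)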
