How to select the rows with the maximum value in each group with dplyr?

I would like to select the row with the maximum value in each group using dplyr.

First, I generate some random data to illustrate my question:

set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

In plyr, I could use a custom function to select this row:

library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value), ])

In dplyr, I am using the code below. It returns the maximum value per group, but not the full rows that hold that maximum (column C is lost in this case):

library(dplyr)
df %>%
  group_by(A, B) %>%
  summarise(max = max(value))

How could I achieve this? Thanks for any suggestions.

sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1
[4] tools_3.1.0

4 solutions

#1

Try this:

result <- df %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)

This seems to work:

identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value), ])
)
#[1] TRUE

As pointed out by @docendo in the comments, slice may be preferred here, as in @RoyalITS' answer below, if you strictly want only one row per group. This answer returns multiple rows for a group when several rows share the same maximum value.
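To make the difference on ties concrete, here is a minimal sketch. The `toy` data frame and its column names are made up for illustration and are not part of the question; group "a" deliberately has two rows tied for the maximum.

```r
library(dplyr)

# Hypothetical toy data: in group "a", rows 2 and 3 share the maximum value.
toy <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(1, 3, 3, 2, 5)
)

# filter() keeps every row equal to the group maximum, so both tied rows survive.
with_filter <- toy %>%
  group_by(g) %>%
  filter(value == max(value))

# slice(which.max()) keeps only the first maximal row of each group.
with_slice <- toy %>%
  group_by(g) %>%
  slice(which.max(value))

nrow(with_filter)  # 3 rows: ids 2, 3 and 5
nrow(with_slice)   # 2 rows: ids 2 and 5
```

If ties cannot occur in your data, the two approaches should select the same rows.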

#2

You can use top_n:

df %>% group_by(A, B) %>% top_n(n=1)

This ranks by the last column (value) and returns the top n = 1 rows of each group.

Currently, you can't change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).
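For readers on a later dplyr release, the ranking column can be named explicitly through the `wt` argument of `top_n()`. The sketch below assumes a dplyr version in which that argument works without the error described in the linked issue; it will not help on the dplyr 0.2 shown in the question.

```r
library(dplyr)

# Same data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Name the ranking column explicitly instead of relying on "last column".
# Assumption: a dplyr version where the wt argument of top_n() is usable.
top_rows <- df %>%
  group_by(A, B) %>%
  top_n(n = 1, wt = value)
```

Like the filter approach, `top_n()` keeps all tied rows, so a group can still return more than one row when its maximum is duplicated.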

#3

df %>% group_by(A,B) %>% slice(which.max(value))
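In dplyr 1.0.0 and later, `slice_max()` expresses the same idea more directly. This is a sketch that assumes such a version is installed; it is not available in the dplyr 0.2 from the question.

```r
library(dplyr)

# Same data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Assumption: dplyr >= 1.0.0, where slice_max() exists.
# with_ties = FALSE guarantees at most one row per group even if the maximum repeats.
one_per_group <- df %>%
  group_by(A, B) %>%
  slice_max(value, n = 1, with_ties = FALSE)
```

Leaving `with_ties` at its default of TRUE would instead keep every row tied for the maximum.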

#4

This more verbose solution gives greater control over what happens when the maximum value is duplicated (in this example, it takes one of the corresponding rows at random):

library(dplyr)
df %>%
  group_by(A, B) %>%
  mutate(the_rank = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank)
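The control comes from the `ties.method` argument of `rank()`. As a sketch of a deterministic variant, swapping "random" for "first" always keeps the earliest tied row. The `toy` data frame here is hypothetical and only serves to force a tie.

```r
library(dplyr)

# Hypothetical toy data: in group "a", rows 2 and 3 share the maximum value.
toy <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(1, 3, 3, 2, 5)
)

# ties.method = "first" gives rank 1 to the earliest of the tied rows,
# so the result is reproducible without setting a seed.
first_of_ties <- toy %>%
  group_by(g) %>%
  mutate(the_rank = rank(-value, ties.method = "first")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank)

first_of_ties$id  # 2 5
```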
