I would like to select a row with maximum value in each group with dplyr.
First, I generate some random data to illustrate my question:
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))
In plyr, I could use a custom function to select this row.
library(plyr)
ddply(df, .(A, B), function(x) x[which.max(x$value), ])
In dplyr, I use the following code, which returns the maximum value for each group but not the row that holds it (so column C is lost in this case):
library(dplyr)
df %>%
  group_by(A, B) %>%
  summarise(max = max(value))
How can I achieve this? Thanks for any suggestions.
sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.2  plyr_1.8.1

loaded via a namespace (and not attached):
[1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1
[4] tools_3.1.0
4 solutions
#1
77
Try this:
result <- df %>%
  group_by(A, B) %>%
  filter(value == max(value)) %>%
  arrange(A, B, C)
Seems to work:
identical(
  as.data.frame(result),
  ddply(df, .(A, B), function(x) x[which.max(x$value), ])
)
# [1] TRUE
As pointed out by @docendo in the comments, slice may be preferred here, as per @RoyalITS' answer below, if you strictly want only one row per group. This answer returns multiple rows for a group when several rows share the same maximum value.
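To make the difference concrete, here is a minimal sketch on a made-up data frame (ties_df, with columns g, id and value, is hypothetical and not part of the question's data) in which one group has two rows tied for the maximum. It assumes a dplyr version that provides slice().

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied for the maximum.
ties_df <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(3, 3, 1, 2, 5)
)

# filter() keeps every row whose value equals the group maximum.
all_max <- ties_df %>%
  group_by(g) %>%
  filter(value == max(value)) %>%
  ungroup()

# slice(which.max()) keeps only the first maximum in each group.
first_max <- ties_df %>%
  group_by(g) %>%
  slice(which.max(value)) %>%
  ungroup()

nrow(all_max)    # 3 rows: ids 1, 2 and 5
nrow(first_max)  # 2 rows: ids 1 and 5
```

Which behaviour is right depends on whether tied rows are meaningful in your data or just need to be collapsed to one representative.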
#2
45
You can use top_n
df %>%
  group_by(A, B) %>%
  top_n(n = 1)
This ranks by the last column (value) and returns the top n = 1 rows in each group.
Currently, you can't change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).
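That limitation applies to the dplyr version in the question (0.2). As a sketch for newer releases only, assuming dplyr 1.0.0 or later is installed, the ranking column can be named explicitly through the wt argument of top_n(), or you can use slice_max(), where with_ties = FALSE forces exactly one row per group. The names by_top_n and by_slice_max are just illustrative.

```r
library(dplyr)

# Same data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Name the ranking column explicitly instead of relying on the last column.
by_top_n <- df %>%
  group_by(A, B) %>%
  top_n(n = 1, wt = value) %>%
  ungroup()

# slice_max() is the newer spelling; with_ties = FALSE keeps one row per group.
by_slice_max <- df %>%
  group_by(A, B) %>%
  slice_max(order_by = value, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Both results keep column C, so they answer the original question; they can differ from each other only when a group has tied maxima, because top_n() keeps ties.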
#3
35
df %>% group_by(A,B) %>% slice(which.max(value))
#4
8
This more verbose solution gives greater control over what happens when the maximum value is duplicated (in this example, it picks one of the tied rows at random):
library(dplyr)
df %>%
  group_by(A, B) %>%
  mutate(the_rank = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank)
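Because ties.method = "random" draws from R's random number generator, the chosen row can change between runs. A minimal sketch of making it reproducible, using a made-up data frame with a tie (ties_df and the seed value 42 are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied for the maximum.
ties_df <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(3, 3, 1, 2, 5)
)

set.seed(42)  # fixes which tied row is chosen; the seed value is arbitrary

one_per_group <- ties_df %>%
  group_by(g) %>%
  mutate(the_rank = rank(-value, ties.method = "random")) %>%
  filter(the_rank == 1) %>%
  select(-the_rank) %>%
  ungroup()

nrow(one_per_group)  # always 2; group "a" contributes id 1 or id 2
```

Swapping "random" for "first" would instead always keep the earliest tied row, which matches what which.max() does.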