I have a data frame that looks as follows:
> df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
> df
Source: local data frame [6 x 2]
g x
1 A 7
2 A 3
3 B 5
4 B 9
5 B 2
6 C 4
I know how to add a column with the maximum x value for each group g:
> df %>% group_by(g) %>% mutate(x_max = max(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max
1 A 7 7
2 A 3 7
3 B 5 9
4 B 9 9
5 B 2 9
6 C 4 4
But what I would like to get is the maximum x value for each group g, excluding the x value in the current row.
For the given example, the desired output would look like this:
Source: local data frame [6 x 3]
Groups: g
g x x_max x_max_exclude
1 A 7 7 3
2 A 3 7 7
3 B 5 9 9
4 B 9 9 5
5 B 2 9 9
6 C 4 4 NA
I thought I might be able to use row_number() to remove particular elements and take the max of what remained, but I hit warning messages and got incorrect -Inf output:
> df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
Source: local data frame [6 x 5]
Groups: g
g x x_max r x_max_exclude
1 A 7 7 1 -Inf
2 A 3 7 2 -Inf
3 B 5 9 1 -Inf
4 B 9 9 2 -Inf
5 B 2 9 3 -Inf
6 C 4 4 1 -Inf
Warning messages:
1: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
2: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
3: In max(c(4, 9, 2)[-1:3]) :
no non-missing arguments to max; returning -Inf
What is the most {readable, concise, efficient} way to get this output in dplyr? Any insight into why my attempt using row_number() doesn't work would also be much appreciated. Thanks for the help.
4 Solutions
#1
5
You could try:
df %>%
  group_by(g) %>%
  arrange(desc(x)) %>%
  mutate(max = ifelse(x == max(x), x[2], max(x)))
Which gives:
#Source: local data frame [6 x 3]
#Groups: g
#
# g x max
#1 A 7 3
#2 A 3 7
#3 B 9 5
#4 B 5 9
#5 B 2 9
#6 C 4 NA
Benchmark
I've benchmarked the solutions posted so far:
df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                 x = sample(1:10, 10e5, replace = TRUE))
library(microbenchmark)
mbm <- microbenchmark(
  steven = df %>%
    group_by(g) %>%
    arrange(desc(x)) %>%
    mutate(max = ifelse(x == max(x), x[2], max(x))),
  eric = df %>%
    group_by(g) %>%
    mutate(x_max = max(x),
           x_max2 = sort(x, decreasing = TRUE)[2],
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
    select(-x_max2),
  arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
  times = 50
)
@Arun's data.table solution is the fastest:
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448 50 b
# eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170 50 c
# arun 44.48639 46.17961 54.65824 47.74142 48.9884 102.3830 50 a
#2
3
Interesting problem. Here's one way using data.table:
require(data.table)
setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
The idea is to order by column x and then, on those ordered indices, group by g. Since the rows within each group are in ascending order of x, the max for the first .N-1 rows is the value at row .N, and the max for row .N is the value at row .N-1.
.N is a special variable that holds the number of observations in each group.
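To make the ordering trick concrete, here is a minimal base R sketch of the same idea for a single group's vector, without data.table. The helper name max_excl_ordered is made up for illustration, and the explicit check for groups with fewer than two values is an assumption about the desired result (NA, as in the question), not something taken from the data.table code above.

```r
# Hypothetical helper: for one group's values, return for each element
# the maximum of the *other* elements, using the sort-order trick.
max_excl_ordered <- function(x) {
  n <- length(x)
  # Assumption: a group with fewer than two values has nothing left to
  # compare against, so report NA (as in the question's desired output).
  if (n < 2L) return(rep(NA_real_, n))
  ord <- order(x)                 # positions of x from smallest to largest
  res <- numeric(n)
  res[ord[-n]] <- x[ord[n]]       # all but the largest row see the largest value
  res[ord[n]]  <- x[ord[n - 1L]]  # the largest row sees the second-largest value
  res
}

max_excl_ordered(c(5, 9, 2))
# [1] 9 5 9
```

Because the result is written back through the order() indices, the output stays in the original row order even though the logic works on the sorted values.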
I'll leave it to you and/or the dplyr experts to translate this (or answer with another approach).
#3
2
This is the best I've come up with so far. Not sure if there's a better way.
df %>%
  group_by(g) %>%
  mutate(x_max = max(x),
         x_max2 = sort(x, decreasing = TRUE)[2],
         x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
  select(-x_max2)
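The same top-two logic can be packaged as a standalone function and applied per group. Below is a minimal sketch in base R using ave(), so that it runs without dplyr; the function name max_excl_top2 is hypothetical, and the data frame is rebuilt from the question's example.

```r
# Example data from the question, rebuilt with base R.
df <- data.frame(g = c("A", "A", "B", "B", "B", "C"),
                 x = c(7, 3, 5, 9, 2, 4))

# Hypothetical helper: rows holding the group maximum get the second-largest
# value; every other row gets the maximum itself.
max_excl_top2 <- function(x) {
  top2 <- sort(x, decreasing = TRUE)[1:2]  # top2[2] is NA for a single value
  ifelse(x == top2[1], top2[2], top2[1])
}

# ave() applies the helper within each level of g and keeps the row order.
df$x_max_exclude <- ave(df$x, df$g, FUN = max_excl_top2)
df$x_max_exclude
# [1]  3  7  9  5  9 NA
```

If the maximum occurs more than once in a group, the two largest values are equal, so every row in that group gets the maximum, which is what excluding a single row should give.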
#4
0
Another way with a functional:
df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x))
Source: local data frame [6 x 3]
Groups: g
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
We write a function called max_exclude that does the operation that you describe.
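max_exclude <- function(v) {
  # Preallocate the result instead of growing it inside the loop.
  res <- numeric(length(v))
  for (i in seq_along(v)) {
    # Max of every element except the i-th; for a single-element group
    # max() warns and returns -Inf, so the warning is suppressed here.
    res[i] <- suppressWarnings(max(v[-i]))
  }
  # Replace the -Inf from single-element groups with NA.
  res <- ifelse(!is.finite(res), NA, res)
  as.numeric(res)
}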
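The v[-i] indexing in the loop is also where the row_number() attempt in the question goes wrong. Inside a grouped mutate(), r is the whole vector of row numbers for the group rather than a single number, so x[-r] drops every element and max() is left with an empty vector. A small base R sketch of the difference, with r built by seq_along() as a stand-in for row_number():

```r
x <- c(5, 9, 2)       # the x values of group B
r <- seq_along(x)     # stand-in for row_number() within the group: 1 2 3

x[-r]                 # negative indexing with all positions removes everything
# numeric(0)

# max() of an empty vector returns -Inf with a warning, hence the -Inf column.
suppressWarnings(max(x[-r]))
# [1] -Inf

# Dropping one position at a time, as max_exclude() does, gives the intended result.
sapply(r, function(i) max(x[-i]))
# [1] 9 5 9
```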
It works with base R too:
df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude))
Source: local data frame [6 x 3]
g x x_max_exclude
1 A 7 3
2 A 3 7
3 B 5 9
4 B 9 5
5 B 2 9
6 C 4 NA
Benchmark
Here's a lesson, kids: beware of for loops!
big.df <- data.frame(g = rep(LETTERS[1:4], each = 1e3), x = sample(10, 4e3, replace = TRUE))
microbenchmark(
  plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)),
  plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN = max_exclude)),
  StevenB = (big.df %>%
    group_by(g) %>%
    mutate(max = ifelse(row_number(desc(x)) == 1, x[row_number(desc(x)) == 2], max(x)))
  ),
  Eric = df %>%
    group_by(g) %>%
    mutate(x_max = max(x),
           x_max2 = sort(x, decreasing = TRUE)[2],
           x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
    select(-x_max2),
  Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
)
Unit: milliseconds
expr min lq mean median uq max neval
plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166 100
plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294 100
StevenB 4.436973 4.699226 5.207548 4.931484 5.364242 11.893306 100
Eric 7.233057 8.034092 8.921904 8.414720 9.060488 15.946281 100
Arun 1.789097 2.037235 2.410915 2.226988 2.423638 9.326272 100