Can mutate be used when the mutation is conditional (depending on the values of certain columns)?
This example shows what I mean.
df <- structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame")
  a b c d e f
1 1 1 6 6 1 2
2 3 3 3 2 2 3
3 4 4 6 4 4 4
4 6 2 5 5 5 2
5 3 6 3 3 6 2
6 2 7 6 7 7 7
7 5 2 5 2 6 5
8 1 6 3 6 3 2
I was hoping to solve this with the dplyr package by creating a new column g (and yes, I know this code does not work, but I think it makes the purpose clear):
library(dplyr)
df <- mutate(df, if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
if (a == 0 | a == 1 | a == 4 | a == 3 | c == 4){g = 3})
The result I am looking for in this particular example is:
  a b c d e f  g
1 1 1 6 6 1 2  3
2 3 3 3 2 2 3  3
3 4 4 6 4 4 4  3
4 6 2 5 5 5 2 NA
5 3 6 3 3 6 2 NA
6 2 7 6 7 7 7  2
7 5 2 5 2 6 5  2
8 1 6 3 6 3 2  3
Does anyone have an idea how to do this in dplyr? This data frame is just an example; the data frames I am dealing with are much larger. I tried dplyr because of its speed, but perhaps there are other, better ways to handle this problem?
5 Answers
#1
105
Use ifelse
df %>%
mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
ifelse(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA)))
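As a quick check, the nested ifelse can be run end to end on the question's data. This is a minimal sketch that assumes dplyr is installed; it rebuilds df with data.frame() instead of the dput output and writes the a == ... chains with %in%, which is equivalent here. Note that with the conditions exactly as written, row 5 (a == 3) receives 3 rather than NA.

```r
library(dplyr)

# Rebuild the question's data frame (same values as the dput output).
df <- data.frame(
  a = c(1, 3, 4, 6, 3, 2, 5, 1),
  b = c(1, 3, 4, 2, 6, 7, 2, 6),
  c = c(6, 3, 6, 5, 3, 6, 5, 3),
  d = c(6, 2, 4, 5, 3, 7, 2, 6),
  e = c(1, 2, 4, 5, 6, 7, 6, 3),
  f = c(2, 3, 4, 2, 2, 7, 5, 2)
)

# The outer condition is checked first; rows matching neither become NA.
res <- df %>%
  mutate(g = ifelse(a %in% c(2, 5, 7) | (a == 1 & b == 4), 2,
             ifelse(a %in% c(0, 1, 3, 4) | c == 4, 3, NA)))

res$g
```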
Added - if_else: Note that dplyr 0.5 defines an if_else function, so an alternative is to replace ifelse with if_else. However, if_else is stricter than ifelse (both legs of the condition must have the same type), so the NA in that case has to be replaced with NA_real_.
df %>%
mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
if_else(a == 0 | a == 1 | a == 4 | a == 3 | c == 4, 3, NA_real_)))
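The type strictness is easy to see by mixing a string and a number in the two legs. A minimal sketch, assuming dplyr is installed: base ifelse silently coerces the result to character, whereas if_else signals an error, which is captured here with tryCatch. Whether a bare NA is accepted next to a number may depend on the dplyr version, so NA_real_ as used above is the safe choice.

```r
library(dplyr)

x <- c(1, 5)

# Base R coerces both legs to a common type (character here).
base_res <- ifelse(x > 2, "big", 0)

# if_else refuses to mix character and numeric legs; capture the error.
strict_res <- tryCatch(
  if_else(x > 2, "big", 0),
  error = function(e) e
)

base_res                       # a character vector
inherits(strict_res, "error")  # TRUE when if_else rejected the call
```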
Added - case_when: Since this question was posted, dplyr has added case_when, so another alternative would be:
df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3,
TRUE ~ NA_real_))
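case_when uses the first formula whose left-hand side is TRUE, so the order of the formulas matters whenever a row can satisfy more than one condition (for example a == 1 and b == 4). A small sketch, assuming dplyr is installed; toy is a hypothetical three-row data frame, not the question's data:

```r
library(dplyr)

# Hypothetical mini data set: row 1 satisfies both conditions.
toy <- data.frame(a = c(1, 1, 6), b = c(4, 1, 1))

res <- toy %>%
  mutate(g = case_when(
    a == 1 & b == 4 ~ 2,        # checked first, so row 1 gets 2
    a == 1          ~ 3,        # row 2 only matches here
    TRUE            ~ NA_real_  # everything else
  ))

res$g
```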
#2
45
Since you ask for other, better ways to handle the problem, here's another way using data.table:
require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
Note that the order of the conditional statements is reversed to get g right: the second assignment overwrites the first wherever both conditions hold. No copy of g is made, even during the second assignment; it is replaced in place.
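The effect of the ordering can be seen on a tiny table in which some rows satisfy both conditions. This is only a sketch, assuming data.table is installed; the four-row table is hypothetical and not the question's data:

```r
library(data.table)

# Hypothetical mini table: rows 1 and 3 satisfy both conditions.
DT <- data.table(a = c(1, 1, 2, 6), b = c(4, 1, 1, 1), c = c(1, 1, 4, 1))

# Lower-priority rule first ...
DT[a %in% c(0, 1, 3, 4) | c == 4, g := 3L]
# ... then the higher-priority rule overwrites it where both apply.
DT[a %in% c(2, 5, 7) | (a == 1 & b == 4), g := 2L]

DT$g   # row 4 matches neither rule and stays NA
```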
On larger data this should perform better than nested ifelse calls, because ifelse can end up evaluating both the 'yes' and the 'no' branch for the whole vector; nesting also gets harder to read and maintain, IMHO.
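That ifelse computes both branches over the full vector can be checked in base R by recording which branch expressions get evaluated. A small sketch; yes_branch and no_branch are hypothetical helpers introduced only for this illustration:

```r
calls <- character(0)

# Each helper records that it ran, then returns a full-length vector.
yes_branch <- function(x) { calls <<- c(calls, "yes"); x * 2 }
no_branch  <- function(x) { calls <<- c(calls, "no");  x * 3 }

x <- c(1, 2, 3, 4)

# The test is mixed (FALSE, FALSE, TRUE, TRUE), so both helpers run on all of x.
res <- ifelse(x > 2, yes_branch(x), no_branch(x))

res
calls
```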
Here's a benchmark on relatively bigger data:
# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT)
# [1] 10000000 6
DF <- as.data.frame(DT)
DT_fun <- function(DT) {
    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
               ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
BASE_fun <- function(DF) { # R v3.1.0
    transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
                  ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
system.time(ans1 <- DT_fun(DT))
# user system elapsed
# 2.659 0.420 3.107
system.time(ans2 <- DPLYR_fun(DF))
# user system elapsed
# 11.822 1.075 12.976
system.time(ans3 <- BASE_fun(DF))
# user system elapsed
# 11.676 1.530 13.319
identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE
identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE
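One thing to keep in mind when reading the timings: := changes DT by reference, so DT itself carries the column g after the first call to DT_fun. A minimal sketch of that behaviour, assuming data.table is installed; add_g and the three-row table are hypothetical and used only for illustration, while copy() is data.table's function for getting an independent table:

```r
library(data.table)

DT <- data.table(a = c(1, 2, 6))

# Hypothetical helper that adds g by reference, like DT_fun above.
add_g <- function(dt) dt[a == 1, g := 3L]

invisible(add_g(copy(DT)))            # works on a copy; DT is untouched
has_g_after_copy <- "g" %in% names(DT)

invisible(add_g(DT))                  # modifies DT itself
has_g_after_ref <- "g" %in% names(DT)

c(has_g_after_copy, has_g_after_ref)
```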
Not sure if this is an alternative you'd asked for, but I hope it helps.
#3
27
dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic::derivedFactor, as you cannot access variables in the standard dplyr way and need to declare the mode of NA, but it is considerably faster than mosaic::derivedFactor.
df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
a %in% c(0,1,3,4) | c == 4 ~ 3L,
TRUE~as.integer(NA)))
EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).
Benchmark: reusing the functions from Arun's post, with a reduced sample size:
require(data.table)
require(mosaic)
require(dplyr)
require(microbenchmark)
DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)
DPLYR_case_when <- function(DF) {
    DF %>%
        mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L,
                             a %in% c(0,1,3,4) | c==4 ~ 3L,
                             TRUE~as.integer(NA)))
}
DT_fun <- function(DT) {
    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}
DPLYR_fun <- function(DF) {
    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L,
               ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}
mosa_fun <- function(DF) {
    mutate(DF, g = derivedFactor(
        "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
        "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
        .method = "first",
        .default = NA
    ))
}
microbenchmark(
    DT_fun(DT),
    DPLYR_fun(DF),
    DPLYR_case_when(DF),
    mosa_fun(DF),
    times=20
)
This gives:
                expr        min         lq       mean     median         uq        max neval
          DT_fun(DT)   1.503589   1.626971   2.054825   1.755860   2.292157   3.426192    20
       DPLYR_fun(DF)   2.420798   2.596476   3.617092   3.484567   4.184260   6.235367    20
 DPLYR_case_when(DF)   2.153481   2.252134   6.124249   2.365763   3.119575  72.344114    20
        mosa_fun(DF) 396.344113 407.649356 413.743179 412.412634 416.515742 459.974969    20
#4
13
The derivedFactor function from the mosaic package seems to be designed to handle this. Using this example, it would look like:
library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
    "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
    "3" = (a == 0 | a == 1 | a == 4 | a == 3 | c == 4),
    .method = "first",
    .default = NA
))
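(If you want the result to be numeric instead of a factor, convert it via character, i.e. wrap derivedFactor in as.numeric(as.character(...)). Calling as.numeric directly on a factor returns the internal level codes, here 1 and 2, rather than the labels 2 and 3.)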
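A caveat when turning a factor with numeric-looking labels into numbers: as.numeric on a factor returns the internal level codes, not the labels, so the conversion has to go through as.character. A base R sketch; the factor f is a hypothetical stand-in for a result with the levels "2" and "3":

```r
# A factor with numeric-looking labels (stand-in for the g column).
f <- factor(c("3", "2", NA, "3"), levels = c("2", "3"))

codes  <- as.numeric(f)                # internal level codes
labels <- as.numeric(as.character(f))  # the labels read as numbers

codes
labels
```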
derivedFactor can be used for an arbitrary number of conditionals, too.
#5
6
case_when is now a pretty clean implementation of the SQL-style CASE WHEN:
structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4,
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4,
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4,
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA,
8L), class = "data.frame") -> df
df %>%
mutate( g = case_when(
a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 ) ~ 2,
a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
))
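Rows that match none of the formulas get NA, which is why no explicit TRUE ~ NA_real_ fallback appears in the call above. A small sketch, assuming dplyr is installed; toy is a hypothetical three-row data frame whose last row matches neither condition:

```r
library(dplyr)

# Hypothetical data: row 1 hits the first rule, row 2 the second, row 3 none.
toy <- data.frame(a = c(2, 4, 6), b = c(1, 1, 1), c = c(1, 1, 1))

res <- toy %>%
  mutate(g = case_when(
    a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
    a == 0 | a == 1 | a == 4 | a == 3 | c == 4 ~ 3
  ))

res$g         # the unmatched row is NA
anyNA(res$g)
```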
Using dplyr 0.7.4
The manual: http://dplyr.tidyverse.org/reference/case_when.html