Can the dplyr package be used for conditional mutating?

Time: 2022-03-14 17:42:24

Can mutate be used when the mutation is conditional (depending on the values of certain columns)?

This example helps show what I mean.

df <- structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA, 
8L), class = "data.frame")

  a b c d e f
1 1 1 6 6 1 2
2 3 3 3 2 2 3
3 4 4 6 4 4 4
4 6 2 5 5 5 2
5 3 6 3 3 6 2
6 2 7 6 7 7 7
7 5 2 5 2 6 5
8 1 6 3 6 3 2

I was hoping to find a solution using the dplyr package for creating a new column g (and yes, I know this is not code that should work, but I guess it makes the purpose clear):

  library(dplyr)
 df <- mutate(df, if (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)){g = 2},
        if (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4){g = 3})

For this particular example, the code I am looking for should produce this result:

  a b c d e f  g
1 1 1 6 6 1 2  3
2 3 3 3 2 2 3  3
3 4 4 6 4 4 4  3
4 6 2 5 5 5 2 NA
5 3 6 3 3 6 2  3
6 2 7 6 7 7 7  2
7 5 2 5 2 6 5  2
8 1 6 3 6 3 2  3

Does anyone have an idea how to do this in dplyr? This data frame is just an example; the data frames I am dealing with are much larger. I tried dplyr because of its speed, but perhaps there are other, better ways to handle this problem?

5 Solutions

#1


105  

Use ifelse

df %>%
  mutate(g = ifelse(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               ifelse(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA)))

Added - if_else: Note that dplyr 0.5 defines an if_else function, so an alternative is to replace ifelse with if_else. However, since if_else is stricter than ifelse (both legs of the condition must have the same type), the NA in that case has to be replaced with NA_real_.

df %>%
  mutate(g = if_else(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4), 2,
               if_else(a == 0 | a == 1 | a == 4 | a == 3 |  c == 4, 3, NA_real_)))

Added - case_when: Since this question was posted, dplyr has added case_when, so another alternative would be:

df %>% mutate(g = case_when(a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,
                            a == 0 | a == 1 | a == 4 | a == 3 |  c == 4 ~ 3,
                            TRUE ~ NA_real_))
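
One property of case_when worth spelling out is that the conditions are tried in order and the first one that matches wins, so the rules can be written in priority order. Below is a minimal sketch, assuming dplyr 0.7.0 or later is installed; the three-row data frame `demo` is made up for illustration and is not part of the question's data.

```r
library(dplyr)

# Hypothetical data: row 1 satisfies both rules (a == 1 & b == 4, and a == 1),
# row 2 satisfies only the second rule, row 3 satisfies neither.
demo <- data.frame(a = c(1, 1, 6), b = c(4, 1, 2), c = c(6, 6, 5))

res <- demo %>%
  mutate(g = case_when(
    a == 2 | a == 5 | a == 7 | (a == 1 & b == 4) ~ 2,        # checked first
    a == 0 | a == 1 | a == 4 | a == 3 | c == 4   ~ 3,        # used only if the first rule fails
    TRUE                                         ~ NA_real_  # fallback for unmatched rows
  ))

print(res)
```

Row 1 gets 2 rather than 3 because the first rule is checked before the second.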

#2


45  

Since you ask for other, better ways to handle the problem, here's another way using data.table:

require(data.table) ## 1.9.2+
setDT(df)
df[a %in% c(0,1,3,4) | c == 4, g := 3L]
df[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]

Note that the order of the conditional statements is reversed to get g right: the second assignment overwrites the first wherever both conditions hold, so the higher-priority rule (g = 2) has to run last. No copy of g is made, even during the second assignment; it is replaced in place.
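
The "last assignment wins" ordering can be illustrated without data.table. The following is a rough base R sketch of the same idea using plain logical indexing on the question's columns a, b and c. It does not have data.table's in-place semantics, and the name `c_col` is only an illustrative choice to avoid confusion with the function c().

```r
# Columns a, b, c from the question's example data.
a     <- c(1, 3, 4, 6, 3, 2, 5, 1)
b     <- c(1, 3, 4, 2, 6, 7, 2, 6)
c_col <- c(6, 3, 6, 5, 3, 6, 5, 3)

# Start with all NA, apply the lower-priority rule first,
# then let the higher-priority rule overwrite it where both match.
g <- rep(NA_integer_, length(a))
g[a %in% c(0, 1, 3, 4) | c_col == 4]     <- 3L
g[a %in% c(2, 5, 7) | (a == 1 & b == 4)] <- 2L

print(g)
```

Rows that match neither condition (here only row 4, where a is 6 and c is 5) keep their initial NA.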

On larger data this would perform better than nested ifelse calls, because ifelse can evaluate both the 'yes' and the 'no' cases, and nesting can get harder to read and maintain, IMHO.

Here's a benchmark on relatively bigger data:

# R version 3.1.0
require(data.table) ## 1.9.2
require(dplyr)
DT <- setDT(lapply(1:6, function(x) sample(7, 1e7, TRUE)))
setnames(DT, letters[1:6])
# > dim(DT) 
# [1] 10000000        6
DF <- as.data.frame(DT)

DT_fun <- function(DT) {
    DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
    DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
    mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

BASE_fun <- function(DF) { # R v3.1.0
    transform(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
            ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

system.time(ans1 <- DT_fun(DT))
#   user  system elapsed 
#  2.659   0.420   3.107 

system.time(ans2 <- DPLYR_fun(DF))
#   user  system elapsed 
# 11.822   1.075  12.976 

system.time(ans3 <- BASE_fun(DF))
#   user  system elapsed 
# 11.676   1.530  13.319 

identical(as.data.frame(ans1), as.data.frame(ans2))
# [1] TRUE

identical(as.data.frame(ans1), as.data.frame(ans3))
# [1] TRUE

Not sure if this is an alternative you'd asked for, but I hope it helps.

#3


27  

dplyr now has a function case_when that offers a vectorised if. The syntax is a little strange compared to mosaic::derivedFactor, as you cannot access variables in the standard dplyr way and need to declare the mode of NA, but it is considerably faster than mosaic::derivedFactor.

df %>%
mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                     a %in% c(0,1,3,4) | c == 4 ~ 3L, 
                     TRUE~as.integer(NA)))
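
"Declaring the mode of NA" refers to the fact that a bare NA is logical in R, whereas the other branches above are integers. A small base R sketch of the typed NA constants, which need no extra packages:

```r
# A bare NA is logical; the typed variants carry a specific type.
na_types <- c(
  plain   = typeof(NA),           # "logical"
  integer = typeof(NA_integer_),  # "integer"
  real    = typeof(NA_real_)      # "double"
)
print(na_types)

# as.integer(NA) and NA_integer_ are the same value, so either can be
# used as the fallback when the other branches are 2L and 3L.
same <- identical(as.integer(NA), NA_integer_)
print(same)
```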

EDIT: If you're using dplyr::case_when() from before version 0.7.0 of the package, then you need to precede variable names with '.$' (e.g. write .$a == 1 inside case_when).

Benchmark: reusing the functions from Arun's post and reducing the sample size:

require(data.table) 
require(mosaic) 
require(dplyr)
require(microbenchmark)

DT <- setDT(lapply(1:6, function(x) sample(7, 10000, TRUE)))
setnames(DT, letters[1:6])
DF <- as.data.frame(DT)

DPLYR_case_when <- function(DF) {
  DF %>%
  mutate(g = case_when(a %in% c(2,5,7) | (a==1 & b==4) ~ 2L, 
                       a %in% c(0,1,3,4) | c==4 ~ 3L, 
                       TRUE~as.integer(NA)))
}

DT_fun <- function(DT) {
  DT[(a %in% c(0,1,3,4) | c == 4), g := 3L]
  DT[a %in% c(2,5,7) | (a==1 & b==4), g := 2L]
}

DPLYR_fun <- function(DF) {
  mutate(DF, g = ifelse(a %in% c(2,5,7) | (a==1 & b==4), 2L, 
                    ifelse(a %in% c(0,1,3,4) | c==4, 3L, NA_integer_)))
}

mosa_fun <- function(DF) {
  mutate(DF, g = derivedFactor(
    "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
    "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
    .method = "first",
    .default = NA
  ))
}

microbenchmark(
  DT_fun(DT),
  DPLYR_fun(DF),
  DPLYR_case_when(DF),
  mosa_fun(DF),
  times=20
)

This gives:

               expr        min         lq       mean     median         uq        max neval
         DT_fun(DT)   1.503589   1.626971   2.054825   1.755860   2.292157   3.426192    20
      DPLYR_fun(DF)   2.420798   2.596476   3.617092   3.484567   4.184260   6.235367    20
DPLYR_case_when(DF)   2.153481   2.252134   6.124249   2.365763   3.119575  72.344114    20
       mosa_fun(DF) 396.344113 407.649356 413.743179 412.412634 416.515742 459.974969    20

#4


13  

The derivedFactor function from the mosaic package seems to be designed to handle this. Using this example, it would look like:

library(dplyr)
library(mosaic)
df <- mutate(df, g = derivedFactor(
     "2" = (a == 2 | a == 5 | a == 7 | (a == 1 & b == 4)),
     "3" = (a == 0 | a == 1 | a == 4 | a == 3 |  c == 4),
     .method = "first",
     .default = NA
     ))

(If you want the result to be numeric instead of a factor, wrap derivedFactor in as.numeric(as.character(...)). A bare as.numeric call on a factor returns the internal level codes rather than the labels 2 and 3.)
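
One caveat when turning a factor into numbers in base R: as.numeric applied directly to a factor returns the level codes, not the labels. A short sketch with a hand-built factor, used here as a stand-in for a derivedFactor result:

```r
# Stand-in for a factor with labels "2" and "3" plus a missing value.
f <- factor(c("3", "3", "2", NA))

codes  <- as.numeric(f)                # level codes: 2 2 1 NA
values <- as.numeric(as.character(f))  # label values: 3 3 2 NA

print(codes)
print(values)
```

The levels are sorted as "2", "3", so the label "3" has code 2 and the label "2" has code 1.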

derivedFactor can be used for an arbitrary number of conditionals, too.

#5


6  

case_when is now a pretty clean implementation of the SQL-style CASE WHEN:

structure(list(a = c(1, 3, 4, 6, 3, 2, 5, 1), b = c(1, 3, 4, 
2, 6, 7, 2, 6), c = c(6, 3, 6, 5, 3, 6, 5, 3), d = c(6, 2, 4, 
5, 3, 7, 2, 6), e = c(1, 2, 4, 5, 6, 7, 6, 3), f = c(2, 3, 4, 
2, 2, 7, 5, 2)), .Names = c("a", "b", "c", "d", "e", "f"), row.names = c(NA, 
8L), class = "data.frame") -> df


df %>% 
    mutate( g = case_when(
                a == 2 | a == 5 | a == 7 | (a == 1 & b == 4 )     ~   2,
                a == 0 | a == 1 | a == 4 |  a == 3 | c == 4       ~   3
))

Using dplyr 0.7.4

The manual: http://dplyr.tidyverse.org/reference/case_when.html
