如何在R中有效地实现合并?

时间:2021-10-07 02:23:23

Background

Several SQL languages (I mostly use postgreSQL) have a function called coalesce which returns the first non null column element for each row. This can be very efficient to use when tables have a lot of NULL elements in them.

一些SQL语言(我主要使用postgreSQL)有一个名为coalesce的函数,它为每一行返回第一个非空列元素。当表中有很多空元素时,使用这种方法非常有效。

I encounter this in a lot of scenarios in R as well when dealing with not so structured data which has a lot of NA's in them.

我在R的很多场景中也遇到过这种情况,当我处理不那么结构化的数据时,这些数据中有很多NA。

I have made a naive implementation myself but it is ridiculously slow.

我自己已经做了一个幼稚的实现,但它是荒谬的缓慢。

coalesce <- function(...) {
  apply(cbind(...), 1, function(x) {
          x[which(!is.na(x))[1]]
        })
}

Example

a <- c(1,  2,  NA, 4, NA)
b <- c(NA, NA, NA, 5, 6)
c <- c(7,  8,  NA, 9, 10)
coalesce(a,b,c)
# [1]  1  2 NA  4  6

Question

Is there any efficient way to implement coalesce in R?

有没有有效的方法来实现R中的合并?

7 个解决方案

#1


32  

On my machine, using Reduce gets a 5x performance improvement:

在我的机器上,使用Reduce可以得到5倍的性能改进:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
  list(...))
}

> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100

#2


18  

Looks like coalesce1 is still available

看起来联合ce1仍然可用

coalesce1 <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
    }
    ans
}

which is faster still (but more-or-less a hand re-write of Reduce, so less general)

还是更快(但或多或少地重写了Reduce,所以不那么通用)

> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
               expr     min       lq   median       uq     max neval
  coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
 coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
 coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100

Or for larger data compare

或者更大的数据比较

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

showing that which() can sometimes be effective, even though it implies a second pass through the index.

显示which()有时是有效的,即使它意味着通过索引进行第二次传递。

> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+                coalesce1a(aa, bb, cc),
+                coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
  coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
 coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
  coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10

#3


11  

Using dplyr package:

使用dplyr包:

library(dplyr)
coalesce(a, b, c)
# [1]  1  2 NA  4  6

Benchamark, not as fast as accepted solution:

Benchamark,不像公认的解决方案那样快:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
    list(...))
}

microbenchmark::microbenchmark(
  coalesce(a, b, c),
  coalesce2(a, b, c)
)

# Unit: microseconds
#                expr    min     lq     mean median      uq     max neval cld
#   coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
#  coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a 

But on a larger dataset, it is comparable:

但在更大的数据集中,它是可比较的:

aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)

microbenchmark::microbenchmark(
  coalesce(aa, bb, cc),
  coalesce2(aa, bb, cc))

# Unit: milliseconds
#                   expr      min       lq     mean   median       uq      max neval cld
#   coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
#  coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a

#4


9  

I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not fastest. It will also work for vectors of different length, and has a special treatment for vectors of length one:

我有一个现成的实现,叫做coalesce。在我的错包里。它看起来很有竞争力,但不是最快的。它也适用于不同长度的向量,对长度为1的向量有特殊处理:

                    expr        min          lq      median          uq         max neval
    coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
   coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
  coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
   coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
 coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10

Here's the code:

这是代码:

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.

当然,正如Kevin指出的那样,Rcpp解决方案的速度可能会更快。

#5


1  

Here is my solution:

这是我的解决方案:

coalesce <- function(x){ y <- head( x[is.na(x) == F] , 1) return(y) } It returns first vaule which is not NA and it works on data.table, for example if you want to use coalesce on few columns and these column names are in vector of strings:

合并<- function(x){y <- head(x[is. NA (x) = F], 1) return(y例如,如果你想在一些列上使用合并,而这些列名是字符串向量:

column_names <- c("col1", "col2", "col3")

<- c("col1", "col2", "col3")

how to use:

如何使用:

ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]

排序[,coalesce_column:= coalesce(mget(column_names))], by = 1:nrow(rank)]

#6


1  

Another apply method, with mapply.

另一种应用方法是mapply。

mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
[1]  1  2 NA  4  6

This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.

如果存在多个非na值,则选择第一个非na值。最后一个非缺失元素可以使用tail来选择。

Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.

也许可以通过使用裸骨.mapply函数来提高速度,这看起来有点不同。

unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
               dots=list(a, b, c), MoreArgs=NULL))
[1]  1  2 NA  4  6

.mapplydiffers in important ways from its non-dotted cousin.

········································

  • it returns a list (like Map) and so must be wrapped in some function like unlist or c to return a vector.
  • 它返回一个列表(如Map),因此必须用unlist或c之类的函数包装才能返回一个向量。
  • the set of arguments to be fed in parallel to the function in FUN must be given in a list to the dots argument.
  • 必须在dots参数的列表中提供与函数并行的参数集。
  • Finally, mapply, the moreArgs argument does not have a default, so must explicitly be fed NULL.
  • 最后,mapply, moreArgs参数没有默认值,因此必须显式地为NULL。

#7


0  

A very simple solution is to use the ifelse function from the base package:

一个非常简单的解决方案是使用基本包中的ifelse函数:

coalesce3 <- function(x, y) {

    ifelse(is.na(x), y, x)
}

Although it appears to be slower than coalesce2 above:

虽然看起来比上面的合并慢:

test <- function(a, b, func) {

    for (i in 1:10000) {

        func(a, b)
    }
}

system.time(test(a, b, coalesce2))
user  system elapsed 
0.11    0.00    0.10 

system.time(test(a, b, coalesce3))
user  system elapsed 
0.16    0.00    0.15 

You can use Reduce to make it work for an arbitrary number of vectors:

你可以用Reduce使它对任意数量的向量成立:

coalesce4 <- function(...) {

    Reduce(coalesce3, list(...))
}

#1


32  

On my machine, using Reduce gets a 5x performance improvement:

在我的机器上,使用Reduce可以得到5倍的性能改进:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
  list(...))
}

> microbenchmark(coalesce(a,b,c),coalesce2(a,b,c))
Unit: microseconds
               expr    min       lq   median       uq     max neval
  coalesce(a, b, c) 97.669 100.7950 102.0120 103.0505 243.438   100
 coalesce2(a, b, c) 19.601  21.4055  22.8835  23.8315  45.419   100

#2


18  

Looks like coalesce1 is still available

看起来联合ce1仍然可用

coalesce1 <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- is.na(ans)
        ans[i] <- elt[i]
    }
    ans
}

which is faster still (but more-or-less a hand re-write of Reduce, so less general)

还是更快(但或多或少地重写了Reduce,所以不那么通用)

> identical(coalesce(a, b, c), coalesce1(a, b, c))
[1] TRUE
> microbenchmark(coalesce(a,b,c), coalesce1(a, b, c), coalesce2(a,b,c))
Unit: microseconds
               expr     min       lq   median       uq     max neval
  coalesce(a, b, c) 336.266 341.6385 344.7320 355.4935 538.348   100
 coalesce1(a, b, c)   8.287   9.4110  10.9515  12.1295  20.940   100
 coalesce2(a, b, c)  37.711  40.1615  42.0885  45.1705  67.258   100

Or for larger data compare

或者更大的数据比较

coalesce1a <- function(...) {
    ans <- ..1
    for (elt in list(...)[-1]) {
        i <- which(is.na(ans))
        ans[i] <- elt[i]
    }
    ans
}

showing that which() can sometimes be effective, even though it implies a second pass through the index.

显示which()有时是有效的,即使它意味着通过索引进行第二次传递。

> aa <- sample(a, 100000, TRUE)
> bb <- sample(b, 100000, TRUE)
> cc <- sample(c, 100000, TRUE)
> microbenchmark(coalesce1(aa, bb, cc),
+                coalesce1a(aa, bb, cc),
+                coalesce2(aa,bb,cc), times=10)
Unit: milliseconds
                   expr       min        lq    median        uq       max neval
  coalesce1(aa, bb, cc) 11.110024 11.137963 11.145723 11.212907 11.270533    10
 coalesce1a(aa, bb, cc)  2.906067  2.953266  2.962729  2.971761  3.452251    10
  coalesce2(aa, bb, cc)  3.080842  3.115607  3.139484  3.166642  3.198977    10

#3


11  

Using dplyr package:

使用dplyr包:

library(dplyr)
coalesce(a, b, c)
# [1]  1  2 NA  4  6

Benchamark, not as fast as accepted solution:

Benchamark,不像公认的解决方案那样快:

coalesce2 <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x},
    list(...))
}

microbenchmark::microbenchmark(
  coalesce(a, b, c),
  coalesce2(a, b, c)
)

# Unit: microseconds
#                expr    min     lq     mean median      uq     max neval cld
#   coalesce(a, b, c) 21.951 24.518 27.28264 25.515 26.9405 126.293   100   b
#  coalesce2(a, b, c)  7.127  8.553  9.68731  9.123  9.6930  27.368   100  a 

But on a larger dataset, it is comparable:

但在更大的数据集中,它是可比较的:

aa <- sample(a, 100000, TRUE)
bb <- sample(b, 100000, TRUE)
cc <- sample(c, 100000, TRUE)

microbenchmark::microbenchmark(
  coalesce(aa, bb, cc),
  coalesce2(aa, bb, cc))

# Unit: milliseconds
#                   expr      min       lq     mean   median       uq      max neval cld
#   coalesce(aa, bb, cc) 1.708511 1.837368 5.468123 3.268492 3.511241 96.99766   100   a
#  coalesce2(aa, bb, cc) 1.474171 1.516506 3.312153 1.957104 3.253240 91.05223   100   a

#4


9  

I have a ready-to-use implementation called coalesce.na in my misc package. It seems to be competitive, but not fastest. It will also work for vectors of different length, and has a special treatment for vectors of length one:

我有一个现成的实现,叫做coalesce。在我的错包里。它看起来很有竞争力,但不是最快的。它也适用于不同长度的向量,对长度为1的向量有特殊处理:

                    expr        min          lq      median          uq         max neval
    coalesce(aa, bb, cc) 990.060402 1030.708466 1067.000698 1083.301986 1280.734389    10
   coalesce1(aa, bb, cc)  11.356584   11.448455   11.804239   12.507659   14.922052    10
  coalesce1a(aa, bb, cc)   2.739395    2.786594    2.852942    3.312728    5.529927    10
   coalesce2(aa, bb, cc)   2.929364    3.041345    3.593424    3.868032    7.838552    10
 coalesce.na(aa, bb, cc)   4.640552    4.691107    4.858385    4.973895    5.676463    10

Here's the code:

这是代码:

coalesce.na <- function(x, ...) {
  x.len <- length(x)
  ly <- list(...)
  for (y in ly) {
    y.len <- length(y)
    if (y.len == 1) {
      x[is.na(x)] <- y
    } else {
      if (x.len %% y.len != 0)
        warning('object length is not a multiple of first object length')
      pos <- which(is.na(x))
      x[pos] <- y[(pos - 1) %% y.len + 1]
    }
  }
  x
}

Of course, as Kevin pointed out, an Rcpp solution might be faster by orders of magnitude.

当然,正如Kevin指出的那样,Rcpp解决方案的速度可能会更快。

#5


1  

Here is my solution:

这是我的解决方案:

coalesce <- function(x){ y <- head( x[is.na(x) == F] , 1) return(y) } It returns first vaule which is not NA and it works on data.table, for example if you want to use coalesce on few columns and these column names are in vector of strings:

合并<- function(x){y <- head(x[is. NA (x) = F], 1) return(y例如,如果你想在一些列上使用合并,而这些列名是字符串向量:

column_names <- c("col1", "col2", "col3")

<- c("col1", "col2", "col3")

how to use:

如何使用:

ranking[, coalesce_column := coalesce( mget(column_names) ), by = 1:nrow(ranking)]

排序[,coalesce_column:= coalesce(mget(column_names))], by = 1:nrow(rank)]

#6


1  

Another apply method, with mapply.

另一种应用方法是mapply。

mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]}, a, b, c)
[1]  1  2 NA  4  6

This selects the first non-NA value if more than one exists. The last non-missing element could be selected using tail.

如果存在多个非na值,则选择第一个非na值。最后一个非缺失元素可以使用tail来选择。

Maybe a bit more speed could be squeezed out of this alternative using the bare bones .mapply function, which looks a little different.

也许可以通过使用裸骨.mapply函数来提高速度,这看起来有点不同。

unlist(.mapply(function(...) {temp <- c(...); temp[!is.na(temp)][1]},
               dots=list(a, b, c), MoreArgs=NULL))
[1]  1  2 NA  4  6

.mapplydiffers in important ways from its non-dotted cousin.

········································

  • it returns a list (like Map) and so must be wrapped in some function like unlist or c to return a vector.
  • 它返回一个列表(如Map),因此必须用unlist或c之类的函数包装才能返回一个向量。
  • the set of arguments to be fed in parallel to the function in FUN must be given in a list to the dots argument.
  • 必须在dots参数的列表中提供与函数并行的参数集。
  • Finally, mapply, the moreArgs argument does not have a default, so must explicitly be fed NULL.
  • 最后,mapply, moreArgs参数没有默认值,因此必须显式地为NULL。

#7


0  

A very simple solution is to use the ifelse function from the base package:

一个非常简单的解决方案是使用基本包中的ifelse函数:

coalesce3 <- function(x, y) {

    ifelse(is.na(x), y, x)
}

Although it appears to be slower than coalesce2 above:

虽然看起来比上面的合并慢:

test <- function(a, b, func) {

    for (i in 1:10000) {

        func(a, b)
    }
}

system.time(test(a, b, coalesce2))
user  system elapsed 
0.11    0.00    0.10 

system.time(test(a, b, coalesce3))
user  system elapsed 
0.16    0.00    0.15 

You can use Reduce to make it work for an arbitrary number of vectors:

你可以用Reduce使它对任意数量的向量成立:

coalesce4 <- function(...) {

    Reduce(coalesce3, list(...))
}