I have an integer vector a
我有一个整数向量a
a=function(l) as.integer(runif(l,1,600))
a(100)
[1] 414 476 6 58 74 76 45 359 482 340 103 575 494 323 74 347 157 503 385 518 547 192 149 222 152 67 497 588 388 140 457 429 353
[34] 484 91 310 394 122 302 158 405 43 300 439 173 375 218 357 98 196 260 588 499 230 22 369 36 291 221 358 296 206 96 439 423 281
[67] 581 127 178 330 403 91 297 341 280 164 442 114 234 36 257 307 320 307 222 53 327 394 467 480 323 97 109 564 258 2 355 253 596
[100] 215
and an integer matrix B
和整数矩阵B.
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
B(10)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 250 411 181 345 4 519 167 395 130 388
[2,] 383 377 555 304 119 317 586 351 136 528
[3,] 238 262 513 476 579 145 461 191 262 302
[4,] 428 467 217 590 50 171 450 189 140 158
[5,] 178 14 31 148 285 365 515 64 166 584
and I would like to make a new boolean l x c
matrix that shows whether or not each vector element in a
is present in each specific column of matrix B
.
我想创建一个新的布尔l x c矩阵,显示a中的每个向量元素是否存在于矩阵B的每个特定列中。
I tried this with
我试过这个
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
or with
ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))
but neither of these ways to do this are very fast:
但这些方法都不是很快:
a1=a(1000)
B1=B(20000)
system.time(ispresent1(a1,B1))
user system elapsed
76.63 1.08 77.84
system.time(ispresent2(a1,B1))
user system elapsed
218.10 0.00 230.00
(in my application matrix B
would have about 500 000 - 2 million columns)
(在我的应用矩阵B中将有大约500 000 - 200万列)
Probably this is something trivial, but what is the proper way to do this?
可能这是微不足道的,但是这样做的正确方法是什么?
EDIT: proper syntax, as mentioned below, is ispresent = function (a,B) apply(B,2,function(x) { a %in% x } )
, but the Rcpp
solution below is still almost 2 times faster! Thanks for this!
编辑:正确的语法,如下所述,ispresent = function(a,B)apply(B,2,function(x){a%in%x}),但下面的Rcpp解决方案仍然快2倍!谢谢你!
3 个解决方案
#1
8
Rcpp
is awesome for problems like this. It is quite possible that there is some way to do it with data.table
or with an existing function, but with the inline
package it takes almost less time to write it yourself than to find out.
对于像这样的问题,Rcpp非常棒。很可能有一些方法可以使用data.table或使用现有函数来实现它,但使用内联包时,自己编写它所花费的时间几乎比找出它要少。
require(inline)
ispresent.cpp <- cxxfunction(signature(a="integer", B="integer"),
plugin="Rcpp", body='
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
')
set.seed(123)
a1 <- a(1000)
B1 <- B(20000)
system.time(res.cpp <- ispresent.cpp(a1, B1))
user system elapsed 0.442 0.005 0.446
res1 <- ispresent1(a1,B1)
identical(res1, res.cpp)
[1] TRUE
#2
10
After digging a little, and by curiosity about the Rcpp answer of @Backlin I did write a benchmark of orignal solution and our two solutions:
在挖掘了一点之后,由于对@Backlin的Rcpp答案的好奇,我确实写了一个orignal解决方案的基准和我们的两个解决方案:
I had to change a little Backlin's function as inline didn't work on my windows box (sorry if I missed something with it, let me know if there's something to adapt)
我不得不改变一点Backlin的功能,因为内联在我的Windows框上没有用(对不起,如果我错过了它的东西,让我知道是否有适应的东西)
Code used:
set.seed(123) # Fix the generator
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
a1=a(1000)
B1=B(20000)
tensibai <- function(v,m) {
apply(m,2,function(x) { v %in% x })
}
library(Rcpp)
cppFunction("LogicalMatrix backlin(IntegerVector a,IntegerMatrix B) {
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
}")
Validation:
> identical(ispresent1(a1,B1),tensibai(a1,B1))
[1] TRUE
> identical(ispresent1(a1,B1),backlin(a1,B1))
[1] TRUE
Benchmark:
> library(microbenchmark)
> microbenchmark(ispresent1(a1,B1),tensibai(a1,B1),backlin(a1,B1),times=3)
Unit: milliseconds
expr min lq mean median uq max neval
ispresent1(a1, B1) 36358.4633 36683.0932 37312.0568 37007.7231 37788.8536 38569.9840 3
tensibai(a1, B1) 603.6323 645.7884 802.0970 687.9445 901.3294 1114.7144 3
backlin(a1, B1) 471.5052 506.2873 528.3476 541.0694 556.7689 572.4684 3
Backlin's solution is slightly faster, proving again Rcpp is a good choice if you know cpp at first :)
Backlin的解决方案稍快一点,再次证明如果你先了解cpp,Rcpp是个不错的选择:)
#3
-1
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))
ispresent3<-function(a,B){
tf<-matrix((B %in% a),nrow=5)
sapply(1:ncol(tf),function(x) a %in% B[,x][tf[,x]])
}
a1=a(1000)
B1=B(20000)
> system.time(ispresent1(a1,B1))
user system elapsed
29.91 0.48 30.44
> system.time(ispresent2(a1,B1))
user system elapsed
89.65 0.15 89.83
> system.time(ispresent3(a1,B1))
user system elapsed
0.83 0.00 0.86
res1<-ispresent1(a1,B1)
res3<-ispresent3(a1,B1)
> identical(res1,res3)
[1] TRUE
#1
8
Rcpp
is awesome for problems like this. It is quite possible that there is some way to do it with data.table
or with an existing function, but with the inline
package it takes almost less time to write it yourself than to find out.
对于像这样的问题,Rcpp非常棒。很可能有一些方法可以使用data.table或使用现有函数来实现它,但使用内联包时,自己编写它所花费的时间几乎比找出它要少。
require(inline)
ispresent.cpp <- cxxfunction(signature(a="integer", B="integer"),
plugin="Rcpp", body='
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
')
set.seed(123)
a1 <- a(1000)
B1 <- B(20000)
system.time(res.cpp <- ispresent.cpp(a1, B1))
user system elapsed 0.442 0.005 0.446
res1 <- ispresent1(a1,B1)
identical(res1, res.cpp)
[1] TRUE
#2
10
After digging a little, and by curiosity about the Rcpp answer of @Backlin I did write a benchmark of orignal solution and our two solutions:
在挖掘了一点之后,由于对@Backlin的Rcpp答案的好奇,我确实写了一个orignal解决方案的基准和我们的两个解决方案:
I had to change a little Backlin's function as inline didn't work on my windows box (sorry if I missed something with it, let me know if there's something to adapt)
我不得不改变一点Backlin的功能,因为内联在我的Windows框上没有用(对不起,如果我错过了它的东西,让我知道是否有适应的东西)
Code used:
set.seed(123) # Fix the generator
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
a1=a(1000)
B1=B(20000)
tensibai <- function(v,m) {
apply(m,2,function(x) { v %in% x })
}
library(Rcpp)
cppFunction("LogicalMatrix backlin(IntegerVector a,IntegerMatrix B) {
IntegerVector av(a);
IntegerMatrix Bm(B);
int i,j,k;
LogicalMatrix out(av.size(), Bm.ncol());
for(i = 0; i < av.size(); i++){
for(j = 0; j < Bm.ncol(); j++){
for(k = 0; k < Bm.nrow() && av[i] != Bm(k, j); k++);
if(k < Bm.nrow()) out(i, j) = true;
}
}
return(out);
}")
Validation:
> identical(ispresent1(a1,B1),tensibai(a1,B1))
[1] TRUE
> identical(ispresent1(a1,B1),backlin(a1,B1))
[1] TRUE
Benchmark:
> library(microbenchmark)
> microbenchmark(ispresent1(a1,B1),tensibai(a1,B1),backlin(a1,B1),times=3)
Unit: milliseconds
expr min lq mean median uq max neval
ispresent1(a1, B1) 36358.4633 36683.0932 37312.0568 37007.7231 37788.8536 38569.9840 3
tensibai(a1, B1) 603.6323 645.7884 802.0970 687.9445 901.3294 1114.7144 3
backlin(a1, B1) 471.5052 506.2873 528.3476 541.0694 556.7689 572.4684 3
Backlin's solution is slightly faster, proving again Rcpp is a good choice if you know cpp at first :)
Backlin的解决方案稍快一点,再次证明如果你先了解cpp,Rcpp是个不错的选择:)
#3
-1
a=function(l) as.integer(runif(l,1,600))
B=function(c) matrix(as.integer(runif(5*c,1,600)),nrow=5)
ispresent1 = function (a,B) {
out = outer(a, B, FUN = "==" )
apply(out,c(1,3),FUN="any") }
ispresent2 = function (a,B) t(sapply(1:length(a), function(i) apply(B,2,function(x) a[[i]] %in% x)))
ispresent3<-function(a,B){
tf<-matrix((B %in% a),nrow=5)
sapply(1:ncol(tf),function(x) a %in% B[,x][tf[,x]])
}
a1=a(1000)
B1=B(20000)
> system.time(ispresent1(a1,B1))
user system elapsed
29.91 0.48 30.44
> system.time(ispresent2(a1,B1))
user system elapsed
89.65 0.15 89.83
> system.time(ispresent3(a1,B1))
user system elapsed
0.83 0.00 0.86
res1<-ispresent1(a1,B1)
res3<-ispresent3(a1,B1)
> identical(res1,res3)
[1] TRUE