I wish to implement a simple split-apply-combine
routine in Rcpp
where a dataset (matrix) is split up into groups, and then the groupwise column sums are returned. This is a procedure easily implemented in R
, but often takes quite some time. I have managed to implement an Rcpp
solution that beats the performance of R
, but I wonder if I can further improve upon it. To illustrate, here some code, first for the use of R
:
我希望在Rcpp中实现一个简单的split-apply-combine例程,其中数据集(矩阵)被分成组,然后返回groupwise列的总和。这是一个在R中很容易实现的过程,但通常需要相当长的时间。我已经设法实现了一个比R的性能更好的Rcpp解决方案,但我想知道我是否可以进一步改进它。为了说明,这里有一些代码,首先是使用R:
n <- 50000
k <- 50
set.seed(42)
X <- matrix(rnorm(n*k), nrow=n)
g=rep(1:8,length.out=n )
use.for <- function(mat, ind){
sums <- matrix(NA, nrow=length(unique(ind)), ncol=ncol(mat))
for(i in seq_along(unique(ind))){
sums[i,] <- colSums(mat[ind==i,])
}
return(sums)
}
use.apply <- function(mat, ind){
apply(mat,2, function(x) tapply(x, ind, sum))
}
use.dt <- function(mat, ind){ # based on Roland's answer
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
It turns out that the for
-loops is actually quite fast and is the easiest (for me) to implement with Rcpp
. It works by creating a submatrix for each group and then calling colSums
on the matrix. This is implemented using RcppArmadillo
:
事实证明for循环实际上非常快,并且对于Rcpp来说是最容易实现的(对我来说)。它的工作原理是为每个组创建一个子矩阵,然后在矩阵上调用colSums。这是使用RcppArmadillo实现的:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
using namespace Rcpp;
using namespace arma;
// [[Rcpp::export]]
arma::mat use_arma(arma::mat X, arma::colvec G){
arma::colvec gr = arma::unique(G);
int gr_n = gr.n_rows;
int ncol = X.n_cols;
arma::mat out = zeros(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
arma::uvec subvec = find(G==g_id);
arma::mat submat = X.rows(subvec);
arma::rowvec res = sum(submat,0);
out.row(g) = res;
}
return out;
}
However, based on answers to this question, I learned that creating copies is expensive in C++
(just as in R
), but that loops are not as bad as they are in R
. Since the arma
-solution relies on creating matrixes (submat
in the code) for each group, my guess is that avoiding this will speed up the process even further. Hence, here a second implementation based on Rcpp
only using a loop:
然而,根据这个问题的答案,我了解到在C ++中创建副本的成本很高(就像在R中一样),但是循环并不像在R中那样糟糕。因为arma-solution依赖于创建矩阵(submat in对于每个组,我的猜测是避免这种情况会进一步加速这个过程。因此,这里仅使用循环基于Rcpp的第二个实现:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
out(g,j) = total;
}
}
return out;
}
Benchmarking these solutions, including the use_dt
version provided by @Roland (my previous version discriminted unfairly against data.table
), as well as the dplyr
-solution suggested by @beginneR, yields the following:
对这些解决方案进行基准测试,包括@Roland提供的use_dt版本(我以前的版本对data.table不公平地歧视),以及@beginneR建议的dplyr解决方案,产生以下结果:
library(rbenchmark)
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dplyr(X,g), use_arma(X,g), use_Rcpp(X,g),
+ columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)
test replications elapsed relative
# 5 use_arma(X, g) 1000 29.65 1.000
# 4 use.dplyr(X, g) 1000 42.05 1.418
# 3 use.dt(X, g) 1000 56.94 1.920
# 1 use.for(X, g) 1000 60.97 2.056
# 6 use_Rcpp(X, g) 1000 113.96 3.844
# 2 use.apply(X, g) 1000 301.14 10.156
My intution (use_Rcpp
better than use_arma
) did not turn out right. Having said that, I guess that the line if (G(i) != g_id) continue;
in my use_Rcpp
function slows down everything. I am happy to learn about alternatives to set this up.
我的直觉(use_Rcpp比use_arma更好)并没有成功。话虽如此,我想那条线if(G(i)!= g_id)继续;在我的use_Rcpp函数中减慢了一切。我很高兴了解有关替代方案的建议。
I am happy that I have achieved the same task in half the time it takes R
to do it, but maybe the several Rcpp is much faster than R
-examples have messed with my expectations, and I am wondering if I can speed this up even more. Does anyone have an idea? I also welcome any programming / coding comments in general since I am relatively new to Rcpp
and C++
.
我很高兴我在R的一半时间内完成了相同的任务,但是也许几个Rcpp比R-examples更快地与我的期望相提并论,我想知道我是否可以加速这个更多。有没有人有想法?我也欢迎任何编程/编码注释,因为我对Rcpp和C ++相对较新。
3 个解决方案
#1
1
Maybe you're looking for (the oddly named) rowsum
也许你正在寻找(奇怪的名字)rowum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
#2
3
No, it's not the for
loop that you need to beat:
不,这不是你需要击败的for循环:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
#3
2
Here's my critiques with in-line comments for your Rcpp solution.
以下是我对您的Rcpp解决方案的内联评论的批评。
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarily here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrice row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}
#1
1
Maybe you're looking for (the oddly named) rowsum
也许你正在寻找(奇怪的名字)rowum
library(microbenchmark)
use.rowsum = rowsum
and
> all.equal(use.for(X, g), use.rowsum(X, g), check.attributes=FALSE)
[1] TRUE
> microbenchmark(use.for(X, g), use.rowsum(X, g), times=5)
Unit: milliseconds
expr min lq median uq max neval
use.for(X, g) 126.92876 127.19027 127.51403 127.64082 128.06579 5
use.rowsum(X, g) 10.56727 10.93942 11.01106 11.38697 11.38918 5
#2
3
No, it's not the for
loop that you need to beat:
不,这不是你需要击败的for循环:
library(data.table)
#it doesn't seem fair to include calls to library in benchmarks
#you need to do that only once in your session after all
use.dt2 <- function(mat, ind){
dt <- as.data.table(mat)
dt[, cvar := ind]
dt2 <- dt[,lapply(.SD, sum), by=cvar]
as.matrix(dt2[,cvar:=NULL])
}
all.equal(use.dt(X,g), use.dt2(X,g))
#TRUE
benchmark(use.for(X,g), use.apply(X,g), use.dt(X,g), use.dt2(X,g),
columns = c("test", "replications", "elapsed", "relative"),
order = "relative", replications = 50)
# test replications elapsed relative
#4 use.dt2(X, g) 50 3.12 1.000
#1 use.for(X, g) 50 4.67 1.497
#3 use.dt(X, g) 50 7.53 2.413
#2 use.apply(X, g) 50 17.46 5.596
#3
2
Here's my critiques with in-line comments for your Rcpp solution.
以下是我对您的Rcpp解决方案的内联评论的批评。
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericMatrix use_Rcpp(NumericMatrix X, IntegerVector G){
// Rcpp has a sort_unique() function, which combines the
// sort and unique steps into one, and is often faster than
// performing the operations separately. Try `sort_unique(G)`
IntegerVector gr = unique(G);
std::sort(gr.begin(), gr.end());
int gr_n = gr.size();
int nrow = X.nrow(), ncol = X.ncol();
// This constructor zero-initializes memory (kind of like
// making a copy). You should use:
//
// NumericMatrix out = no_init(gr_n, ncol)
//
// to ensure the memory is allocated, but not zeroed.
//
// EDIT: We don't have no_init for matrices right now, but you can hack
// around that with:
//
// NumericMatrix out(Rf_allocMatrix(REALSXP, gr_n, ncol));
NumericMatrix out(gr_n, ncol);
for(int g=0; g<gr_n; g++){
// subsetting with operator[] is cheaper, so use gr[g] when
// you can be sure bounds checks are not necessary
int g_id = gr(g);
for (int j = 0; j < ncol; j++) {
double total = 0;
for (int i = 0; i < nrow; i++) {
// similarily here
if (G(i) != g_id) continue; // not sure how else to do this
total += X(i, j);
}
// IIUC, you are filling the matrice row-wise. This is slower as
// R matrices are stored in column-major format, and so filling
// matrices column-wise will be faster.
out(g,j) = total;
}
}
return out;
}