Elementwise matrix multiplication: R versus Rcpp (How to speed this code up?)

Time: 2021-05-28 21:20:35

I am new to C++ programming (using Rcpp for seamless integration into R), and I would appreciate some advice on how to speed up some calculations.

Consider the following example:

 testmat <- matrix(1:9, nrow=3)
 testvec <- 1:3

 testmat*testvec
   #      [,1] [,2] [,3]
   #[1,]    1    4    7
   #[2,]    4   10   16
   #[3,]    9   18   27

Here, R recycled testvec so that, loosely speaking, testvec "became" a matrix with the same dimensions as testmat for the purpose of this multiplication, and the Hadamard product is returned. I wish to implement this behavior using Rcpp; that is, I want each element of the i-th row of testmat to be multiplied by the i-th element of testvec. My benchmarks tell me that my implementations are extremely slow, and I would appreciate advice on how to speed them up. Here is my code:

First, using Eigen:

#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]

using namespace Rcpp;
using namespace Eigen;

// [[Rcpp::export]]
NumericMatrix E_matvecprod_elwise(NumericMatrix Xs, NumericVector ys){
  Map<MatrixXd> X(as<Map<MatrixXd> >(Xs));
  Map<VectorXd> y(as<Map<VectorXd> >(ys));

  int k = X.cols();
  int n = X.rows();
  MatrixXd Y(n,k) ;

  // here, I emulate R's recycling. I did not find an easier way of doing this. Any hint appreciated.      
  for(int i = 0; i < k; ++i) {
   Y.col(i) = y;
  }

  MatrixXd out = X.cwiseProduct(Y);
  return wrap(out);
}

Here is my implementation using Armadillo (adjusted to follow Dirk's example; see his answer below):

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
  int k = X.n_cols ;
  arma::mat Y = repmat(y, 1, k) ;
  arma::mat out = X % Y;
  return out;
}

Benchmarking these solutions shows that both the Eigen and the Armadillo versions are about twice as slow as R. Is there a way to speed these computations up, or at least to get as fast as R? Are there more elegant ways of setting this up? Any advice is appreciated and welcome. (I also encourage tangential remarks about programming style in general, as I am new to Rcpp / C++.)

Here are some reproducible benchmarks:

# for comparison, define R function:
R_matvecprod_elwise <- function(mat, vec) mat*vec

library(rbenchmark)

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

benchmark(R_matvecprod_elwise(X, e), A_matvecprod_elwise(X, e), E_matvecprod_elwise(X, e),
          columns = c("test", "replications", "elapsed", "relative"), order = "relative", replications = 1000)

This yields

                  test replications elapsed relative
1  R_matvecprod_elwise(X, e)         1000   10.89    1.000
2  A_matvecprod_elwise(X, e)         1000   26.87    2.467
3  E_matvecprod_elwise(X, e)         1000   27.73    2.546

As you can see, my Rcpp solutions perform quite miserably. Is there any way to do this better?

4 Answers

#1


7  

If you want to speed up your calculations, you will have to be a little careful about not making copies. This usually means sacrificing readability. Here is a version that makes no copies and modifies the matrix X in place.

// [[Rcpp::export]]
NumericMatrix Rcpp_matvecprod_elwise(NumericMatrix & X, NumericVector & y){
  unsigned int ncol = X.ncol();
  unsigned int nrow = X.nrow();
  int counter = 0;
  for (unsigned int j=0; j<ncol; j++) {
    for (unsigned int i=0; i<nrow; i++)  {
      X[counter++] *= y[i];
    }
  }
  return X;
}

Here is what I get on my machine

 > library(microbenchmark)
 > microbenchmark(R=R_matvecprod_elwise(X, e), Arma=A_matvecprod_elwise(X, e),  Rcpp=Rcpp_matvecprod_elwise(X, e))

Unit: milliseconds
 expr       min        lq    median       uq      max neval
    R  8.262845  9.386214 10.542599 11.53498 12.77650   100
 Arma 18.852685 19.872929 22.782958 26.35522 83.93213   100
 Rcpp  6.391219  6.640780  6.940111  7.32773  7.72021   100

> all.equal(R_matvecprod_elwise(X, e), Rcpp_matvecprod_elwise(X, e))
[1] TRUE
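
If you do not want the caller's matrix to be overwritten, you can keep R's usual copy semantics at the price of one extra allocation by cloning the input first. A minimal sketch along the lines of the function above (Rcpp_matvecprod_elwise_copy is a made-up name, and it assumes the same #include <Rcpp.h> / using namespace Rcpp; boilerplate):

// [[Rcpp::export]]
NumericMatrix Rcpp_matvecprod_elwise_copy(const NumericMatrix & X, const NumericVector & y){
  NumericMatrix Z = clone(X);          // deep copy, so the caller's X stays untouched
  unsigned int ncol = Z.ncol();
  unsigned int nrow = Z.nrow();
  int counter = 0;
  for (unsigned int j=0; j<ncol; j++) {
    for (unsigned int i=0; i<nrow; i++)  {
      Z[counter++] *= y[i];
    }
  }
  return Z;
}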

#2


6  

For starters, I'd write the Armadillo version (interface) as

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
arma::mat A_matvecprod_elwise(const arma::mat & X, const arma::vec & y){
  int k = X.n_cols ;
  arma::mat Y = repmat(y, 1, k) ;
  arma::mat out = X % Y;
  return out;
}

as you're doing an additional conversion in and out (though the wrap() gets added by the glue code). The const & is notional (as you learned via your last question, a SEXP is a pointer object that is lightweight to copy) but better style.

You didn't show your benchmark results so I can't comment on the effect of matrix size etc pp. I suspect you might get better answers on rcpp-devel than here. Your pick.

Edit: If you really want something cheap and fast, I would just do this:

// [[Rcpp::export]]
mat cheapHadamard(mat X, vec y) {
    // should check that the row dim of X matches the length of y here
    for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
    return X;
}

which allocates no new memory and will hence be faster, and probably be competitive with R.

Test output:

R> cheapHadamard(testmat, testvec)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   27
R> 
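
The dimension check that the comment in cheapHadamard alludes to could be added roughly like this (a sketch; cheapHadamardChecked is a hypothetical name, and the guard simply uses Rcpp::stop):

// [[Rcpp::export]]
arma::mat cheapHadamardChecked(arma::mat X, arma::vec y) {
    // guard against the mismatch the comment above warns about
    if (X.n_rows != y.n_elem)
        Rcpp::stop("nrow(X) must equal length(y)");
    for (unsigned int i=0; i<y.n_elem; i++) X.row(i) *= y(i);
    return X;
}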

#3


3  

My apologies for giving an essentially C answer to a C++ question, but as has been suggested the solution generally lies in the efficient BLAS implementation of things. Unfortunately, BLAS itself lacks a Hadamard multiply so you would have to implement your own.

Here is a pure Rcpp implementation that basically calls C code. If you want to make it proper C++, the worker function can be templated but for most applications using R that isn't a concern. Note that this also operates "in-place", which means that it modifies X without copying it.

// it may be necessary on your system to uncomment one of the following
//#define restrict __restrict__ // gcc/clang
//#define restrict __restrict   // MS Visual Studio
//#define restrict              // remove it completely

#include <Rcpp.h>
using namespace Rcpp;

#include <cstdlib>
using std::size_t;

void hadamardMultiplyMatrixByVectorInPlace(double* restrict x,
                                           size_t numRows, size_t numCols,
                                           const double* restrict y)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    double* restrict x_col = x + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      x_col[row] *= y[row];
    }
  }
}

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_inplace(NumericMatrix& X,
                                          const NumericVector& y)
{
  // do some dimension checking here

  hadamardMultiplyMatrixByVectorInPlace(X.begin(), X.nrow(), X.ncol(),
                                        y.begin());

  return X;
}

Here is a version that makes a copy first. I don't know Rcpp well enough to do this natively and not incur a substantial performance hit. Creating and returning a NumericMatrix(numRows, numCols) on the stack causes the code to run about 30% slower.

#include <Rcpp.h>
using namespace Rcpp;

#include <cstdlib>
using std::size_t;

#include <R.h>
#include <Rdefines.h>

void hadamardMultiplyMatrixByVector(const double* restrict x,
                                    size_t numRows, size_t numCols,
                                    const double* restrict y,
                                    double* restrict z)
{
  if (numRows == 0 || numCols == 0) return;

  for (size_t col = 0; col < numCols; ++col) {
    const double* restrict x_col = x + col * numRows;
    double* restrict z_col = z + col * numRows;

    for (size_t row = 0; row < numRows; ++row) {
      z_col[row] = x_col[row] * y[row];
    }
  }
}

// [[Rcpp::export]]
SEXP C_matvecprod_elwise(const NumericMatrix& X, const NumericVector& y)
{
  size_t numRows = X.nrow();
  size_t numCols = X.ncol();

  // do some dimension checking here

  SEXP Z = PROTECT(Rf_allocVector(REALSXP, (int) (numRows * numCols)));
  SEXP dimsExpr = PROTECT(Rf_allocVector(INTSXP, 2));
  int* dims = INTEGER(dimsExpr);
  dims[0] = (int) numRows;
  dims[1] = (int) numCols;
  Rf_setAttrib(Z, R_DimSymbol, dimsExpr);

  hadamardMultiplyMatrixByVector(X.begin(), X.nrow(), X.ncol(), y.begin(), REAL(Z));

  UNPROTECT(2);

  return Z;
}
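
For comparison, the simpler Rcpp allocation that the answer describes as roughly 30% slower would look something like the sketch below (C_matvecprod_elwise_rcppalloc is a made-up name; it reuses hadamardMultiplyMatrixByVector from the listing above, and one plausible source of the extra cost is that the NumericMatrix constructor zero-fills the new memory):

// [[Rcpp::export]]
NumericMatrix C_matvecprod_elwise_rcppalloc(const NumericMatrix& X, const NumericVector& y)
{
  int numRows = X.nrow();
  int numCols = X.ncol();

  // do some dimension checking here

  NumericMatrix Z(numRows, numCols);   // allocated (and zero-initialized) by Rcpp
  hadamardMultiplyMatrixByVector(X.begin(), numRows, numCols, y.begin(), Z.begin());
  return Z;
}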

If you're curious about the usage of restrict, it means that you as the programmer enter a contract with the compiler that different bits of memory do not overlap, allowing the compiler to make certain optimizations. The restrict keyword is part of C99 but not of standard C++; many compilers nevertheless provide it as an extension (e.g. __restrict__ in gcc/clang, __restrict in MSVC), which is what the macros at the top of the listing select.

Some R code to benchmark:

require(rbenchmark)

n <- 50000
k <- 50
X <- matrix(rnorm(n*k), nrow=n)
e <- rnorm(n)

R_matvecprod_elwise <- function(mat, vec) mat*vec

all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise(X, e))
X_dup <- X + 0
all.equal(R_matvecprod_elwise(X, e), C_matvecprod_elwise_inplace(X_dup, e))

benchmark(R_matvecprod_elwise(X, e),
          C_matvecprod_elwise(X, e),
          C_matvecprod_elwise_inplace(X, e),
          columns = c("test", "replications", "elapsed", "relative"),
          order = "relative", replications = 1000)

And the results:

                               test replications elapsed relative
3 C_matvecprod_elwise_inplace(X, e)         1000   3.317    1.000
2         C_matvecprod_elwise(X, e)         1000   7.174    2.163
1         R_matvecprod_elwise(X, e)         1000  10.670    3.217

Finally, the in-place version may actually be even faster than the benchmark suggests: repeatedly multiplying into the same matrix across the 1000 replications drives its entries toward overflow, and that numeric mayhem can slow the later runs down.

Edit:

Removed the loop unrolling, as it provided no benefit and was otherwise distracting.

#4


2  

I'd like to build on Sameer's answer, but I don't have enough reputation to comment.

I personally got better performance (about 50%) in Eigen using:

return (y.asDiagonal() * X);

Despite the appearance, this does not create an n x n temporary for y.
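
Wrapped into an exportable function, the idea might look like the following sketch (E_matvecprod_elwise_diag is a made-up name; y.asDiagonal() is only a lightweight diagonal view, and Eigen evaluates the product by scaling row i of X by y(i)):

#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]

// [[Rcpp::export]]
Eigen::MatrixXd E_matvecprod_elwise_diag(const Eigen::Map<Eigen::MatrixXd> X,
                                         const Eigen::Map<Eigen::VectorXd> y) {
  // no n x n temporary: the diagonal view scales the rows of X directly
  return y.asDiagonal() * X;
}

An equivalent column-wise broadcasting form should be (X.array().colwise() * y.array()).matrix(), which likewise avoids the explicit recycling loop from the question.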
