max and min functions similar to colMeans

Are there fast min and max functions that work on columns the way colMeans does?

For max, I can simulate the behavior with apply, as follows:

colMax <- function (colData) {
    apply(colData, MARGIN=c(2), max)
}

However, this seems a lot slower than colMeans in the base package.

5 Answers

#1

pmax is about 10x faster than apply here, though still not as fast as colMeans.

data = matrix(rnorm(10^6), 100)   # 100 rows, 10^4 columns
data.df = data.frame(t(data))     # transpose: each row of data becomes one pmax argument

system.time(apply(data, MARGIN=c(2), max))
system.time(do.call(pmax, data.df))
system.time(colMeans(data))

> system.time(apply(data, MARGIN=c(2), max))
   user  system elapsed 
  0.133   0.006   0.139 
> system.time(do.call(pmax, data.df))
   user  system elapsed 
  0.013   0.000   0.013 
> system.time(colMeans(data))
   user  system elapsed 
  0.003   0.000   0.002
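
The same do.call trick should carry over to column minima via pmin. A minimal sketch on a small matrix, checking both results against apply (the names m, rows_as_cols, col_max and col_min are illustrative, not from the answer above):

```r
set.seed(1)                           # reproducible illustrative data
m <- matrix(rnorm(20), nrow = 4)      # 4 rows, 5 columns

# pmax/pmin work element-wise across their arguments. Passing the rows
# of m (the columns of t(m)) as arguments yields one value per column of m.
rows_as_cols <- data.frame(t(m))      # 5 rows, 4 columns

col_max <- do.call(pmax, rows_as_cols)
col_min <- do.call(pmin, rows_as_cols)

# Both should agree with the apply-based versions
stopifnot(isTRUE(all.equal(col_max, apply(m, 2, max))))
stopifnot(isTRUE(all.equal(col_min, apply(m, 2, min))))
```

Note that the transpose and the data.frame conversion are not free, so whether this beats apply likely depends on the shape of the matrix.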

#2

One can always start with profiling, but your hunch seems correct:

R> colMax <- function(X) apply(X, 2, max)
R> library(rbenchmark)
R> Z <- matrix(rnorm(100*100), 100, 100)
R> benchmark(colMeans(Z), colMax(Z))
         test replications elapsed relative user.self sys.self user.child 
2   colMax(Z)          100   0.350     87.5      0.12        0          0 
1 colMeans(Z)          100   0.004      1.0      0.00        0          0 
R>

In that case you may want to consider writing a simple C/C++ function, using the inline package with either the basic C API for R or our Rcpp package. That should get you colMeans-like speed.

Edit: Here is a more complete example. colMeans still wins, but we're getting closer:

R> suppressMessages(library(inline))
R> suppressMessages(library(rbenchmark))
R>
R> colMaxR <- function(X) apply(X, 2, max)
R>
R> colMaxRcpp <- cxxfunction(signature(X_="numeric"), plugin="Rcpp",
+                           body='
+   Rcpp::NumericMatrix X(X_);
+   int n = X.ncol();
+   Rcpp::NumericVector V(n);
+   for (int i=0; i<n; i++) {
+      Rcpp::NumericVector W = X.column(i);
+      V[i] = *std::max_element(W.begin(), W.end());  // from the STL
+   }
+   return(V);
+ ')
R>
R>
R> Z <- matrix(rnorm(100*100), 100, 100)
R> benchmark(colMeans(Z), colMaxR(Z), colMaxRcpp(Z), replications=1000, order="relative")
           test replications elapsed relative user.self sys.self user.child 
1   colMeans(Z)         1000   0.036  1.00000      0.04        0          0 
3 colMaxRcpp(Z)         1000   0.050  1.38889      0.05        0          0 
2    colMaxR(Z)         1000   1.002 27.83333      1.01        0          0 
R>

#3

I am posting an answer only because I don't have enough reputation to comment or vote up/down yet.

The top answer's claim that pmax is ~10x faster than apply does not always hold. For example, calculate the max for 10^6 numbers in each column.

library(matrixStats)                # provides colMaxs

data <- matrix(rnorm(10^8), 10^6)   # 10^6 rows, 100 columns
data.t <- t(data)                   # 100 rows, 10^6 columns
data.df <- data.frame(data)
data.t.df = data.frame(data.t)

system.time(a <- apply(data, MARGIN=c(2), max))
system.time(b <- sapply(data.df, max))
system.time(e <- sapply(seq_len(ncol(data)), function(x) max(data[, x])))
system.time(c <- do.call(pmax, data.t.df))
system.time(d <- colMaxs(data))

> system.time(a <- apply(data, MARGIN=c(2), max))
   user  system elapsed 
      2       0       2 
> system.time(b <- sapply(data.df, max))
   user  system elapsed 
   0.25    0.00    0.25 
> system.time(e <- sapply(seq_len(ncol(data)), function(x) max(data[, x])))
   user  system elapsed 
   0.83    0.00    0.83 
> system.time(c <- do.call(pmax, data.t.df))
   user  system elapsed 
  15.94    0.00   15.96 
> system.time(d <- colMaxs(data))
   user  system elapsed 
   0.21    0.00    0.20

Now calculate the max for 100 numbers in each column.

system.time(a <- apply(data.t, MARGIN=c(2), max))
system.time(b <- sapply(data.t.df, max))
system.time(e <- sapply(seq_len(ncol(data.t)), function(x) max(data.t[, x])))
system.time(c <- do.call(pmax, data.df))
system.time(d <- colMaxs(data.t))

> system.time(a <- apply(data.t, MARGIN=c(2), max))
   user  system elapsed 
   4.41    0.00    4.42 
> system.time(b <- sapply(data.t.df, max))
   user  system elapsed 
   3.23    0.00    3.23 
> system.time(e <- sapply(seq_len(ncol(data.t)), function(x) max(data.t[, x])))
   user  system elapsed 
   3.57    0.00    3.57 
> system.time(c <- do.call(pmax, data.df))
   user  system elapsed 
   1.56    0.00    1.56 
> system.time(d <- colMaxs(data.t))
   user  system elapsed 
   0.25    0.00    0.25
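
As a sanity check, the base R variants timed above can be compared on a small matrix to confirm they return the same values. A minimal sketch (the 5 x 4 matrix and the names m, m.df, m.t.df and p are illustrative):

```r
set.seed(42)
m <- matrix(rnorm(5 * 4), nrow = 5)   # 5 rows, 4 columns
m.df <- data.frame(m)                 # columns of m as list elements
m.t.df <- data.frame(t(m))            # rows of m as list elements

a <- apply(m, 2, max)
b <- sapply(m.df, max)                # carries the column names X1..X4
e <- sapply(seq_len(ncol(m)), function(j) max(m[, j]))
p <- do.call(pmax, m.t.df)

# All four approaches should give the same column maxima
stopifnot(
  isTRUE(all.equal(a, unname(b))),
  isTRUE(all.equal(a, e)),
  isTRUE(all.equal(a, p))
)
```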

pmax seems to match or beat apply in speed only when the number of rows is small (e.g. 100). When the number of rows is large (e.g. 10^6), pmax is much slower than apply.

In any case, colMaxs in the matrixStats package is the fastest, and seems to be the way to go.

#4

The matrixStats package has a lot of great functions, including colMaxs.
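
A short sketch of what that might look like, using colMaxs together with its siblings colMins and colRanges on a tiny matrix. The apply-based branch is only an illustrative fallback for when matrixStats is not installed, and the variable names are hypothetical:

```r
m <- matrix(c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8), nrow = 3)  # 3 rows, 4 columns

if (requireNamespace("matrixStats", quietly = TRUE)) {
  col_max <- matrixStats::colMaxs(m)
  col_min <- matrixStats::colMins(m)
  col_rng <- matrixStats::colRanges(m)   # one row per column of m: min, max
} else {
  # Fallback for illustration only when matrixStats is unavailable
  col_max <- apply(m, 2, max)
  col_min <- apply(m, 2, min)
  col_rng <- t(apply(m, 2, range))
}

col_max   # 4 9 6 8
col_min   # 1 1 2 3
```

The package also has row-wise counterparts (rowMaxs, rowMins), which avoids transposing when the maxima are needed per row.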

#5

pmin and pmax can be used easily to get row mins and maxes, but it's a bit awkward for columns.

# row maxes
do.call("pmax",mtcars)
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0
[25] 400.0  79.0 120.3 113.0 351.0 175.0 335.0 121.0

# col maxes
do.call("pmax",data.frame(t(mtcars)))
 [1]  33.900   8.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000
[10]   5.000   8.000

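Another option is max.col, which (confusingly) also works row-wise: for each row it returns the column index of the maximum, so for column maxes it has to be applied to the transpose and the resulting positions used to index the matrix.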

mmtcars <- as.matrix(mtcars)
mmtcars[max.col(t(mmtcars))+(seq(dim(mmtcars)[2])-1)*dim(mmtcars)[1]]
 [1]  33.900   8.000 472.000 335.000   4.930   5.424  22.900   1.000   1.000
[10]   5.000   8.000
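
The linear-index arithmetic above can be hard to read. A possibly clearer sketch of the same idea indexes with a two-column (row, column) matrix instead. The names row_of_max and col_max are illustrative, and ties.method = "first" is an added assumption that makes the choice among tied maxima deterministic (the default is "random", which does not change the values here):

```r
mmtcars <- as.matrix(mtcars)

# For each column of mmtcars, find the row holding its maximum.
# max.col() works row-wise, hence the transpose.
row_of_max <- max.col(t(mmtcars), ties.method = "first")

# Pick one element per column using (row, column) pairs
col_max <- mmtcars[cbind(row_of_max, seq_len(ncol(mmtcars)))]
names(col_max) <- colnames(mmtcars)

# Should agree with the apply-based result
stopifnot(isTRUE(all.equal(col_max, apply(mmtcars, 2, max))))
```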
