I made the following implementation of the median in C++ and used it in R via Rcpp:
#include <Rcpp.h>

// [[Rcpp::export]]
double median2(std::vector<double> x){
  double median;
  size_t size = x.size();
  std::sort(x.begin(), x.end());
  if (size % 2 == 0){
    median = (x[size / 2 - 1] + x[size / 2]) / 2.0;
  }
  else {
    median = x[size / 2];
  }
  return median;
}
If I subsequently compare the performance with the standard built-in R median function, I get the following results via microbenchmark
> x = rnorm(100)
> microbenchmark(median(x),median2(x))
Unit: microseconds
       expr    min     lq     mean median     uq     max neval
  median(x) 25.469 26.990 34.96888 28.130 29.081 518.126   100
 median2(x)  1.140  1.521  2.47486  1.901  2.281  47.897   100
Why is the standard median function so much slower? This isn't what I would expect...
3 Answers
#1 (13 votes)
As noted by @joran, your code is very specialized, and generally speaking, less generalized functions, algorithms, etc... are often more performant. Take a look at median.default:
median.default
# function (x, na.rm = FALSE)
# {
#   if (is.factor(x) || is.data.frame(x))
#     stop("need numeric data")
#   if (length(names(x)))
#     names(x) <- NULL
#   if (na.rm)
#     x <- x[!is.na(x)]
#   else if (any(is.na(x)))
#     return(x[FALSE][NA])
#   n <- length(x)
#   if (n == 0L)
#     return(x[FALSE][NA])
#   half <- (n + 1L)%/%2L
#   if (n%%2L == 1L)
#     sort(x, partial = half)[half]
#   else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L])
# }
There are several operations in place to accommodate the possibility of missing values, and these will definitely impact the overall execution time of the function. Since your function does not replicate this behavior it can eliminate a bunch of calculations, but consequently will not provide the same result for vectors with missing values:
median(c(1, 2, NA))
#[1] NA
median2(c(1, 2, NA))
#[1] 2
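For completeness, here is a hypothetical NA-aware variant (not part of the original answer, names are my own) that mimics median.default's behavior: it returns NA when the input contains missing values and na.rm = FALSE, and drops them otherwise. These extra checks are exactly the kind of work median.default pays for on every call.

#include <Rcpp.h>

// Hypothetical sketch: an NA-aware median mirroring median.default's semantics.
// [[Rcpp::export]]
double cpp_med_na(Rcpp::NumericVector xx, bool na_rm = false) {
  Rcpp::NumericVector x = Rcpp::clone(xx);             // leave the caller's vector untouched
  Rcpp::NumericVector y = x[!Rcpp::is_na(x)];          // drop missing values
  if (!na_rm && y.size() < x.size()) return NA_REAL;   // any NA with na.rm = FALSE -> NA
  std::size_t size = y.size();
  if (size == 0) return NA_REAL;                       // empty input -> NA, like median.default
  std::sort(y.begin(), y.end());
  if (size % 2 == 0) return (y[size / 2 - 1] + y[size / 2]) / 2.0;
  return y[size / 2];
}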
A couple of other factors which probably don't have as much of an effect as the handling of NAs, but are worth pointing out:
- median, along with a handful of the functions it uses, are S3 generics, so there is a small amount of time spent on method dispatch
- median will work with more than just integer and numeric vectors; it will also handle Date, POSIXt, and probably a bunch of other classes, and preserve attributes correctly:
median(Sys.Date() + 0:4)
#[1] "2016-01-15"
median(Sys.time() + (0:4) * 3600 * 24)
#[1] "2016-01-15 11:14:31 EST"
Edit: I should mention that the function below will cause the original vector to be sorted since NumericVectors are proxy objects. If you want to avoid this, you can either Rcpp::clone the input vector and operate on the clone, or use your original signature (with a std::vector<double>), which implicitly requires a copy in the conversion from SEXP to std::vector.
Also note that you can shave off a little more time by using a NumericVector instead of a std::vector<double>:
#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med(Rcpp::NumericVector x){
  std::size_t size = x.size();
  std::sort(x.begin(), x.end());
  if (size % 2 == 0) return (x[size / 2 - 1] + x[size / 2]) / 2.0;
  return x[size / 2];
}
microbenchmark::microbenchmark(
  median(x),
  median2(x),
  cpp_med(x),
  times = 200L
)
# Unit: microseconds
#        expr    min      lq      mean  median      uq     max neval
#   median(x) 74.787 81.6485 110.09870 92.5665 129.757 293.810   200
#  median2(x)  6.474  7.9665  13.90126 11.0570  14.844 151.817   200
#  cpp_med(x)  5.737  7.4285  11.25318  9.0270  13.405  52.184   200
Yakk brought up a great point in the comments above (also elaborated on by Jerry Coffin) about the inefficiency of doing a complete sort. Here's a rewrite using std::nth_element, benchmarked on a much larger vector:
#include <Rcpp.h>

// [[Rcpp::export]]
double cpp_med2(Rcpp::NumericVector xx) {
  Rcpp::NumericVector x = Rcpp::clone(xx);              // don't modify the caller's vector
  std::size_t n = x.size() / 2;
  std::nth_element(x.begin(), x.begin() + n, x.end());  // partial sort: x[n] ends up in its sorted position
  if (x.size() % 2) return x[n];                        // odd length: x[n] is the median
  // even length: average x[n] with the largest of the elements below it
  return (x[n] + *std::max_element(x.begin(), x.begin() + n)) / 2.;
}
set.seed(123)
xx <- rnorm(10e5)
all.equal(cpp_med2(xx), median(xx))
all.equal(median2(xx), median(xx))
microbenchmark::microbenchmark(
  cpp_med2(xx), median2(xx),
  median(xx), times = 200L
)
# Unit: milliseconds
#          expr      min       lq     mean   median       uq       max neval
#  cpp_med2(xx) 10.89060 11.34894 13.15313 12.72861 13.56161  33.92103   200
#   median2(xx) 84.29518 85.47184 88.57361 86.05363 87.70065 228.07301   200
#    median(xx) 46.18976 48.36627 58.77436 49.31659 53.46830 250.66939   200
#2 (2 votes)
[This is more of an extended comment than an answer to the question you actually asked.]
Even your code may be open to significant improvement. In particular, you're sorting the entire input even though you only care about one or two elements.
You can change this from O(n log n) to O(n) by using std::nth_element instead of std::sort. In the case of an even number of elements, you'd typically want to use std::nth_element to find the element just before the middle, then use std::min_element to find the immediately succeeding element. But std::nth_element also partitions the input items, so std::min_element only has to run on the items above the middle after the nth_element call, not on the entire input array. That is, after nth_element, every element before the chosen position is less than or equal to it and every element after it is greater than or equal to it, so the immediately succeeding value in sorted order is simply the minimum of the upper partition.
The complexity of std::nth_element is "linear on average", and (of course) std::min_element is linear as well, so the overall complexity is linear.
So, for the simple case (odd number of elements), you get something like:
auto pos = x.begin() + x.size()/2;          // iterator to the middle element
std::nth_element(x.begin(), pos, x.end());  // place the median at that position
return *pos;
...and for the more complex case (even number of elements):
auto pos = x.begin() + x.size()/2 - 1;          // element just below the middle
std::nth_element(x.begin(), pos, x.end());
auto pos2 = std::min_element(pos + 1, x.end()); // smallest element above it, i.e. the upper middle
return (*pos + *pos2) / 2.0;
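Putting the two snippets together, a minimal self-contained sketch (my own, assuming a non-empty vector and no NA handling; median_nth is a hypothetical name) might look like this:

#include <algorithm>
#include <vector>

// Sketch: median via nth_element, O(n) on average. Takes the vector by value
// so the caller's data is left in its original order.
double median_nth(std::vector<double> x) {
    auto pos = x.begin() + (x.size() - 1) / 2;       // lower middle for even sizes, middle for odd
    std::nth_element(x.begin(), pos, x.end());       // partition: everything before pos is <= *pos
    if (x.size() % 2 == 1) return *pos;              // odd count: *pos is the median
    auto pos2 = std::min_element(pos + 1, x.end());  // even count: smallest element above the lower middle
    return (*pos + *pos2) / 2.0;
}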
#3 (0 votes)
I'm not sure what "standard" implementation you would be referring to.
Anyway: If there were one, it would, being part of a standard library, certainly not be allowed to change the order of elements in the vector (as your implementation does), so it would definitely have to work on a copy.
Creating this copy would take time and CPU (and significant memory), which would affect the run time.