Fastest way to calculate the abs() values of a complex array

Date: 2022-12-27 21:21:51

I want to calculate the absolute values of the elements of a complex array in C or C++. The easiest way would be

for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

But for large vectors that will be slow. Is there a way to speed that up (by using parallelization, for example)? Language can be either C or C++.

6 Answers

#1


14  

Given that all loop iterations are independent, you can use the following code for parallelization:

#pragma omp parallel for
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

Of course, to use this you should enable OpenMP support when compiling your code (usually with the -fopenmp flag for GCC/Clang, the /openmp flag for MSVC, or via the project options).
You can find several examples of OpenMP usage in the wiki.

#2


5  

Use vector operations.

If you have glibc 2.22 (pretty recent), you can use the SIMD capabilities of OpenMP 4.0 to operate on vectors/arrays.

Libmvec is a vector math library added in glibc 2.22.

The vector math library was added to support the SIMD constructs of OpenMP 4.0 (§2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by adding vector implementations of vector math functions.

Vector math functions are vector variants of corresponding scalar math operations implemented using SIMD ISA extensions (e.g. SSE or AVX for x86_64). They take packed vector arguments, perform the operation on each element of the packed vector argument, and return a packed vector result. Using vector math functions is faster than repeatedly calling the scalar math routines.

Also, see Parallel for vs omp simd: when to use each?

If you're running on Solaris, you can explicitly use vhypot() from the math vector library libmvec.so to operate on a vector of complex numbers to obtain the absolute value of each:

Description

These functions evaluate the function hypot(x, y) for an entire vector of values at once. ...

The source code for libmvec can be found at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/ and the vhypot() code specifically at http://src.illumos.org/source/xref/illumos-gate/usr/src/lib/libmvec/common/__vhypot.c. I don't recall whether Sun Microsystems ever provided a Linux version of libmvec.so.

#3


4  

Or use Concurrency::parallel_for (from Microsoft's Parallel Patterns Library) like this:

Concurrency::parallel_for(0, N, [&a, &b](int i)
{
    b[i] = cabs(a[i]);
});

#4


4  

If you are using a modern compiler (GCC 5, for example), you can use Cilk+, which gives you a nice array notation, automatic use of SIMD instructions, and parallelisation.

So, if you want to run them in parallel you would do:

#include <cilk/cilk.h>

cilk_for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

or if you want to test SIMD:

#pragma simd
for(int i = 0; i < N; i++)
{
    b[i] = cabs(a[i]);
}

But, the nicest part of Cilk is that you can just do:

b[:] = cabs(a[:]);

In this case, the compiler and the runtime environment decide at which level to apply SIMD and what to parallelise (the optimal approach is applying SIMD to large-ish chunks in parallel). Since this is decided by a work scheduler at runtime, Intel claims it can provide near-optimal scheduling and make optimal use of the cache.

#5


4  

Using #pragma simd (even with -Ofast) or relying on the compiler's auto-vectorization are further examples of why it's a bad idea to blindly expect your compiler to implement SIMD efficiently. To use SIMD efficiently for this you need an array of structs of arrays. For example, for single-precision floats with a SIMD width of 4, you could use

//struct of arrays of four complex numbers
struct c4 {
    float x[4];  // real values of four complex numbers 
    float y[4];  // imaginary values of four complex numbers
};

Here is code showing how you could do this with SSE for the x86 instruction set.

#include <stdio.h>
#include <x86intrin.h>
#define N 10

struct c4{
    float x[4];
    float y[4];
};

static inline void cabs_soa4(struct c4 *a, float *b) {
    __m128 x4 = _mm_loadu_ps(a->x);
    __m128 y4 = _mm_loadu_ps(a->y);
    __m128 b4 = _mm_sqrt_ps(_mm_add_ps(_mm_mul_ps(x4,x4), _mm_mul_ps(y4,y4)));
    _mm_storeu_ps(b, b4);
}  

int main(void)
{
    int n4 = ((N+3)&-4)/4;  //choose next multiple of 4 and divide by 4
    printf("%d\n", n4);
    struct c4  a[n4];  //array of struct of arrays
    for(int i=0; i<n4; i++) {
        for(int j=0; j<4; j++) { a[i].x[j] = 1, a[i].y[j] = -1;}
    }
    float b[4*n4];
    for(int i=0; i<n4; i++) {
        cabs_soa4(&a[i], &b[4*i]);
    }
    for(int i = 0; i<N; i++) printf("%.2f ", b[i]); puts("");
}

It may help to unroll the loop a few times. In any case, all this is moot for large N because the operation is memory-bandwidth bound. For large N (meaning when the memory used is much larger than the last-level cache), although #pragma omp parallel may help somewhat, the best solution is not to do this for large N at all. Instead, do it in chunks that fit in the lowest-level cache, interleaved with the other compute operations. I mean something like this:

for(int i = 0; i < nchunks; i++) {
    for(int j = 0; j < chunk_size; j++) {
        b[i*chunk_size+j] = cabs(a[i*chunk_size+j]);
    }
    foo(&b[i*chunk_size]); // foo is computationally intensive.
}

I did not implement an array of structs of arrays here, but it should be easy to adjust the code for that.

#6


2  

Also, you can use std::future and std::async (they are part of C++11); this may be a clearer way of achieving what you want:

#include <future>

...

int main()
{
    ...

    // Create async calculations
    std::future<void> *futures = new std::future<void>[N];
    for (int i = 0; i < N; ++i)
    {
        futures[i] = std::async([&a, &b, i]
        {
            b[i] = std::sqrt(a[i]);
        });
    }
    // Wait for calculation of all async procedures
    for (int i = 0; i < N; ++i)
    {
        futures[i].get();
    }
    delete[] futures;

    ...

    return 0;
}

IdeOne live code

We first create the asynchronous procedures and then wait until everything is calculated.
Here I use sqrt instead of cabs because I just don't know what cabs is. I'm sure it doesn't matter.
Also, you may find this link useful: cplusplus.com
