Fast multiplication of values in an array

Time: 2021-06-06 21:20:18

Is there a fast way to multiply values of a float array in C++, to optimize this function (where count is a multiple of 4):

void multiply(float* values, float factor, int count)
{
    for(int i=0; i < count; i++)
    {
        *values *= factor;
        values++;
    }
}

A solution must work on Mac OS X and Windows, Intel and non-Intel. Think SSE, vectorization, compiler (gcc vs. MSVC).

7 solutions

#1


2  

If you want your code to be cross-platform, then either you're going to have to write platform-independent code, or you're going to have to write a load of #ifdefs.

Have you tried some manual loop unrolling, and seeing if it makes any difference?

#2


2  

Since you know the count is a multiple of 4, you can unroll your loop...

void multiply(float* values, float factor, int count)
{
    count = count >> 2; // count / 4
    for(int i=0; i < count ; i++)
    {
        *values *= factor;
        *(values+1) *= factor;
        *(values+2) *= factor;
        *(values+3) *= factor;
        values += 4;
    }
}

#3


2  

Disclaimer: obviously, this won't work on iPhone, iPad, Android, or their future equivalents.

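#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_*_ps

void multiply(float* values, float factor, int count)
{
    // Broadcast the factor into all four lanes.
    __m128 factor4 = _mm_set1_ps(factor);
    int i = 0;
    // Four floats per iteration; unaligned load/store, so no alignment requirement.
    for (; i+3 < count; i += 4)
    {
       __m128 data = _mm_mul_ps(_mm_loadu_ps(values), factor4);
       _mm_storeu_ps(values, data);
       values += 4;
    }
    // Scalar tail for the remaining 0..3 elements.
    for (; i < count; i++)
    {
       *values *= factor;
       values++;
    }
}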

#4


2  

Have you thought of OpenMP?

Most modern computers have multi-core CPUs and nearly every major compiler seems to have OpenMP built-in. You gain speed at barely any cost.

See Wikipedia's article on OpenMP.
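
For illustration, a minimal sketch of what that could look like here. The name multiply_omp is made up for this example. It assumes the compiler is invoked with its OpenMP switch (-fopenmp for GCC, /openmp for MSVC); without it the pragma is simply ignored and the loop runs on one thread. For a loop this cheap the gain may well be limited by memory bandwidth, so measure before relying on it.

```cpp
// Hypothetical helper: scales every element in place.
// With OpenMP enabled, the iterations are divided among the available threads.
void multiply_omp(float* values, float factor, int count)
{
    // A signed int loop index keeps this compatible with older OpenMP versions.
    #pragma omp parallel for
    for (int i = 0; i < count; ++i)
    {
        values[i] *= factor;
    }
}
```

Each iteration touches only its own element, so no synchronization is needed between threads.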

#5


0  

The best solution is to keep it simple and let the compiler optimize it for you. GCC knows about SSE, SSE2, AltiVec and more. If your code is too complex, your compiler won't be able to optimize it on every possible target.
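
As a sketch of what "keeping it simple" could mean in practice: a plain indexed loop with nothing clever in it, which is the shape auto-vectorizers tend to handle well. The name multiply_simple is just for this example. With GCC, -O3 enables auto-vectorization and -fopt-info-vec reports which loops were vectorized; with MSVC, /O2 plus /Qvec-report:1 gives a similar report. Whether this particular loop gets vectorized depends on the compiler version and target, so check the report rather than assume.

```cpp
// Hypothetical helper: the simplest possible form of the loop.
// No pointer arithmetic, no manual unrolling; the compiler is free
// to pick whatever SIMD width the target supports.
void multiply_simple(float* values, float factor, int count)
{
    for (int i = 0; i < count; ++i)
    {
        values[i] *= factor;
    }
}
```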

#6


0  

As you mentioned, there are numerous architectures out there that have SIMD extensions, and SIMD is probably your best bet when it comes to optimization. However, they are all platform-specific, and C and C++ as languages are not SIMD-friendly.

The first thing you should try however is enabling the SIMD specific flags for your given build. The compiler may recognize patterns that can be optimized with SIMD.

The next thing is to write platform-specific SIMD code using compiler intrinsics or assembly where appropriate. You should however keep a portable non-SIMD implementation for platforms that do not have an optimized version. Use #ifdefs to enable the SIMD path on platforms that support it.
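
A rough sketch of that layout, under a few assumptions: the function name multiply_simd and the MULTIPLY_USE_* macros are invented for this example, and the feature checks rely on the predefined macros __SSE__ (GCC/Clang), _M_X64 / _M_IX86_FP (MSVC) and __ARM_NEON / __ARM_NEON__ (ARM compilers). Other toolchains may need different checks.

```cpp
// Pick a SIMD path at compile time; fall through to plain C++ otherwise.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  #define MULTIPLY_USE_SSE 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
  #define MULTIPLY_USE_NEON 1
#endif

// Hypothetical helper: multiplies count floats in place by factor.
void multiply_simd(float* values, float factor, int count)
{
    int i = 0;
#if defined(MULTIPLY_USE_SSE)
    // x86 SSE: four floats per iteration, unaligned load/store.
    const __m128 f4 = _mm_set1_ps(factor);
    for (; i + 3 < count; i += 4)
        _mm_storeu_ps(values + i, _mm_mul_ps(_mm_loadu_ps(values + i), f4));
#elif defined(MULTIPLY_USE_NEON)
    // ARM NEON: four floats per iteration, multiply vector by scalar.
    for (; i + 3 < count; i += 4)
        vst1q_f32(values + i, vmulq_n_f32(vld1q_f32(values + i), factor));
#endif
    // Portable path: the whole array when no SIMD is available,
    // otherwise just the 0..3 leftover elements.
    for (; i < count; ++i)
        values[i] *= factor;
}
```

The scalar loop at the end doubles as the fallback and the tail handler, so there is only one piece of portable code to maintain.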

Lastly, at least on ARM but not sure on Intel, be aware that smaller integer and floating point types allow a larger number of parallel operations per single SIMD instruction.

#7


0  

I don't think there is a lot you can do that makes a big difference. Maybe you can speed it up a little with OpenMP or SSE, but modern CPUs are quite fast already. In some applications memory bandwidth / latency is actually the bottleneck, and it is getting worse. We already have three levels of cache and need smart prefetch algorithms to avoid huge delays. So it makes sense to think about memory access patterns as well. For example, if you implement such a multiply and an add and use them like this:

void multiply(float vec[], float factor, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] *= factor;
}

void add(float vec[], float summand, int size)
{
  for (int i=0; i<size; ++i)
    vec[i] += summand;
}

void foo(float vec[], int size)
{
  multiply(vec,2.f,size);
  add(vec,9.f,size);
}

you're basically passing twice over the block of memory. Depending on the vector's size it might not fit into the L1 cache in which case passing over it twice adds some extra time. This is obviously bad and you should try to keep memory accesses "local". In this case, a single loop

void foo(float vec[], int size)
{
  for (int i=0; i<size; ++i) {
    vec[i] = vec[i]*2+9;
  }
}

is likely to be faster. As a rule of thumb: Try to access memory linearly and try to access memory "locally" by which I mean, try to reuse the data that is already in the L1 cache. Just an idea.
