Why do loops in inline functions fail to auto-vectorize properly?

I am trying to vectorize some simple calculations to get a speedup from the SIMD architecture. I also want to make them inline functions, because function calls and non-vectorized code both cost computation time. However, I cannot always get both at once. In fact, most of my inline functions fail to get auto-vectorized. Here is a simple test code that works:

inline void add1(double *v, int Length) {
    for(int i=0; i < Length; i++) v[i] += 1;
}

void call_add1(double v[], int L) {
    add1(v, L);
}

int main(){return 0;}

On Mac OS X 10.12.3, I compile it with:

clang++ -O3 -Rpass=loop-vectorize -Rpass-analysis=loop-vectorize -std=c++11 -ffast-math test.cpp

test.cpp:2:5: remark: vectorized loop (vectorization width: 2, interleaved count: 2) [-Rpass=loop-vectorize]
    for(int i=0; i < Length; i++) v[i] += 1;
    ^

However, something very similar (the arguments are just moved inside call_add1) does not work:

inline void add1(double *v, int Length) {
    for(int i=0; i < Length; i++) v[i] += 1;
}

void call_add1() {
    double v[20]={0,1,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9}; 
    int L=20;
    add1(v, L);
}

int main(){ return 0;}

Compiling with the same command produces no output. Why does this happen? How can I make sure that loops in inline functions always get auto-vectorized? I want to vectorize many function loops, so I hope the fix is not too complex.

3 Solutions

#1

Compiling your code with -fsave-optimization-record shows that the loop was unrolled and then eliminated.

--- !Passed
Pass:            loop-unroll
Name:            FullyUnrolled
DebugLoc:        { File: main.cpp, Line: 2, Column: 5 }
Function:        _Z9call_add1v
Args:            
  - String:          'completely unrolled loop with '
  - UnrollCount:     '20'
  - String:          ' iterations'
...
--- !Passed
Pass:            gvn
Name:            LoadElim
DebugLoc:        { File: main.cpp, Line: 2, Column: 40 }
Function:        _Z9call_add1v
Args:            
  - String:          'load of type '
  - Type:            double
  - String:          ' eliminated'
  - String:          ' in favor of '
  - InfavorOfValue:  '0.000000e+00'

If you put 4000 elements in the array, the loop exceeds the optimizer's full-unroll threshold and clang vectorizes it instead.
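
A minimal sketch of that larger-array experiment, assuming the 4000-element figure from this answer. The name call_add1_large and the returned element are hypothetical additions, there only so the result can be checked. Whether clang actually reports the loop as vectorized will depend on the compiler version and its unroll thresholds.

```cpp
inline void add1(double *v, int length) {
    for (int i = 0; i < length; i++) v[i] += 1;
}

// Hypothetical variant of call_add1 from the question: the trip count is
// still a compile-time constant, but it is large (4000, per the answer),
// so fully unrolling the loop is no longer attractive to the optimizer.
double call_add1_large() {
    constexpr int N = 4000;
    double v[N] = {};   // all elements start at 0.0
    add1(v, N);
    return v[N - 1];    // returned only to make the work observable
}
```

Compiling this with -c and the -Rpass flags from the question should show whether the loop gets a remark on a given toolchain.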

#2

That is because in the second case the compiler knows there are no side effects and optimizes everything away (https://godbolt.org/g/CnojEi). clang 4.0.0 with -O3 leaves only:

call_add1():
  rep ret
main:
  xor eax, eax
  ret

And you get no remark about the loop, because there is no loop left to vectorize.

In the first case the compiler does emit a body for the function, because the function modifies its argument. If you compiled this as an object file, you could link against this function and it would work. I guess that if the parameters were const, the function might also be left with an empty body.

When you print out the contents, the two programs are not identical, but both use vectorized instructions: https://godbolt.org/g/KF1kNt
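
A rough sketch of what "printing out the contents" could look like for the second case. It is not the code behind the link above: the function name call_add1_and_print and the return value are assumptions made for illustration. The compiler may still fold the additions into constants, so this only guarantees that the results are materialized, not that a vectorization remark appears.

```cpp
#include <cstdio>

inline void add1(double *v, int length) {
    for (int i = 0; i < length; i++) v[i] += 1;
}

// Hypothetical variant of the second call_add1: the array is printed after
// the call, so its contents are observable and cannot simply be discarded.
double call_add1_and_print() {
    // Same initializer as in the question: 19 values are listed, so the
    // 20th element is zero-initialized.
    double v[20] = {0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    int L = 20;
    add1(v, L);
    for (int i = 0; i < L; i++) std::printf("%g\n", v[i]);
    return v[0];   // returned only so the result can be checked
}
```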

#3

It looks like the compiler simply unrolls the loop and optimizes it away when v is specified explicitly. That is a good thing: the fastest code is code that does not have to be executed.

To verify that this is an optimization, you could try making some of the variables volatile (live example).
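
One possible way to set up that volatile experiment, as a sketch only. It is not the linked live example: the choice of which variables to mark volatile and the name call_add1_volatile are assumptions. The idea is that a volatile trip count is no longer a compile-time constant, and a volatile store keeps at least part of the result alive. What the optimizer does with the loop still depends on the compiler version.

```cpp
inline void add1(double *v, int length) {
    for (int i = 0; i < length; i++) v[i] += 1;
}

// Hypothetical variant of the second call_add1 using volatile.
double call_add1_volatile() {
    double v[20] = {0, 1, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    volatile int L = 20;          // read at run time, so the trip count is unknown to the optimizer
    add1(v, L);
    volatile double sink = v[0];  // volatile store: this value must actually be written
    return sink;
}
```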
