Why is the generated assembly reordered when using intrinsics?

Date: 2021-10-18 03:12:19

I was playing around a bit with intrinsics, as I needed an O(1) function similar to memcmp() for a fixed input size. I ended up writing this:

#include <stdint.h>
#include <emmintrin.h>

int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = _mm_xor_si128(x[0], y[0]);
    r[1] = _mm_xor_si128(x[1], y[1]);
    t = _mm_or_si128(r[0], r[1]);


    return (ret[0] | ret[1]);
}

which, when compiled, turns into this:

f:
    movdqa  xmm0, XMMWORD PTR [rdi]
    movdqa  xmm1, XMMWORD PTR [rdi+16]
    pxor    xmm0, XMMWORD PTR [rsi]
    pxor    xmm1, XMMWORD PTR [rsi+16]
    por     xmm0, xmm1
    movq    rdx, xmm0
    pextrq  rax, xmm0, 1
    or      rax, rdx
    ret

http://goo.gl/EtovJa (Godbolt Compiler Explorer)

After that though, I became curious as to whether I really needed to use intrinsic functions or whether I only needed the types and I could just use normal operators. I then modified the above code (only the three SSE lines, really) and ended up with this:

#include <stdint.h>
#include <emmintrin.h>

int64_t f (int64_t a[4], int64_t b[4]) {
    __m128i *x = (void *) a, *y = (void *) b, r[2], t;
    int64_t *ret = (void *) &t;

    r[0] = x[0] ^ y[0];
    r[1] = x[1] ^ y[1];
    t = r[0] | r[1];


    return (ret[0] | ret[1]);
}

which instead compiles to this:

f:
    movdqa  xmm0, XMMWORD PTR [rdi+16]
    movdqa  xmm1, XMMWORD PTR [rdi]
    pxor    xmm0, XMMWORD PTR [rsi+16]
    pxor    xmm1, XMMWORD PTR [rsi]
    por     xmm0, xmm1
    movq    rdx, xmm0
    pextrq  rax, xmm0, 1
    or      rax, rdx
    ret

http://goo.gl/oDHF3z (Godbolt Compiler Explorer)

Now functionally (AFAICT), the two compiled assembly outputs are identical. In fact, it appears that they would even take the exact same amount of time and resources; that they would execute identically. However, I am curious as to why the operands in the first four instructions have been moved around. Is there some particular reason as to why one way might be done over the other?

Note: Both of the functions were compiled with GCC, with identical flags.

1 answer

#1


3  

TL;DR: From the compiler's point of view, the two inputs are different code. They can take different paths through the compiler and hit different checks along the way, which can make the output differ.

You won't see this in (a current) clang, since the intrinsics disappear by the time the code reaches IR (the intermediate representation of your code that LLVM uses). The IR is eventually transformed into instructions, and the IR for both cases is the same.

If you check out that code with clang or with different versions of gcc, you'll see slight changes in the instruction scheduling. These changes are usually due to changes in the compiler's instruction scheduler or register allocator from version to version.

Try this out with the two functions you provided in the same file. Try different versions of gcc, and try different versions of clang. Clang only changes the ordering of the movd instruction, and it always emits both functions with the same instructions, since the LLVM backend gets the same IR for both cases.
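For that experiment, a single file along these lines might be a convenient starting point. This is only a sketch, not the exact code from the question: the names `f_intrin`, `f_ops` and `fold_or` are made up for illustration, unaligned loads (`_mm_loadu_si128`) replace the pointer casts so the inputs need not be 16-byte aligned, and it assumes an x86-64 target with GCC or clang (the `^` and `|` operators on `__m128i` are a compiler vector extension). Because of those changes, the emitted instructions may differ from the listings in the question.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: OR the two 64-bit halves of a vector together. */
static inline int64_t fold_or(__m128i t) {
    __m128i hi = _mm_unpackhi_epi64(t, t);  /* move the high half into the low lane */
    return (int64_t)(_mm_cvtsi128_si64(t) | _mm_cvtsi128_si64(hi));
}

/* Variant 1: explicit intrinsics. Returns 0 iff a and b are equal. */
int64_t f_intrin(const int64_t a[4], const int64_t b[4]) {
    __m128i x0 = _mm_loadu_si128((const __m128i *)&a[0]);
    __m128i x1 = _mm_loadu_si128((const __m128i *)&a[2]);
    __m128i y0 = _mm_loadu_si128((const __m128i *)&b[0]);
    __m128i y1 = _mm_loadu_si128((const __m128i *)&b[2]);
    __m128i r0 = _mm_xor_si128(x0, y0);
    __m128i r1 = _mm_xor_si128(x1, y1);
    return fold_or(_mm_or_si128(r0, r1));
}

/* Variant 2: plain operators on the vector type (GCC/clang extension). */
int64_t f_ops(const int64_t a[4], const int64_t b[4]) {
    __m128i x0 = _mm_loadu_si128((const __m128i *)&a[0]);
    __m128i x1 = _mm_loadu_si128((const __m128i *)&a[2]);
    __m128i y0 = _mm_loadu_si128((const __m128i *)&b[0]);
    __m128i y1 = _mm_loadu_si128((const __m128i *)&b[2]);
    __m128i r0 = x0 ^ y0;
    __m128i r1 = x1 ^ y1;
    return fold_or(r0 | r1);
}
```

Compiling the file with something like `gcc -O2 -S -masm=intel` (or pasting it into Compiler Explorer) and switching compiler versions should show whether the two bodies come out with the same instruction order.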

I don't know the internals of GCC, but I suppose the two functions happen not to hit exactly the same places in the scheduler's code, and so end up emitting the loads in a different order. This could happen because, in one case, a call to an intrinsic might not be lowered to an intermediate representation and instead stays as an intrinsic (not a function) call.
