Looping over an array with inline assembly

Posted: 2022-07-04 03:13:37

When looping over an array with inline assembly, should I use the register modifier "r" or the memory modifier "m"?

Let's consider an example which adds two float arrays x and y and writes the result to z. Normally I would use intrinsics to do this, like this:

for(int i=0; i<n/4; i++) {
    __m128 x4 = _mm_load_ps(&x[4*i]);
    __m128 y4 = _mm_load_ps(&y[4*i]);
    __m128 s = _mm_add_ps(x4,y4);
    _mm_store_ps(&z[4*i], s);
}

Here is the inline assembly solution I have come up with using the register modifier "r":

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

This generates similar assembly to GCC. The main difference is that GCC adds 16 to the index register and uses a scale of 1 whereas the inline-assembly solution adds 4 to the index register and uses a scale of 4.
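
The two loop shapes can be sketched like this (a hand-written illustration, not actual compiler output; register names are arbitrary):

```
# GCC's choice: byte offset in %rax, scale 1, stepped by 16 per vector
movaps  (%rsi,%rax), %xmm0
...
addq    $16, %rax

# inline-asm version: element index i in %rax, scale 4, stepped by 4 elements
movaps  (%rsi,%rax,4), %xmm0
...
addl    $4, %eax
```

Both forms cost the same on modern x86; the difference is only which induction variable (byte offset vs. element index) the addressing mode scales.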

I was not able to use a general register for the iterator. I had to specify one which in this case was rax. Is there a reason for this?

Here is the solution I came up with using the memory modifier "m":

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

This is less efficient as it does not use an index register and instead has to add 16 to the base register of each array. The generated assembly is (gcc (Ubuntu 5.2.1-22ubuntu2) with gcc -O3 -S asmtest.c):

.L22:
    movaps   (%rsi), %xmm0
    addps    (%rdi), %xmm0
    movaps   %xmm0, (%rdx)
    addl    $4, %eax
    addq    $16, %rdx
    addq    $16, %rsi
    addq    $16, %rdi
    cmpl    %eax, %ecx
    ja      .L22

Is there a better solution using the memory modifier "m"? Is there some way to get it to use an index register? The reason I ask is that it seemed more logical to me to use the memory modifier "m", since I am reading and writing memory. Additionally, with the register modifier "r" I never use an output operand list, which seemed odd to me at first.

Maybe there is a better solution than using "r" or "m"?

Here is the full code I used to test this:

#include <stdio.h>
#include <x86intrin.h>

#define N 64

void add_intrin(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __m128 x4 = _mm_load_ps(&x[i]);
        __m128 y4 = _mm_load_ps(&y[i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[i], s);
    }
}

void add_intrin2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i++) {
        __m128 x4 = _mm_load_ps(&x[4*i]);
        __m128 y4 = _mm_load_ps(&y[4*i]);
        __m128 s = _mm_add_ps(x4,y4);
        _mm_store_ps(&z[4*i], s);
    }
}

void add_asm1(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%1,%%rax,4), %%xmm0\n"
            "addps    (%2,%%rax,4), %%xmm0\n"
            "movaps   %%xmm0, (%0,%%rax,4)\n"
            :
            : "r" (z), "r" (y), "r" (x), "a" (i)
            :
        );
    }
}

void add_asm2(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   %1, %%xmm0\n"
            "addps    %2, %%xmm0\n"
            "movaps   %%xmm0, %0\n"
            : "=m" (z[i])
            : "m" (y[i]), "m" (x[i])
            :
            );
    }
}

int main(void) {
    float x[N], y[N], z1[N], z2[N], z3[N];
    for(int i=0; i<N; i++) x[i] = 1.0f, y[i] = 2.0f;
    add_intrin2(x,y,z1,N);
    add_asm1(x,y,z2,N);
    add_asm2(x,y,z3,N);
    for(int i=0; i<N; i++) printf("%.0f ", z1[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z2[i]); puts("");
    for(int i=0; i<N; i++) printf("%.0f ", z3[i]); puts("");
}

3 Answers

#1


Avoid inline asm whenever possible: https://gcc.gnu.org/wiki/DontUseInlineAsm. It blocks many optimizations. But if you really can't hand-hold the compiler into making the asm you want, you should probably write your whole loop in asm so you can unroll and tweak it manually, instead of doing stuff like this.


You can use an r constraint for the index. Use the q modifier to get the name of the 64bit register, so you can use it in an addressing mode. When compiled for 32bit targets, the q modifier selects the name of the 32bit register, so the same code still works.

If you want to choose what kind of addressing mode is used, you'll need to do it yourself, using pointer operands with r constraints.

GNU C inline asm syntax doesn't assume that you read or write memory pointed to by pointer operands (e.g. maybe you're using an inline-asm "and" on the pointer value). So you need to do something with either a "memory" clobber or memory input/output operands to let it know what memory you modify. A "memory" clobber is easy, but forces everything except locals to be spilled/reloaded. See the Clobbers section in the docs for an example of using a dummy input operand.


Another huge benefit of an m constraint is that -funroll-loops can work by generating addresses with constant offsets. Doing the addressing ourselves prevents the compiler from doing a single increment every 4 iterations or something, because every source-level value of i needs to appear in a register.
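
With m operands, an unrolled body can reach successive vectors through constant displacements from a single induction variable, roughly like this (a hand-written sketch of the idea, not actual compiler output):

```
movaps  (%rsi,%rax), %xmm0      # iteration i
...
movaps  16(%rsi,%rax), %xmm1    # iteration i+4: constant +16 displacement, no extra add
...
addq    $32, %rax               # one increment covers both unrolled iterations
```

With a register index operand instead, the compiler must materialize each value of i in a register, so every unrolled iteration needs its own increment.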


Here's my version, with some tweaks as noted in comments.

#include <immintrin.h>
void add_asm1_memclobber(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
            : "memory"
          // you can avoid a "memory" clobber with dummy input/output operands
        );
    }
}

Godbolt compiler explorer asm output for this and a couple versions below.

Your version needs to declare %xmm0 as clobbered, or you will have a bad time when this is inlined. My version uses a temporary variable as an output-only operand that's never used. This gives the compiler full freedom for register allocation.

If you want to avoid the "memory" clobber, you can use dummy memory input/output operands like "m" (*(const __m128*)&x[i]) to tell the compiler which memory is read and written by your function. This is necessary to ensure correct code-generation if you did something like x[4] = 1.0; right before running that loop. (And even if you didn't write something that simple, inlining and constant propagation can boil it down to that.) And also to make sure the compiler doesn't read from z[] before the loop runs.

In this case, we get horrible results: gcc5.x actually increments 3 extra pointers because it decides to use [reg] addressing modes instead of indexed. It doesn't know that the inline asm never actually references those memory operands using the addressing mode created by the constraint!

# gcc5.4 with dummy constraints like "=m" (*(__m128*)&z[i]) instead of "memory" clobber
.L11:
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    addq    $16, %r10       #, ivtmp.19
    addq    $16, %r9        #, ivtmp.21
    addq    $16, %r8        #, ivtmp.22
    cmpl    %eax, %ecx      # i, n
    ja      .L11        #,

r8, r9, and r10 are the extra pointers that the inline asm block doesn't use.

You can use a constraint that tells gcc an entire array of arbitrary length is an input or an output: "m" (*(const struct {char a; char x[];} *) pStr) from @David Wohlferd's answer on an asm strlen. Since we want to use indexed addressing modes, we will have the base address of all three arrays in registers, and this form of constraint asks for the base address as an operand, rather than a pointer to the current memory being operated on.

This actually works without any extra counter increments inside the loop:

void add_asm1_dummy_whole_array(const float *restrict x, const float *restrict y,
                             float *restrict z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
            "movaps   (%[y],%q[idx],4), %[vectmp]\n\t"  // q modifier: 64bit version of a GP reg
            "addps    (%[x],%q[idx],4), %[vectmp]\n\t"
            "movaps   %[vectmp], (%[z],%q[idx],4)\n\t"
            : [vectmp] "=x" (vectmp)  // "=m" (z[i])  // gives worse code if the compiler prepares a reg we don't use
             , "=m" (*(struct {float a; float x[];} *) z)
            : [z] "r" (z), [y] "r" (y), [x] "r" (x),
              [idx] "r" (i) // unrolling is impossible this way (without an insn for every increment by 4)
              , "m" (*(const struct {float a; float x[];} *) x),
                "m" (*(const struct {float a; float x[];} *) y)
        );
    }
}

This gives us the same inner loop we got with a "memory" clobber:

.L19:   # with clobbers like "m" (*(const struct {float a; float x[];} *) y)
    movaps   (%rsi,%rax,4), %xmm0   # y, i, vectmp
    addps    (%rdi,%rax,4), %xmm0   # x, i, vectmp
    movaps   %xmm0, (%rdx,%rax,4)   # vectmp, z, i

    addl    $4, %eax        #, i
    cmpl    %eax, %ecx      # i, n
    ja      .L19        #,

It tells the compiler that each asm block reads or writes the entire arrays, so it may unnecessarily stop it from interleaving with other code (e.g. after fully unrolling with low iteration count). It doesn't stop unrolling, but the requirement to have each index value in a register does make it less effective.


A version with m constraints, that gcc can unroll:

#include <immintrin.h>
void add_asm1(float *x, float *y, float *z, unsigned n) {
    __m128 vectmp;  // let the compiler choose a scratch register
    for(int i=0; i<n; i+=4) {
        __asm__ __volatile__ (
           // "movaps   %[yi], %[vectmp]\n\t"
            "addps    %[xi], %[vectmp]\n\t"  // We requested that the %[yi] input be in the same register as the [vectmp] dummy output
            "movaps   %[vectmp], %[zi]\n\t"
          // ugly ugly type-punning casts; __m128 is a may_alias type so it's safe.
            : [vectmp] "=x" (vectmp), [zi] "=m" (*(__m128*)&z[i])
            : [yi] "0"  (*(__m128*)&y[i])  // or [yi] "xm" (*(__m128*)&y[i]), and uncomment the movaps load
            , [xi] "xm" (*(__m128*)&x[i])
            :  // memory clobber not needed
        );
    }
}

Using [yi] as a +x input/output operand would be simpler, but writing it this way makes a smaller change for uncommenting the load in the inline asm, instead of letting the compiler get one value into registers for us.

#2


When I compile your add_asm2 code with gcc (4.9.2) I get:

add_asm2:
.LFB0:
        .cfi_startproc
        xorl        %eax, %eax
        xorl        %r8d, %r8d
        testl       %ecx, %ecx
        je  .L1
        .p2align 4,,10
        .p2align 3
.L5:
#APP
# 3 "add_asm2.c" 1
        movaps   (%rsi,%rax), %xmm0
addps    (%rdi,%rax), %xmm0
movaps   %xmm0, (%rdx,%rax)

# 0 "" 2
#NO_APP
        addl        $4, %r8d
        addq        $16, %rax
        cmpl        %r8d, %ecx
        ja  .L5
.L1:
        rep; ret
        .cfi_endproc

so it is not perfect (it uses a redundant register), but does use indexed loads...

#3


gcc also has built-in vector extensions, which are even cross-platform:

typedef float v4sf __attribute__((vector_size(16)));
void add_vector(float *x, float *y, float *z, unsigned n) {
    for(int i=0; i<n/4; i+=1) {
        *(v4sf*)(z + 4*i) = *(v4sf*)(x + 4*i) + *(v4sf*)(y + 4*i);
    }
}

On my gcc version 4.7.2 the generated assembly is:

.L28:
        movaps  (%rdi,%rax), %xmm0
        addps   (%rsi,%rax), %xmm0
        movaps  %xmm0, (%rdx,%rax)
        addq    $16, %rax
        cmpq    %rcx, %rax
        jne     .L28
