SSE有双打,不值得吗?

时间:2022-06-12 04:09:45

I read a bit about using SSE intrinsics and tried my luck with implementing quaternion rotation with doubles. Below are the normal and SSE functions I wrote,

我读了一些关于使用SSE内在函数的内容,并试着用双执行实现四元数旋转。以下是我写的正常和SSE函数,


void quat_rot(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////
    REAL cross_temp[3],result[3];

    cross_temp[0]=a.el[2]*b[2]-a.el[3]*b[1]+a.el[0]*b[0];
    cross_temp[1]=a.el[3]*b[0]-a.el[1]*b[2]+a.el[0]*b[1];
    cross_temp[2]=a.el[1]*b[1]-a.el[2]*b[0]+a.el[0]*b[2];

    result[0]=b[0]+2.0*(a.el[2]*cross_temp[2]-a.el[3]*cross_temp[1]);
    result[1]=b[1]+2.0*(a.el[3]*cross_temp[0]-a.el[1]*cross_temp[2]);
    result[2]=b[2]+2.0*(a.el[1]*cross_temp[1]-a.el[2]*cross_temp[0]);

    b[0]=result[0];
    b[1]=result[1];
    b[2]=result[2];
}

With SSE


inline void cross_p(__m128d *a, __m128d *b, __m128d *c){
    const __m128d SIGN_NP = _mm_set_pd(0.0, -0.0);


    __m128d l1 = _mm_mul_pd( _mm_unpacklo_pd(a[1], a[1]), b[0] );

    __m128d l2 = _mm_mul_pd( _mm_unpacklo_pd(b[1], b[1]), a[0] );
    __m128d m1 = _mm_sub_pd(l1, l2);
            m1 = _mm_shuffle_pd(m1, m1, 1);
            m1 = _mm_xor_pd(m1, SIGN_NP);

            l1 = _mm_mul_pd( a[0], _mm_shuffle_pd(b[0], b[0], 1) );

    __m128d m2 = _mm_sub_sd(l1, _mm_unpackhi_pd(l1, l1));

    c[0] = m1;
    c[1] = m2;
}

void quat_rotSSE(quat_t a, REAL* restrict b){
    ///////////////////////////////////////////
    // Multiply vector b by quaternion a     //
    ///////////////////////////////////////////

    __m128d axb[2];
    __m128d aa[2];
            aa[0] = _mm_load_pd(a.el+1);
            aa[1] = _mm_load_sd(a.el+3);
    __m128d bb[2];
            bb[0] = _mm_load_pd(b);
            bb[1] = _mm_load_sd(b+2);

    cross_p(aa, bb, axb);

    __m128d w = _mm_set1_pd(a.el[0]);

    axb[0] = _mm_add_pd(axb[0], _mm_mul_pd(w, bb[0]));
    axb[1] = _mm_add_sd(axb[1], _mm_mul_sd(w, bb[1]));

    cross_p(aa, axb, axb);

    _mm_store_pd(b, _mm_add_pd(bb[0], _mm_add_pd(axb[0], axb[0])));
    _mm_store_sd(b+2, _mm_add_pd(bb[1], _mm_add_sd(axb[1], axb[1])));
}

The rotation is basically done using the function,

旋转基本上是使用函数完成的,

SSE有双打,不值得吗?

I then run the following test to check how much time each function takes to do a set of rotations,

然后我运行以下测试来检查每个函数执行一组旋转所花费的时间,


int main(int argc, char *argv[]){

    REAL a[] __attribute__ ((aligned(16))) = {0.2, 1.3, 2.6};
    quat_t q = {{0.1, 0.7, -0.3, -3.2}};

    REAL sum = 0.0;
    for(int i = 0; i < 4; i++) sum += q.el[i] * q.el[i];
    sum = sqrt(sum);
    for(int i = 0; i < 4; i++) q.el[i] /= sum;

    int N = 1000000000;

    for(int i = 0; i < N; i++){
        quat_rotSSE(q, a);
    }

    printf("rot = ");
    for(int i = 0; i < 3; i++) printf("%f, ", a[i]);
    printf("\n");

    return 0;
}

I compiled using gcc 4.6.3 with -O3 -std=c99 -msse3.

我使用gcc 4.6.3编译了-O3 -std = c99 -msse3。

The timings for the normal function, using the unix time, was 18.841s and 21.689s for the SSE one.

使用unix时间的正常功能的时间是SSE 1的18.841s和21.689s。

Am I missing something, why is the SSE implementation 15% slower than the normal one? In which cases would an SSE implementation be faster for double precision?

我错过了什么,为什么SSE实施比正常实施慢15%?在哪种情况下,SSE实现对双精度更快?


EDIT: Taking advice from the comments, I tried several things,

编辑:从评论中获取建议,我尝试了几件事,

  • -O1 option gives very similar results.
  • -O1选项提供非常相似的结果。

  • Tried using restrict on the cross_p function and added an __m128d to hold the second cross product. This had no difference in the assembly produced.
  • 尝试在cross_p函数上使用restrict并添加__m128d来保存第二个交叉产品。这与所生产的组件没有区别。

  • The assembly produced for the normal function, from what I understand, only contains scalar instructions except of some movapd.
  • 根据我的理解,为正常函数生成的程序集只包含除某些movapd之外的标量指令。

The assembly code generated for the SSE function is only 4 lines less than the normal one.

为SSE功能生成的汇编代码仅比正常函数少4行。


EDIT: Added links to the assembly generated,

编辑:添加了生成的程序集的链接,

quat_rot

quat_rotSSE

2 个解决方案

#1


10  

SSE (and SIMD in general) works really well when you're performing the same operations on a large number of elements, where there's no dependencies between operations. For example, if you had an array of double and needed to do array[i] = (array[i] * K + L)/M + N; for each element then SSE/SIMD would help.

当您在大量元素上执行相同的操作时,SSE(以及一般的SIMD)非常有效,其中操作之间没有依赖关系。例如,如果你有一个double数组并且需要做array [i] =(array [i] * K + L)/ M + N;对于每个元素,然后SSE / SIMD会有所帮助。

If you're not performing the same operations on a large number of elements, then SSE doesn't help. For example, if you had one double and needed to do foo = (foo * K + L)/M + N; then SSE/SIMD isn't going to help.

如果您没有对大量元素执行相同的操作,那么SSE没有帮​​助。例如,如果你有一个double并且需要做foo =(foo * K + L)/ M + N;然后SSE / SIMD无济于事。

Basically, SSE is the wrong tool for the job. You need to change the job into something where SSE is the right tool. For example, rather than multiplying one vector by one quaternion; try multiplying an array of 1000 vectors by a quaternion, or perhaps multiplying an array of 1000 vectors by an array of 1000 quaternions.

基本上,SSE是这项工作的错误工具。您需要将作业更改为SSE是正确工具的内容。例如,而不是将一个向量乘以一个四元数;尝试将一个1000个向量的数组乘以四元数,或者将1000个向量的数组乘以1000个四元数的数组。

EDIT: Added everything below here!

编辑:在这里添加了所有内容!

Note that this typically means modifying data structures to suit. For example, rather than having an array of structures it's often better to have a structure of arrays.

请注意,这通常意味着修改数据结构以适应。例如,不是拥有一个结构数组,而是拥有一个数组结构通常会更好。

For a better example, imagine your code uses an array of quaternions, like this:

更好的例子,假设你的代码使用了一个四元数组,如下所示:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
   cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
   cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];

   b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
   b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
   b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
}

The first step would be to transform it into a quaternion of arrays, and do this:

第一步是将其转换为数组的四元数,并执行以下操作:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
}

Then, because 2 adjacent doubles fit into a single SSE register you want to unroll the loop by 2:

然后,因为2个相邻的双精度数适合单个SSE寄存器,所以要将循环展开2:

for(i = 0; i < quaternionCount; i += 2) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
   cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
   b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
}

Now, you want to break this down into individual operations. For example, the first 2 lines of the inner loop would become:

现在,您想将其分解为单独的操作。例如,内部循环的前2行将变为:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Now re-order that:

现在重新订购:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];

   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];

   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Once you've done all of this, think about converting to SSE. The first 2 lines of code is one load (that loads both a[2][i] and a[2][i+1] into an SSE register) followed by one multiplication (and not 2 separate loads and 2 separate multiplications). Those 6 lines might become (pseudo-code):

完成所有这些后,请考虑转换为SSE。前2行代码是一个负载(将[2] [i]和[2] [i + 1]加载到SSE寄存器中),然后进行一次乘法(而不是2次单独的加载和2次单独的乘法) 。这6行可能成为(伪代码):

   load SSE_register1 with both a[2][i] and a[2][i+1]
   multiply SSE_register1 with both b[2][i] and b[2][i+1]

   load SSE_register2 with both a[3][i] and a[3][i+1]
   multiply SSE_register2 with both b[1][i] and b[1][i+1]

   load SSE_register2 with both a[0][i] and a[0][i+1]
   multiply SSE_register2 with both b[0][i] and b[0][i+1]

   SE_register1 = SE_register1 - SE_register2
   SE_register1 = SE_register1 + SE_register3

Each line of pseudo-code here is a single SSE instruction/intrinsic; and every SSE instruction/intrinsic is doing 2 operations in parallel.

这里的每一行伪代码都是单个SSE指令/内在函数;并且每个SSE指令/内部函数并行执行2个操作。

If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.

如果每个指令并行执行2个操作,那么(理论上)它可以是原始“每个指令一个操作”代码的两倍。

#2


1  

Some ideas that would perhaps enable full optimization of your code.

一些想法可能会完全优化您的代码。

  • your functions should be inlined
  • 你的功能应该内联

  • you should add restrict specifications to cross_p to avoid the compiler reloading of parameters several times
  • 您应该将限制规范添加到cross_p,以避免编译器多次重新加载参数

  • if you do that you'd have to introduce a fourth __m128d variable that will receive the result of the second call to cross_p
  • 如果你这样做,你必须引入第四个__m128d变量,它将接收第二次调用cross_p的结果

Then look into the assembler (gcc option -S) to see what is produced by all of this.

然后查看汇编程序(gcc选项-S)以查看所有这些内容生成的内容。

#1


10  

SSE (and SIMD in general) works really well when you're performing the same operations on a large number of elements, where there's no dependencies between operations. For example, if you had an array of double and needed to do array[i] = (array[i] * K + L)/M + N; for each element then SSE/SIMD would help.

当您在大量元素上执行相同的操作时,SSE(以及一般的SIMD)非常有效,其中操作之间没有依赖关系。例如,如果你有一个double数组并且需要做array [i] =(array [i] * K + L)/ M + N;对于每个元素,然后SSE / SIMD会有所帮助。

If you're not performing the same operations on a large number of elements, then SSE doesn't help. For example, if you had one double and needed to do foo = (foo * K + L)/M + N; then SSE/SIMD isn't going to help.

如果您没有对大量元素执行相同的操作,那么SSE没有帮​​助。例如,如果你有一个double并且需要做foo =(foo * K + L)/ M + N;然后SSE / SIMD无济于事。

Basically, SSE is the wrong tool for the job. You need to change the job into something where SSE is the right tool. For example, rather than multiplying one vector by one quaternion; try multiplying an array of 1000 vectors by a quaternion, or perhaps multiplying an array of 1000 vectors by an array of 1000 quaternions.

基本上,SSE是这项工作的错误工具。您需要将作业更改为SSE是正确工具的内容。例如,而不是将一个向量乘以一个四元数;尝试将一个1000个向量的数组乘以四元数,或者将1000个向量的数组乘以1000个四元数的数组。

EDIT: Added everything below here!

编辑:在这里添加了所有内容!

Note that this typically means modifying data structures to suit. For example, rather than having an array of structures it's often better to have a structure of arrays.

请注意,这通常意味着修改数据结构以适应。例如,不是拥有一个结构数组,而是拥有一个数组结构通常会更好。

For a better example, imagine your code uses an array of quaternions, like this:

更好的例子,假设你的代码使用了一个四元数组,如下所示:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[i][0] = a[i][2] * b[i][2] - a[i][3] * b[i][1] + a[i][0] * b[i][0];
   cross_temp[i][1] = a[i][3] * b[i][0] - a[i][1] * b[i][2] + a[i][0] * b[i][1];
   cross_temp[i][2] = a[i][1] * b[i][1] - a[i][2] * b[i][0] + a[i][0] * b[i][2];

   b[i][0] = b[i][0] + 2.0 * (a[i][2] * cross_temp[i][2] - a[i][3] * cross_temp[i][1]);
   b[i][1] = b[i][1] + 2.0 * (a[i][3] * cross_temp[i][0] - a[i][1] * cross_temp[i][2]);
   b[i][2] = b[i][2] + 2.0 * (a[i][1] * cross_temp[i][1] - a[i][2] * cross_temp[i][0]);
}

The first step would be to transform it into a quaternion of arrays, and do this:

第一步是将其转换为数组的四元数,并执行以下操作:

for(i = 0; i < quaternionCount; i++) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
}

Then, because 2 adjacent doubles fit into a single SSE register you want to unroll the loop by 2:

然后,因为2个相邻的双精度数适合单个SSE寄存器,所以要将循环展开2:

for(i = 0; i < quaternionCount; i += 2) {
   cross_temp[0][i] = a[2][i] * b[2][i] - a[3][i] * b[1][i] + a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1] - a[3][i+1] * b[1][i+1] + a[0][i+1] * b[0][i+1];
   cross_temp[1][i] = a[3][i] * b[0][i] - a[1][i] * b[2][i] + a[0][i] * b[1][i];
   cross_temp[1][i+1] = a[3][i+1] * b[0][i+1] - a[1][i+1] * b[2][i+1] + a[0][i+1] * b[1][i+1];
   cross_temp[2][i] = a[1][i] * b[1][i] - a[2][i] * b[0][i] + a[0][i] * b[2][i];
   cross_temp[2][i+1] = a[1][i+1] * b[1][i+1] - a[2][i+1] * b[0][i+1] + a[0][i+1] * b[2][i+1];

   b[0][i] = b[0][i] + 2.0 * (a[2][i] * cross_temp[2][i] - a[3][i] * cross_temp[1][i]);
   b[0][i+1] = b[0][i+1] + 2.0 * (a[2][i+1] * cross_temp[2][i+1] - a[3][i+1] * cross_temp[1][i+1]);
   b[1][i] = b[1][i] + 2.0 * (a[3][i] * cross_temp[0][i] - a[1][i] * cross_temp[2][i]);
   b[1][i+1] = b[1][i+1] + 2.0 * (a[3][i+1] * cross_temp[0][i+1] - a[1][i+1] * cross_temp[2][i+1]);
   b[2][i] = b[2][i] + 2.0 * (a[1][i] * cross_temp[1][i] - a[2][i] * cross_temp[0][i]);
   b[2][i+1] = b[2][i+1] + 2.0 * (a[1][i+1] * cross_temp[1][i+1] - a[2][i+1] * cross_temp[0][i+1]);
}

Now, you want to break this down into individual operations. For example, the first 2 lines of the inner loop would become:

现在,您想将其分解为单独的操作。例如,内部循环的前2行将变为:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Now re-order that:

现在重新订购:

   cross_temp[0][i] = a[2][i] * b[2][i];
   cross_temp[0][i+1] = a[2][i+1] * b[2][i+1];

   cross_temp[0][i] -= a[3][i] * b[1][i];
   cross_temp[0][i+1] -= a[3][i+1] * b[1][i+1];

   cross_temp[0][i] += a[0][i] * b[0][i];
   cross_temp[0][i+1] += a[0][i+1] * b[0][i+1];

Once you've done all of this, think about converting to SSE. The first 2 lines of code is one load (that loads both a[2][i] and a[2][i+1] into an SSE register) followed by one multiplication (and not 2 separate loads and 2 separate multiplications). Those 6 lines might become (pseudo-code):

完成所有这些后,请考虑转换为SSE。前2行代码是一个负载(将[2] [i]和[2] [i + 1]加载到SSE寄存器中),然后进行一次乘法(而不是2次单独的加载和2次单独的乘法) 。这6行可能成为(伪代码):

   load SSE_register1 with both a[2][i] and a[2][i+1]
   multiply SSE_register1 with both b[2][i] and b[2][i+1]

   load SSE_register2 with both a[3][i] and a[3][i+1]
   multiply SSE_register2 with both b[1][i] and b[1][i+1]

   load SSE_register2 with both a[0][i] and a[0][i+1]
   multiply SSE_register2 with both b[0][i] and b[0][i+1]

   SE_register1 = SE_register1 - SE_register2
   SE_register1 = SE_register1 + SE_register3

Each line of pseudo-code here is a single SSE instruction/intrinsic; and every SSE instruction/intrinsic is doing 2 operations in parallel.

这里的每一行伪代码都是单个SSE指令/内在函数;并且每个SSE指令/内部函数并行执行2个操作。

If every instruction does 2 operations in parallel, then (in theory) it could be up to twice as fast as the original "one operation per instruction" code.

如果每个指令并行执行2个操作,那么(理论上)它可以是原始“每个指令一个操作”代码的两倍。

#2


1  

Some ideas that would perhaps enable full optimization of your code.

一些想法可能会完全优化您的代码。

  • your functions should be inlined
  • 你的功能应该内联

  • you should add restrict specifications to cross_p to avoid the compiler reloading of parameters several times
  • 您应该将限制规范添加到cross_p,以避免编译器多次重新加载参数

  • if you do that you'd have to introduce a fourth __m128d variable that will receive the result of the second call to cross_p
  • 如果你这样做,你必须引入第四个__m128d变量,它将接收第二次调用cross_p的结果

Then look into the assembler (gcc option -S) to see what is produced by all of this.

然后查看汇编程序(gcc选项-S)以查看所有这些内容生成的内容。