Vectorizing a 2D stencil using SSE intrinsics

I am trying to vectorize a 2D stencil using only aligned loads and stores. To do this, I essentially want to use _mm_load_ps and _mm_shuffle_ps to obtain the desired elements.

My scalar version of the code is:

    void FDTD_base (float *V, float *U, int dx, int dy, float c0, float c1, float c2, float c3, float c4)
    {
        int i, j, k;

        for (j = 4; j < dy-4; j++)
        {
            for (i = 4; i < dx-4; i++)
            {
                U[j*dx+i] = (c0 * (V[j*dx+i]) //center
                        + c1 * (V[j*dx+(i-1)] + V[(j-1)*dx+i] + V[j*dx+(i+1)] + V[(j+1)*dx+i] )
                        + c2 * (V[j*dx+(i-2)] + V[(j-2)*dx+i] + V[j*dx+(i+2)] + V[(j+2)*dx+i] )
                        + c3 * (V[j*dx+(i-3)] + V[(j-3)*dx+i] + V[j*dx+(i+3)] + V[(j+3)*dx+i] )
                        + c4 * (V[j*dx+(i-4)] + V[(j-4)*dx+i] + V[j*dx+(i+4)] + V[(j+4)*dx+i] ));
            }
        }
    }

My vectorized SSE version of the code so far:

    for (j = 4; j < dy-4; j++)
    {
        for (i = 4; i < dx-4; i+=4)
        {
            __m128 b = _mm_load_ps(&V[j*dx+i]);
            center = _mm_mul_ps(b,c0_i);
            a = _mm_load_ps(&V[j*dx+(i-4)]);
            c = _mm_load_ps(&V[j*dx+(i+4)]);

            d = _mm_load_ps(&V[(j-4)*dx+i]);
            e = _mm_load_ps(&V[(j+4)*dx+i]);

            u_i2 = _mm_shuffle_ps(a,b,_MM_SHUFFLE(1,0,3,2));//i-2
            u_i6 = _mm_shuffle_ps(b,c,_MM_SHUFFLE(1,0,3,2));//i+2

            u_i1 = _mm_shuffle_ps(u_i2,b,_MM_SHUFFLE(2,1,2,1));//i-1
            u_i5 = _mm_shuffle_ps(b,u_i6,_MM_SHUFFLE(2,1,2,1));//i+1

            u_i3 = _mm_shuffle_ps(a,u_i2,_MM_SHUFFLE(2,1,2,1));//i-3
            u_i7 = _mm_shuffle_ps(u_i6,c,_MM_SHUFFLE(2,1,2,1));//i+3

            u_i4 = a; //i-4
            u_i8 = c; //i+4

Can someone help me in obtaining the positions of j-1,j+1 .....j-4,j+4.

This does not work:

                    u_j2 = _mm_shuffle_ps(d,b,_MM_SHUFFLE(1,0,3,2));//j-2 (this is incorrect)
                    u_j6 = _mm_shuffle_ps(b,e,_MM_SHUFFLE(1,0,3,2));//j+2

                    u_j1 = _mm_shuffle_ps(u_j2,b,_MM_SHUFFLE(2,1,2,1));//j-1
                    u_j5 = _mm_shuffle_ps(b,u_j6,_MM_SHUFFLE(2,1,2,1));//j+1

                    u_j3 = _mm_shuffle_ps(d,u_j2,_MM_SHUFFLE(2,1,2,1));//j-3
                    u_j7 = _mm_shuffle_ps(u_j6,e,_MM_SHUFFLE(2,1,2,1));//j+3

                    u_j4 = d; //j-4 (this is fine)
                    u_j8 = e; //j+4

I only need help determining how to obtain (j-1)*dx+i, (j+1)*dx+i, ..., (j-4)*dx+i and (j+4)*dx+i without using unaligned loads.

As a potential solution, I thought of adding a displacement of 3*dx to the address loaded into d to obtain (j-1)*dx+i, and subtracting a displacement of 3*dx from the address loaded into e to obtain (j+1)*dx+i. Similarly, adding 2*dx to the address of d would give j-2, and so on. But I don't know how to implement this strategy using SSE intrinsics.

Please help. I am using the Intel icc compiler.

1 Solution

#1

"Can someone help me in obtaining the positions of j-1,j+1 .....j-4,j+4." - these do not require a shuffle; they are already aligned with your SIMD lanes.

    u_j2 = _mm_load_ps(&V[(j-2)*dx+i]);
    u_j6 = _mm_load_ps(&V[(j+2)*dx+i]);
    u_j1 = _mm_load_ps(&V[(j-1)*dx+i]);
    u_j5 = _mm_load_ps(&V[(j+1)*dx+i]);
    // and so forth
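
Putting the pieces together, a minimal sketch of the whole kernel might look like the following. It is not a tuned implementation. The function name fdtd_sse_sketch is made up for illustration, and the code assumes that dx is a multiple of 4 and that V and U are 16-byte aligned; without those assumptions the row loads would not be aligned. The horizontal neighbours use the shuffles from the question, and the vertical neighbours are plain aligned loads from the rows above and below.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_load_ps, _mm_shuffle_ps, ... */

/* Hypothetical name; same arguments as FDTD_base in the question. */
void fdtd_sse_sketch(const float *V, float *U, int dx, int dy,
                     float c0, float c1, float c2, float c3, float c4)
{
    /* Assumptions: these make every load and store below 16-byte aligned. */
    assert(dx % 4 == 0);
    assert((uintptr_t)V % 16 == 0 && (uintptr_t)U % 16 == 0);

    const __m128 k0 = _mm_set1_ps(c0), k1 = _mm_set1_ps(c1),
                 k2 = _mm_set1_ps(c2), k3 = _mm_set1_ps(c3),
                 k4 = _mm_set1_ps(c4);

    for (int j = 4; j < dy - 4; j++) {
        for (int i = 4; i < dx - 4; i += 4) {
            const float *ctr = &V[j * dx + i];

            /* Horizontal: three aligned loads, then the question's shuffles. */
            __m128 a = _mm_load_ps(ctr - 4);                           /* i-4 */
            __m128 b = _mm_load_ps(ctr);                               /* i   */
            __m128 c = _mm_load_ps(ctr + 4);                           /* i+4 */
            __m128 xm2 = _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 3, 2));   /* i-2 */
            __m128 xp2 = _mm_shuffle_ps(b, c, _MM_SHUFFLE(1, 0, 3, 2));   /* i+2 */
            __m128 xm1 = _mm_shuffle_ps(xm2, b, _MM_SHUFFLE(2, 1, 2, 1)); /* i-1 */
            __m128 xp1 = _mm_shuffle_ps(b, xp2, _MM_SHUFFLE(2, 1, 2, 1)); /* i+1 */
            __m128 xm3 = _mm_shuffle_ps(a, xm2, _MM_SHUFFLE(2, 1, 2, 1)); /* i-3 */
            __m128 xp3 = _mm_shuffle_ps(xp2, c, _MM_SHUFFLE(2, 1, 2, 1)); /* i+3 */

            /* Vertical: rows j-k and j+k sit in the same lanes, so no shuffle. */
            __m128 s1 = _mm_add_ps(_mm_add_ps(xm1, xp1),
                _mm_add_ps(_mm_load_ps(ctr - dx), _mm_load_ps(ctr + dx)));
            __m128 s2 = _mm_add_ps(_mm_add_ps(xm2, xp2),
                _mm_add_ps(_mm_load_ps(ctr - 2 * dx), _mm_load_ps(ctr + 2 * dx)));
            __m128 s3 = _mm_add_ps(_mm_add_ps(xm3, xp3),
                _mm_add_ps(_mm_load_ps(ctr - 3 * dx), _mm_load_ps(ctr + 3 * dx)));
            __m128 s4 = _mm_add_ps(_mm_add_ps(a, c),
                _mm_add_ps(_mm_load_ps(ctr - 4 * dx), _mm_load_ps(ctr + 4 * dx)));

            /* Weighted sum of the centre and the four neighbour rings. */
            __m128 r = _mm_mul_ps(k0, b);
            r = _mm_add_ps(r, _mm_mul_ps(k1, s1));
            r = _mm_add_ps(r, _mm_mul_ps(k2, s2));
            r = _mm_add_ps(r, _mm_mul_ps(k3, s3));
            r = _mm_add_ps(r, _mm_mul_ps(k4, s4));
            _mm_store_ps(&U[j * dx + i], r);
        }
    }
}
```

Note that the additions are grouped differently from the scalar loop, so results can differ from the scalar version in the last bits for general floating-point input.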

You definitely cannot obtain these from the variables you have labelled d and e by any possible rearrangement, because the values in d (for example) are V[j-4, i], V[j-4, i+1], V[j-4, i+2], V[j-4, i+3], and you cannot get V[j-2, i] out of that.

Tip: Think in terms of SIMD lanes; that makes it clear that you need to rearrange horizontally but not vertically.

Tip: Consider what happens when the inner loop counter increments (i+=4). What was u_i5 (V[j, i+1..i+4]) in the last iteration is now u_i3 (V[j, i-3..i]) in the current iteration. You are calculating each offset version of the data in your row at least twice. You can probably unroll the loop a few times and avoid doing all the extra work.
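
One way to exploit this reuse, sketched here for the horizontal part of a single row only, is to carry the loaded and shuffled vectors from one iteration to the next, so that each iteration does one new load and three new shuffles. The names shifts and stencil_row_x are hypothetical helpers, not part of the question's code, and the sketch assumes dx is a multiple of 4 (at least 8) and that row and out are 16-byte aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper: given two consecutive aligned vectors lo and hi,
   build the windows that start 1, 2 and 3 elements into lo. */
static inline void shifts(__m128 lo, __m128 hi,
                          __m128 *s1, __m128 *s2, __m128 *s3)
{
    *s2 = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(1, 0, 3, 2));  /* lo2 lo3 hi0 hi1 */
    *s1 = _mm_shuffle_ps(lo, *s2, _MM_SHUFFLE(2, 1, 2, 1)); /* lo1 lo2 lo3 hi0 */
    *s3 = _mm_shuffle_ps(*s2, hi, _MM_SHUFFLE(2, 1, 2, 1)); /* lo3 hi0 hi1 hi2 */
}

/* Hypothetical helper: centre plus horizontal neighbours of one row,
   written to out[4 .. dx-5]. coef[0] is the centre weight and coef[1..4]
   are the weights for distances 1..4. */
void stencil_row_x(const float *row, float *out, int dx, const float coef[5])
{
    /* Assumptions that keep all loads and stores aligned and in bounds. */
    assert(dx % 4 == 0 && dx >= 8);
    assert((uintptr_t)row % 16 == 0 && (uintptr_t)out % 16 == 0);

    const __m128 k0 = _mm_set1_ps(coef[0]), k1 = _mm_set1_ps(coef[1]),
                 k2 = _mm_set1_ps(coef[2]), k3 = _mm_set1_ps(coef[3]),
                 k4 = _mm_set1_ps(coef[4]);

    /* Prime the window for i = 4: a holds i-4, b holds i. */
    __m128 a = _mm_load_ps(row);
    __m128 b = _mm_load_ps(row + 4);
    __m128 m3, m2, m1;              /* windows at i-3, i-2, i-1 */
    shifts(a, b, &m3, &m2, &m1);

    for (int i = 4; i < dx - 4; i += 4) {
        __m128 c = _mm_load_ps(row + i + 4); /* the only new load: i+4 */
        __m128 p1, p2, p3;                   /* windows at i+1, i+2, i+3 */
        shifts(b, c, &p1, &p2, &p3);

        __m128 r = _mm_mul_ps(k0, b);
        r = _mm_add_ps(r, _mm_mul_ps(k1, _mm_add_ps(m1, p1)));
        r = _mm_add_ps(r, _mm_mul_ps(k2, _mm_add_ps(m2, p2)));
        r = _mm_add_ps(r, _mm_mul_ps(k3, _mm_add_ps(m3, p3)));
        r = _mm_add_ps(r, _mm_mul_ps(k4, _mm_add_ps(a, c)));
        _mm_store_ps(out + i, r);

        /* Slide the window: the right-hand data becomes the left-hand data. */
        a = b;
        b = c;
        m3 = p1;
        m2 = p2;
        m1 = p3;
    }
}
```

The vertical terms would still be added with plain aligned loads from the neighbouring rows; they are left out here to keep the sketch focused on the reuse.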

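Tip: Why not use AVX? Use _mm256_permute_ps (and _mm256_permute2f128_ps if needed) to shuffle, together with the corresponding 256-bit load instructions (note that _mm256_load_ps requires 32-byte alignment). It can be almost twice as fast, since the SIMD registers are twice as wide and most AVX instructions cost about the same per instruction as their SSE counterparts on modern CPUs, although the actual gain depends on the CPU and on memory bandwidth.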
