I'm writing a sparse matrix solver using the Gauss-Seidel method. By profiling, I've determined that about half of my program's time is spent inside the solver. The performance-critical part is as follows:
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
    for (size_t x = 1; x < d_nx - 1; ++x) {
        d_x[ic] = d_b[ic]
            - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
            - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
        ++ic; ++iw; ++ie; ++is; ++in;
    }
    ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
All arrays involved are of float type. Actually, they are not plain arrays but objects with an overloaded [] operator, which (I think) should be optimized away, but is defined as follows:
inline float &operator[](size_t i) { return d_cells[i]; }
inline float const &operator[](size_t i) const { return d_cells[i]; }
For d_nx = d_ny = 128, this can be run about 3500 times per second on an Intel i7 920. This means that the inner loop body runs 3500 * 128 * 128 = 57 million times per second. Since only some simple arithmetic is involved, that strikes me as a low number for a 2.66 GHz processor.
Maybe it's not limited by CPU power, but by memory bandwidth? Well, one 128 * 128 float array eats 65 kB, so all 6 arrays should easily fit into the CPU's L3 cache (which is 8 MB). Assuming that nothing is cached in registers, I count 15 memory accesses in the inner loop body. On a 64-bit system this is 120 bytes per iteration, so 57 million * 120 bytes = 6.8 GB/s. The L3 cache runs at 2.66 GHz, so it's the same order of magnitude. My guess is that memory is indeed the bottleneck.
To speed this up, I've attempted the following:
-
Compile with g++ -O3. (Well, I'd been doing this from the beginning.)
-
Parallelizing over 4 cores using OpenMP pragmas. I have to change to the Jacobi algorithm to avoid reads from and writes to the same array. This requires that I do twice as many iterations, leading to a net result of about the same speed.
-
Fiddling with implementation details of the loop body, such as using pointers instead of indices. No effect.
What's the best approach to speed this guy up? Would it help to rewrite the inner body in assembly (I'd have to learn that first)? Should I run this on the GPU instead (which I know how to do, but it's such a hassle)? Any other bright ideas?
(N.B. I do take "no" for an answer, as in: "it can't be done significantly faster, because...")
Update: as requested, here's a full program:
#include <iostream>
#include <cstdlib>
#include <cstring>

using namespace std;

size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d_b[ic]
                - d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
                - d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) {
    for (size_t i = 0; i < iters; ++i) {
        step();
    }
}

void clear(float *a) {
    memset(a, 0, d_nx * d_ny * sizeof(float));
}

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d_b = new float[n]; clear(d_b);
    d_w = new float[n]; clear(d_w);
    d_e = new float[n]; clear(d_e);
    d_s = new float[n]; clear(d_s);
    d_n = new float[n]; clear(d_n);
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the computation from being optimized away
}
I compile and run it as follows:
$ g++ -o gstest -O3 gstest.cpp
$ time ./gstest 8000
0
real 0m1.052s
user 0m1.050s
sys 0m0.010s
(It does 8000 instead of 3500 iterations per second because my "real" program does a lot of other stuff too. But it's representative.)
Update 2: I've been told that uninitialized values may not be representative, because NaN and Inf values may slow things down. The example code now clears the memory first. It makes no difference in execution speed for me, though.
7 Answers
#1
5
Couple of ideas:
-
Use SIMD. You could load 4 floats at a time from each array into a SIMD register (e.g. SSE on Intel, VMX on PowerPC). The disadvantage of this is that some of the d_x values will be "stale" so your convergence rate will suffer (but not as bad as a jacobi iteration); it's hard to say whether the speedup offsets it.
-
Use SOR. It's simple, doesn't add much computation, and can improve your convergence rate quite well, even for a relatively conservative relaxation value (say 1.5).
-
Use conjugate gradient. If this is for the projection step of a fluid simulation (i.e. enforcing non-compressability), you should be able to apply CG and get a much better convergence rate. A good preconditioner helps even more.
-
Use a specialized solver. If the linear system arises from the Poisson equation, you can do even better than conjugate gradient using an FFT-based methods.
If you can explain more about what the system you're trying to solve looks like, I can probably give some more advice on #3 and #4.
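Of these, SOR is the cheapest to try. A minimal sketch on the question's 5-point stencil (the sor_step name, the row stride, and the test relaxation value are illustrative, not taken from the original code; omega = 1 reduces to plain Gauss-Seidel):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One SOR sweep over the interior of an nx-by-ny grid, same stencil as the
// question: the plain Gauss-Seidel update is blended with the previous value
// by the relaxation factor omega (typically 1 < omega < 2 for over-relaxation).
void sor_step(std::size_t nx, std::size_t ny, float omega,
              std::vector<float> &x, const std::vector<float> &b,
              const std::vector<float> &w, const std::vector<float> &e,
              const std::vector<float> &s, const std::vector<float> &n) {
    for (std::size_t j = 1; j < ny - 1; ++j) {
        for (std::size_t i = 1; i < nx - 1; ++i) {
            std::size_t c = j * nx + i;
            float gs = b[c] - w[c] * x[c - 1] - e[c] * x[c + 1]
                            - s[c] * x[c - nx] - n[c] * x[c + nx];
            x[c] = (1.0f - omega) * x[c] + omega * gs;
        }
    }
}
```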
#2
5
I think I've managed to optimize it. Here's the code: create a new project in VC++, add this code, and simply compile under "Release".
#include <iostream>
#include <cstdlib>
#include <cstring>
#define _WIN32_WINNT 0x0400
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <conio.h>
using namespace std;
size_t d_nx = 128, d_ny = 128;
float *d_x, *d_b, *d_w, *d_e, *d_s, *d_n;
void step_original() {
size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
for (size_t y = 1; y < d_ny - 1; ++y) {
for (size_t x = 1; x < d_nx - 1; ++x) {
d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];
++ic; ++iw; ++ie; ++is; ++in;
}
ic += 2; iw += 2; ie += 2; is += 2; in += 2;
}
}
void step_new() {
//size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
float
*d_b_ic,
*d_w_ic,
*d_e_ic,
*d_x_ic,
*d_x_iw,
*d_x_ie,
*d_x_is,
*d_x_in,
*d_n_ic,
*d_s_ic;
d_b_ic = d_b;
d_w_ic = d_w;
d_e_ic = d_e;
d_x_ic = d_x;
d_x_iw = d_x;
d_x_ie = d_x;
d_x_is = d_x;
d_x_in = d_x;
d_n_ic = d_n;
d_s_ic = d_s;
for (size_t y = 1; y < d_ny - 1; ++y)
{
for (size_t x = 1; x < d_nx - 1; ++x)
{
/*d_x[ic] = d_b[ic]
- d_w[ic] * d_x[iw] - d_e[ic] * d_x[ie]
- d_s[ic] * d_x[is] - d_n[ic] * d_x[in];*/
*d_x_ic = *d_b_ic
- *d_w_ic * *d_x_iw - *d_e_ic * *d_x_ie
- *d_s_ic * *d_x_is - *d_n_ic * *d_x_in;
//++ic; ++iw; ++ie; ++is; ++in;
d_b_ic++;
d_w_ic++;
d_e_ic++;
d_x_ic++;
d_x_iw++;
d_x_ie++;
d_x_is++;
d_x_in++;
d_n_ic++;
d_s_ic++;
}
//ic += 2; iw += 2; ie += 2; is += 2; in += 2;
d_b_ic += 2;
d_w_ic += 2;
d_e_ic += 2;
d_x_ic += 2;
d_x_iw += 2;
d_x_ie += 2;
d_x_is += 2;
d_x_in += 2;
d_n_ic += 2;
d_s_ic += 2;
}
}
void solve_original(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_original();
}
}
void solve_new(size_t iters) {
for (size_t i = 0; i < iters; ++i) {
step_new();
}
}
void clear(float *a) {
memset(a, 0, d_nx * d_ny * sizeof(float));
}
int main(int argc, char **argv) {
size_t n = d_nx * d_ny;
d_x = new float[n]; clear(d_x);
d_b = new float[n]; clear(d_b);
d_w = new float[n]; clear(d_w);
d_e = new float[n]; clear(d_e);
d_s = new float[n]; clear(d_s);
d_n = new float[n]; clear(d_n);
if(argc < 3)
printf("app.exe (x)iters (o/n)algo\n");
bool bOriginalStep = (argv[2][0] == 'o');
size_t iters = atoi(argv[1]);
/*printf("Press any key to start!");
_getch();
printf(" Running speed test..\n");*/
__int64 freq, start, end, diff;
if(!::QueryPerformanceFrequency((LARGE_INTEGER*)&freq))
throw "Not supported!";
freq /= 1000000; // microseconds!
{
::QueryPerformanceCounter((LARGE_INTEGER*)&start);
if(bOriginalStep)
solve_original(iters);
else
solve_new(iters);
::QueryPerformanceCounter((LARGE_INTEGER*)&end);
diff = (end - start) / freq;
}
printf("Speed (%s)\t\t: %u\n", (bOriginalStep ? "original" : "new"), diff);
//_getch();
//cout << d_x[0] << endl; // prevent the thing from being optimized away
}
Run it like this:
app.exe 10000 o
app.exe 10000 n
"o" means old code, yours.
"n" is mine, the new one.
My results: Speed (original):
1515028
1523171
1495988
Speed (new):
966012
984110
1006045
Improvement of about 30%.
The logic behind it: you've been using index counters for access and manipulation; I use pointers. While running, set a breakpoint on one of the calculation lines in VC++'s debugger and press F8. You'll get the disassembly window, where you can see the produced opcodes (assembly code).
Anyway, look:
int *x = ...;
x[3] = 123;
This tells the CPU to load the pointer x into a register (say EAX), then add the offset (3 * sizeof(int)) to it, and only then store the value 123.
As you can see, the pointer approach is better because we cut out that address computation: we advance the pointers ourselves, and so can optimize as needed.
I hope this helps.
Sidenote to *.com's staff: great website, I wish I'd heard of it long ago!
#3
3
For one thing, there seems to be a pipelining issue here. The loop reads the value in d_x that has just been written, but apparently it has to wait for that write to complete. Just rearranging the order of the computation, doing something useful while it's waiting, makes it almost twice as fast:
d_x[ic] = d_b[ic]
    - d_e[ic] * d_x[ie]
    - d_s[ic] * d_x[is] - d_n[ic] * d_x[in]
    - d_w[ic] * d_x[iw] /* d_x[iw] has just been written to, process it last */;
It was Eamon Nerbonne who figured this out. Many upvotes to him! I would never have guessed.
#4
2
Poni's answer looks like the right one to me.
Poni的答案看起来对我来说是正确的。
I just want to point out that in this type of problem, you often benefit from memory locality. Right now, the b, w, e, s, n arrays are all at separate locations in memory. If you could not fit the problem in the L3 cache (mostly in L2), that would be bad, and a solution of this sort would be helpful:
size_t d_nx = 128, d_ny = 128;
float *d_x;
struct D { float b, w, e, s, n; };
D *d;

void step() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            d_x[ic] = d[ic].b
                - d[ic].w * d_x[iw] - d[ic].e * d_x[ie]
                - d[ic].s * d_x[is] - d[ic].n * d_x[in];
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}

void solve(size_t iters) { for (size_t i = 0; i < iters; ++i) step(); }

void clear(float *a) { memset(a, 0, d_nx * d_ny * sizeof(float)); }

int main(int argc, char **argv) {
    size_t n = d_nx * d_ny;
    d_x = new float[n]; clear(d_x);
    d = new D[n]; memset(d, 0, n * sizeof(D));
    solve(atoi(argv[1]));
    cout << d_x[0] << endl; // prevent the computation from being optimized away
}
For example, this solution at 1280x1280 is a little less than 2x faster than Poni's solution (13s vs 23s in my test--your original implementation is then 22s), while at 128x128 it's 30% slower (7s vs. 10s--your original is 10s).
(Iterations were scaled up to 80000 for the base case, and 800 for the 100x larger case of 1280x1280.)
#5
1
I think you're right about memory being a bottleneck. It's a pretty simple loop with just some simple arithmetic per iteration. The ic, iw, ie, is, and in indices seem to be on opposite sides of the matrix, so I'm guessing there are a bunch of cache misses there.
#6
1
I'm no expert on the subject, but I've seen that there are several academic papers on improving the cache usage of the Gauss-Seidel method.
Another possible optimization is the use of the red-black variant, where points are updated in two sweeps in a chessboard-like pattern. In this way, all updates in a sweep are independent and can be parallelized.
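The red-black idea can be sketched on the question's 5-point stencil (the red_black_step name and grid layout are illustrative; within each half-sweep no updated cell reads another cell updated in the same half-sweep, so each half-sweep could be parallelized, e.g. with OpenMP):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Red-black Gauss-Seidel sweep over an nx-by-ny grid: first update all
// interior cells where (i + j) is even ("red"), then those where it's
// odd ("black"). Each color's updates only read cells of the other color.
void red_black_step(std::size_t nx, std::size_t ny,
                    std::vector<float> &x, const std::vector<float> &b,
                    const std::vector<float> &w, const std::vector<float> &e,
                    const std::vector<float> &s, const std::vector<float> &n) {
    for (std::size_t color = 0; color < 2; ++color) {
        for (std::size_t j = 1; j < ny - 1; ++j) {
            for (std::size_t i = 1; i < nx - 1; ++i) {
                if (((i + j) & 1) != color) continue;  // skip the other color
                std::size_t c = j * nx + i;
                x[c] = b[c] - w[c] * x[c - 1] - e[c] * x[c + 1]
                            - s[c] * x[c - nx] - n[c] * x[c + nx];
            }
        }
    }
}
```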
#7
0
I suggest putting in some prefetch statements and also researching "data oriented design":
void step_prefetched() {
    size_t ic = d_ny + 1, iw = d_ny, ie = d_ny + 2, is = 1, in = 2 * d_ny + 1;
    float db_ic, dw_ic, de_ic, ds_ic, dn_ic;
    float dx_iw, dx_ie, dx_is, dx_in;
    for (size_t y = 1; y < d_ny - 1; ++y) {
        for (size_t x = 1; x < d_nx - 1; ++x) {
            // Load the values into locals first.
            // Sorting these statements by array may increase speed,
            // although sorting by index name may increase speed too.
            db_ic = d_b[ic];
            dw_ic = d_w[ic];
            dx_iw = d_x[iw];
            de_ic = d_e[ic];
            dx_ie = d_x[ie];
            ds_ic = d_s[ic];
            dx_is = d_x[is];
            dn_ic = d_n[ic];
            dx_in = d_x[in];
            // Calculate
            d_x[ic] = db_ic
                - dw_ic * dx_iw - de_ic * dx_ie
                - ds_ic * dx_is - dn_ic * dx_in;
            ++ic; ++iw; ++ie; ++is; ++in;
        }
        ic += 2; iw += 2; ie += 2; is += 2; in += 2;
    }
}
This differs from your second method since the values are copied to local temporary variables before the calculation is performed.