How can caching techniques be used to improve performance?

Date: 2022-07-23 02:47:57

Hello, I am trying to run a program that finds the closest pair using brute force with caching techniques, like the PDF here: Caching Performance Stanford


My original code is:


/* requires <float.h> and <math.h>; point is a struct with float x, y */
float compare_points_BF(int N, point *P){
    int i, j;
    float distance = 0, min_dist = FLT_MAX;
    point *p1, *p2;
    for (i = 0; i < N - 1; i++){
        for (j = i + 1; j < N; j++){
            if ((distance = (P[i].x - P[j].x) * (P[i].x - P[j].x) +
                            (P[i].y - P[j].y) * (P[i].y - P[j].y)) < min_dist){
                min_dist = distance;
                p1 = &P[i];
                p2 = &P[j];
            }
        }
    }
    return sqrt(min_dist);
}

This program gives approximately these running times:


      N     8192    16384   32768   65536   131072  262144  524288  1048576
 seconds    0.070   0.280   1.130   5.540   18.080  72.838  295.660 1220.576
            0.080   0.330   1.280   5.190   20.290  80.880  326.460 1318.631

The cache version of the above program is:


float compare_points_BF(register int N, register int B, point *P){
    register int i, j, ib, jb, num_blocks = (N + (B-1)) / B;
    register point *p1, *p2;
    register float distance=0, min_dist=FLT_MAX, regx, regy;

    //break array data in N/B blocks, ib is index for i cached block and jb is index for j strided cached block
    //each i block is compared with the j block, (which j block is always after the i block) 
    for (i = 0; i < num_blocks; i++){
        for (j = i; j < num_blocks; j++){
            //reads the moving frame block to compare with the i cached block
            for (jb = j * B; jb < ( ((j+1)*B) < N ? ((j+1)*B) : N); jb++){
                //avoid float comparisons that occur when i block = j block
                //Register Allocated
                regx = P[jb].x;
                regy = P[jb].y;
                for (i == j ? (ib = jb + 1) : (ib = i * B); ib < ( ((i+1)*B) < N ? ((i+1)*B) : N); ib++){
                    //calculate distance of current points
                    if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                            (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                        min_dist = distance;
                        p1 = &P[ib];
                        p2 = &P[jb];
                    }
                }
            }
        }
    }
    return sqrt(min_dist);
}

and some results:


Block_size = 256        N = 8192        Run time: 0.090 sec
Block_size = 512        N = 8192        Run time: 0.090 sec
Block_size = 1024       N = 8192        Run time: 0.090 sec
Block_size = 2048       N = 8192        Run time: 0.100 sec
Block_size = 4096       N = 8192        Run time: 0.090 sec
Block_size = 8192       N = 8192        Run time: 0.090 sec


Block_size = 256        N = 16384       Run time: 0.357 sec
Block_size = 512        N = 16384       Run time: 0.353 sec
Block_size = 1024       N = 16384       Run time: 0.360 sec
Block_size = 2048       N = 16384       Run time: 0.360 sec
Block_size = 4096       N = 16384       Run time: 0.370 sec
Block_size = 8192       N = 16384       Run time: 0.350 sec
Block_size = 16384      N = 16384       Run time: 0.350 sec

Block_size = 128        N = 32768       Run time: 1.420 sec
Block_size = 256        N = 32768       Run time: 1.420 sec
Block_size = 512        N = 32768       Run time: 1.390 sec
Block_size = 1024       N = 32768       Run time: 1.410 sec
Block_size = 2048       N = 32768       Run time: 1.430 sec
Block_size = 4096       N = 32768       Run time: 1.430 sec
Block_size = 8192       N = 32768       Run time: 1.400 sec
Block_size = 16384      N = 32768       Run time: 1.380 sec

Block_size = 256        N = 65536       Run time: 5.760 sec
Block_size = 512        N = 65536       Run time: 5.790 sec
Block_size = 1024       N = 65536       Run time: 5.720 sec
Block_size = 2048       N = 65536       Run time: 5.720 sec
Block_size = 4096       N = 65536       Run time: 5.720 sec
Block_size = 8192       N = 65536       Run time: 5.530 sec
Block_size = 16384      N = 65536       Run time: 5.550 sec

Block_size = 256        N = 131072      Run time: 22.750 sec
Block_size = 512        N = 131072      Run time: 23.130 sec
Block_size = 1024       N = 131072      Run time: 22.810 sec
Block_size = 2048       N = 131072      Run time: 22.690 sec
Block_size = 4096       N = 131072      Run time: 22.710 sec
Block_size = 8192       N = 131072      Run time: 21.970 sec
Block_size = 16384      N = 131072      Run time: 22.010 sec

Block_size = 256        N = 262144      Run time: 90.220 sec
Block_size = 512        N = 262144      Run time: 92.140 sec
Block_size = 1024       N = 262144      Run time: 91.181 sec
Block_size = 2048       N = 262144      Run time: 90.681 sec
Block_size = 4096       N = 262144      Run time: 90.760 sec
Block_size = 8192       N = 262144      Run time: 87.660 sec
Block_size = 16384      N = 262144      Run time: 87.760 sec

Block_size = 256        N = 524288      Run time: 361.151 sec
Block_size = 512        N = 524288      Run time: 379.521 sec
Block_size = 1024       N = 524288      Run time: 379.801 sec

As you can see, the running time is slower than that of the non-cached code. Is this due to compiler optimization? Is the code bad, or is it just that the algorithm does not perform well with tiling? I am using VS 2010, compiling a 32-bit executable. Thanks in advance!


2 Solutions

#1



This is an interesting case. The compiler did a poor job of loop-invariant hoisting in the two inner loops. Namely, the two inner for-loops check the following conditions in each iteration:


((j+1)*B) < N ? ((j+1)*B) : N

and

((i+1)*B) < N ? ((i+1)*B) : N

The calculation and branching are both expensive, but they are actually loop-invariant for the two inner for-loops. After manually hoisting them out of the two inner for-loops, I was able to get the cache-optimized version to perform better than the unoptimized version (by 10% when N == 524288 and by 30% when N == 1048576).


Here is the modified code (simple change really, look for u1, u2):


//break array data in N/B blocks, ib is index for i cached block and jb is index for j strided cached block
//each i block is compared with the j block, (which j block is always after the i block) 
for (i = 0; i < num_blocks; i++){
    for (j = i; j < num_blocks; j++){
        int u1 =  (((j+1)*B) < N ? ((j+1)*B) : N);
        int u2 =  (((i+1)*B) < N ? ((i+1)*B) : N);
        //reads the moving frame block to compare with the i cached block
        for (jb = j * B; jb < u1 ; jb++){
            //avoid float comparisons that occur when i block = j block
            //Register Allocated
            regx = P[jb].x;
            regy = P[jb].y;
            for (i == j ? (ib = jb + 1) : (ib = i * B); ib < u2; ib++){
                //calculate distance of current points
                if((distance = (P[ib].x - regx) * (P[ib].x - regx) +
                        (P[ib].y - regy) * (P[ib].y - regy)) < min_dist){
                    min_dist = distance;
                    p1 = &P[ib];
                    p2 = &P[jb];
                }
            }
        }
    }
}

#2



Tiling may be an old concept, but it's still very relevant today. In your original piece of code, for each i, you may be able to reuse most of the P[j] elements while they are still cached, but only if the length of the inner loop is small enough for them to fit there. The actual size should be determined by which cache level you want to target for tiling - the L1 would provide the best performance as it's the fastest, but since it's also the smallest you'd need small blocks and the tiling overhead may be too much. The L2 allows bigger tiles but slightly reduced performance, and so on.


Note that you don't need 2-D tiling here; this is not matrix multiplication - you're traversing over the same array. You could simply tile the inner loop, since it's the one overflowing the cache; once you've done that, the outer loop (i) can run all the way to the end against the current block of cached inner-loop elements. There's actually no point in 2-D tiling, since nothing reuses the elements of the outer loop (as opposed to matrix multiplication).


So, assuming a Point is 64 bits large, you can fit 512 such elements of the array safely in your 32k L1, or 4096 elements in your 256k L2. You'll have to miss once for P[i] on each block if i is outside the bounds of the current j block, but that's negligible.


By the way - this explanation may already be obsolete, since a sufficiently good compiler might try to do all this for you. It's quite complicated though, so I'm a bit skeptical that any of the common ones would even try, but it should be easy to prove here that the reordering is safe. One might argue of course that a "sufficiently good compiler" is a paradox, but that's off topic...

