如何在C语言中改进/加速这个频率函数?

How can I improve / speed up this frequent function?

我如何改进/加快这个频繁的功能?

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define M 10 // This is fixed
#define N 8  // This is NOT fixed

// Assumptions: 1. x, a, b and c are all arrays of 10 (M).
//              2. y and z are all matrices of 8 x 10 (N x M).
// Requirement: 1. return the value of ret;
//              2. get all elements of array c
float fnFrequentFunction(const float* x, const float* const* y, const float* const* z,
                         const float* a, const float* b, float *c, int n)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    register const float* yy;
    register const float* zz;
    int i;

    for (i = 0; i < n; i++)  // M == 1, 2, 4, or 8
    {
        sum = 0;
        yy = y[i];
        zz = z[i];

        tmp = x[0] - yy[0]; sum += tmp * tmp * zz[0];
        tmp = x[1] - yy[1]; sum += tmp * tmp * zz[1];
        tmp = x[2] - yy[2]; sum += tmp * tmp * zz[2];
        tmp = x[3] - yy[3]; sum += tmp * tmp * zz[3];
        tmp = x[4] - yy[4]; sum += tmp * tmp * zz[4];
        tmp = x[5] - yy[5]; sum += tmp * tmp * zz[5];
        tmp = x[6] - yy[6]; sum += tmp * tmp * zz[6];
        tmp = x[7] - yy[7]; sum += tmp * tmp * zz[7];
        tmp = x[8] - yy[8]; sum += tmp * tmp * zz[8];
        tmp = x[9] - yy[9]; sum += tmp * tmp * zz[9];

        ret += (c[i] = log(a[i] * b[i]) + sum);
    }

    return ret;
}

// In the main function, all values are just example data.
int main()
{
    float x[M] = {0.001251f, 0.563585f, 0.193304f, 0.808741f, 0.585009f, 0.479873f, 0.350291f, 0.895962f, 0.622840f, 0.746605f};
    float* y[N];
    float* z[N];
    float a[M] = {0.870205f, 0.733879f, 0.711386f, 0.588244f, 0.484176f, 0.852962f, 0.168126f, 0.684286f, 0.072573f, 0.632160f};
    float b[M] = {0.871487f, 0.998108f, 0.798608f, 0.134831f, 0.576281f, 0.410779f, 0.402936f, 0.522935f, 0.623218f, 0.193030f};
    float c[N];

    float t1[M] = {0.864406f, 0.709006f, 0.091433f, 0.995727f, 0.227180f, 0.902585f, 0.659047f, 0.865627f, 0.846767f, 0.514359f};
    float t2[M] = {0.866817f, 0.581347f, 0.175542f, 0.620197f, 0.781823f, 0.778588f, 0.938688f, 0.721610f, 0.940214f, 0.811353f};
    int i, j;

    int n = 10000000;
    long start;

    // Initialize y, z for test example:
    for(i = 0; i < N; ++i)
    {
        y[i] = (float*)malloc(sizeof(float) * M);
        z[i] = (float*)malloc(sizeof(float) * M);

        for(j = 0; j < M; ++j)
        {
            y[i][j] = t1[j] * j;
            z[i][j] = t2[j] * j;
        }
    }


    // Speed test here:
    start = clock();
    while(--n)
        fnFrequentFunction(x, y, z, a, b, c, 8);
    printf("Time used: %ld\n", clock() - start);


    // Output the result here:
    printf("fnFrequentFunction == %f\n", fnFrequentFunction(x, y, z, a, b, c, 8));
    for(j = 0; j < N; ++j)
        printf("  c[%d] == %f\n", j, c[j]);
    printf("\n");


    // Free memory
    for(j = 0; j < N; ++j)
    {
        free(y[j]);
        free(z[j]);
    }

    return 0;
}

Any suggestions are welcome :-)

欢迎提出建议:-)

I feel terrible that I made a big mistake in my function. The above code is the new one. I'm rechecking it now to make sure that is what I need.

我觉得我犯了一个很大的错误。上面的代码是新的。我现在正在重新检查，以确保这是我需要的。

9 个解决方案

#1

put this outside the loop

把这个放到循环之外

sum = 0;

tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

#2

This function is perfectly amenable to SIMD processing. Look into your compiler documentation for the intrinsic functions that correspond to the SSE instructions.
此函数完全适合SIMD处理。在编译器文档中查找与SSE指令对应的内部函数。
You could break up the dependence chain on the sum variable. Instead of a single sum accumulator, use two accumulators sum1 and sum2 alternately - one for even, one for odd indices. Add them up afterwards.
你可以打破对和变量的依赖链。用两个累加器，一个是偶数，一个是奇数。把它们加起来。
The single biggest performance bottleneck here is the log() function. Check if an approximation would be sufficient. The calculation of this could also be vectorized - I believe Intel published a high-performance math library - including vectorized versions of functions like log(). You may like to use this.
这里最大的性能瓶颈是log()函数。检查一个近似是否足够。计算过程也可以是矢量化的——我相信Intel发布了一个高性能的数学库——包括像log()这样的矢量化函数。你可能喜欢用这个。
You are operating on floats here, and log() uses double precision. Use logf() instead. It may (or may not) be faster. It will certainly be no slower.
您在这里操作浮点数，log()使用双精度。使用logf()。它可能(也可能不是)更快。它肯定不会变慢。
If your compiler understands C99, place a restrict qualifier on the pointers which are function arguments. This tells the compiler that those arrays do not overlap, and may help it generate more efficient code.
如果您的编译器理解C99，那么在指向函数参数的指针上放置一个限制限定符。这告诉编译器这些数组不重叠，并可能帮助它生成更有效的代码。
Change the way matrices are kept in memory. Instead of an array of pointers pointing to disjoint memory blocks, use a single array M*N elements in size.
改变矩阵在内存中的保存方式。不要使用指向不相交内存块的指针数组，而是使用一个大小为M*N的数组元素。

So, to put it together, this is how the function should look like. This is portable C99. Using the compiler-specific SIMD intrinsics, this could be made WAAAAY faster.

把它放在一起，这就是函数的样子。这是便携式C99。使用特定于编译器的SIMD intrinsic，可以使其更快。

UPDATE: Note that I changed the way input matrices are defined. A matrix is a single, large array.

更新:注意，我改变了输入矩阵的定义方式。矩阵是一个单独的，大的数组。

float fnFrequentFunction(const float *restrict x, const float *restrict y,
                         const float *restrict z, const float *restrict a,
                         const float *restrict b, float *restrict c, int n)
{
    float ret = 0;
    const float *restrict yy = y; //for readability
    const float *restrict zz = z; // -||-

    for (int i = 0; i < n; i++, yy += M, zz += M)  // n == 1, 2, 4, or 8
    {
        float sum = 0;
        float sum2 = 0;

        for(int j = 0; j < 10; j += 2)
        {
            float tmp  = x[j]   - yy[j];   sum  += tmp  * tmp  * zz[j];
            float tmp2 = x[j+1] - yy[j+1]; sum2 += tmp2 * tmp2 * zz[j+1];
        }
        sum += sum2;

        ret += (c[i] = logf(a[i] * b[i]) + sum);
    }
    return ret;
}

#3

Use memoization to cache the results. This is a time/space trade-off optimization.

使用内存化来缓存结果。这是一个时间/空间权衡优化。

It's really easy to do this in Perl with the memoize package, and probably in many other dynamic languages. In C, you'd need to roll your own.

在Perl中使用memoize包很容易做到这一点，而且很可能在许多其他动态语言中也是如此。在C语言中，你需要自己动手。

Use a wrapper function to make a hash of the arguments and use it to check if the value has already been calculated. If it has, return it. If not, pass through to the original function and cache the returned result.

使用包装器函数对参数进行散列，并使用它检查值是否已被计算。如果有，就把它退回去。如果不是，则传递到原始函数并缓存返回的结果。

Alternatively, you could pre-calculate your lookup table at program startup, or even calculate it once and then persist it, depending on your needs.

或者，您可以在程序启动时预先计算查找表，或者甚至计算它一次，然后根据需要持久化它。

#4

The above suggestions of strength reducing the tmp values out of the loop are correct. I might even consider dropping those 10 lines into a for loop of their own as this may improve code cache efficiency.

上面关于强度减少循环外的tmp值的建议是正确的。我甚至可能考虑将这10行代码放到自己的for循环中，因为这样可以提高代码缓存效率。

Beyond this, you start getting to the point where you want to know what type of processor you are targetting. If it has native SIMD support, an FPU, what kind of cache it uses, etc. Also depending on how many arguments get passed via registers, reducing the parameters by combining into a single struct and pass by reference might get you a small boost. Declaring vars as register may or may not help. Again profiling and examining the assembler output will answer that.

除此之外，您开始想要知道您要的处理器类型。如果它有原生的SIMD支持、一个FPU、它使用的缓存类型等。还取决于有多少参数通过寄存器传递，通过组合到一个结构并通过引用传递来减少参数可能会给您一个小小的提升。宣布vars为寄存器可能有用，也可能没有帮助。再次分析和检查汇编程序输出将回答这个问题。

As sum is known at before the loop, you could get away with adding M * its value in after the loop for a boost. That just leaves the 2 log muls on the inside.

当sum在循环之前被知道时，您可以在循环之后添加M *它的值以获得boost。只剩下2个log项在里面。

If M is always 8 or has some other known pattern, you could do some minor loop unrolling, but the gains there are almost nil against the log calls.

如果M总是8或者有其他已知的模式，你可以做一些小的循环展开，但是对于日志调用，那里的增益几乎为nil。

The only other major thing to look at is log(). How is this implemented? Can you perhaps roll your own faster version through table lookups if your input range is known. Better yet, table the log products if there's enough RAM available.

要查看的另一个主要内容是log()。这是如何实现的?如果您的输入范围是已知的，您是否可以通过表查找来滚动您自己的更快的版本?更好的是，如果有足够的RAM可用，则将日志产品放在表中。

Just a few thoughts.

几个想法。

#5

Do you use compiler optimization?

你使用编译器优化吗?

Register before variables is antiqued for modern compilers. You can even harm the compiler performance if you use them with compiler optimization. For example gcc simple compilation provides:

对现代编译器来说，在变量之前注册。如果您在编译器优化中使用它们，甚至会损害编译器的性能。例如，gcc简单编译提供:

Time used: 8720000

and without register floats:

没有注册浮动:

Time used: 8710000

I know this is not much.

我知道这并不多。

I assume you made all those sums to avoid a for loop because you think that is much slower. It is not. A modern compiler will do that optimization for you too.

我假设你做了所有这些求和来避免一个for循环因为你认为那要慢得多。它不是。现代编译器也会为您进行这种优化。

One big optimization I think is to use a table for log, if you don't mind the memory, that will be faster, use log only when you are out of range.

我认为一个很大的优化是使用一个表来记录，如果你不介意内存，那将会更快，只有当你超出范围时才使用日志。

#6

I wonder if doing it as scaled ints rather than floats might speed it up. I dont know the data ranges so I dont know if this is even possible

我不知道是否按比例放大而不是浮点数来加速它。我不知道数据的范围，所以我不知道这是否可能

#7

In addition to Andrey's answer, you can add some prefetching to the loop:

除了Andrey的回答之外，您还可以在循环中添加一些预取:

float fnFrequentFunction(const float* x, const float* y, const float* z,
                         const float *a, const float *b, float *c, int M)
{
    register float tmp;
    register float sum;
    register float ret = 0;
    int i;
    sum = 0;

    tmp = x[0] - y[0]; sum += tmp * tmp * z[0];
    tmp = x[1] - y[1]; sum += tmp * tmp * z[1];
    tmp = x[2] - y[2]; sum += tmp * tmp * z[2];
    tmp = x[3] - y[3]; sum += tmp * tmp * z[3];
    tmp = x[4] - y[4]; sum += tmp * tmp * z[4];
    tmp = x[5] - y[5]; sum += tmp * tmp * z[5];
    tmp = x[6] - y[6]; sum += tmp * tmp * z[6];
    tmp = x[7] - y[7]; sum += tmp * tmp * z[7];
    tmp = x[8] - y[8]; sum += tmp * tmp * z[8];
    tmp = x[9] - y[9]; sum += tmp * tmp * z[9];

    for (i = 0; i < M; i++)  // M == 1, 2, 4, or 8
    {
        //----------------------------------------
        // Prefetch data into the processor's cache
        //----------------------------------------
        float a_value = a[i];
        float b_value = b[i];
        float c_value = 0.0;

        //----------------------------------------
        // Calculate using prefetched data.
        //----------------------------------------
        c_value = log(a_value * b_value) + sum;
        c[i] = c_value;
        ret += c_value;
    }

    return ret;
}

You could also try unrolling the loop:

你也可以尝试展开循环:

float a_value = 0.0;
float b_value = 0.0;
float c_value = 0.0;
--M;
switch (M)
{
    case 7:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 6:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 5:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 4:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 3:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 2:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 1:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    --M;
    case 0:
        a_value = a[M];
        b_value = b[M];
        c_value = log(a_value * b_value) + sum;
        c[M] = c_value;
    ret += c_value;
    break;
}

Looking at the unrolled version, you could take the " + sum" out of the "loop" and add it in at the end as: ret += (M + 1) * sum; since sum doesn't change.

查看未滚动版本，可以将“+ sum”从“loop”中取出，并在末尾添加为:ret += (M + 1) * sum;因为金额不改变。

Finally, another alternative is to perform all multiplications at once, followed by all log calculations, then sum up everything:

最后，另一种选择是一次执行所有的乘法，然后进行所有的日志计算，然后对所有内容进行总结:

float product[8];
for (i = 0; i < M; ++i)
{
  product[i] = a[i] * b[i];
}
for (i = 0; i < M; ++i)
{
  c[i] = log(product);
  ret += c[i];
}
ret += M * sum;

#8

If you are calling this multiple times when a and b have not changed, then combine a and b into logab where logab[i] = log(a[i] * b[i]) since a and b are not used anywhere else.

如果在a和b未更改时多次调用此函数，则将a和b合并到logab中，其中logab[i] = log(a[i] * b[i])，因为a和b在其他任何地方都不使用。

#9

This appears to be a gaussian mixture model computation. Several years ago, I worked on an effort to optimize this same algorithm which was being used as part of a speech processing program. I investigated a number of optimizations like you're attempting to do but never found anything using straight C to gain more than just a few percent. My biggest gain came from recoding the basic GMM kernel using SIMD instructions. Since that still wasn't providing the performance I was looking for, the next (and final) step was to use an Nvidia GPU. This sort of worked but programming that thing was a headache in itself.

这似乎是一个高斯混合模型的计算。几年前，我致力于优化同样的算法，该算法被用作语音处理程序的一部分。我研究了一些优化，比如您正在尝试做的，但是从来没有发现任何使用直线C来获得超过仅仅百分之几。我最大的收获是使用SIMD指令重新编码基本的GMM内核。由于这仍然不能提供我所期望的性能，下一个(也是最后一个)步骤是使用Nvidia GPU。这种方法很有效，但编程本身就很麻烦。

Sorry I can't be more helpful but I don't think you are going to pick up more than just a nominal amount of speed if you're sticking to a regular CPU.

对不起，我不能提供更多的帮助，但我认为如果你坚持使用常规CPU，你只能获得一个名义上的速度。

#1