当构建“循环并行化”时,为什么CPU不会超过25%

时间:2021-09-24 13:54:24

Was testing and timing some computations (was trying to find a for loop that runs 4 times faster when paralleled with all 4 threads on my processor) when I noticed that this one won't run at 100% cpu usage despite the compiler reporting that it was parallelized. It would only run at 25% cpu usage. Each core on my processor was supposed to have its own copy of arr4, a C style array allocated on the stack, and each core is supposed to modify each value of that stack array repeatedly. At the end, a timer prints the time taken in seconds. If the time with parallelization takes 40 seconds, I want the time of the for loop without parallelization to take just under 4*40 seconds, or 160 seconds. Optimization is set to max speed and the stack size on physical memory is set to 800 million Bytes (to prevent stack overflow). Anyway, here is the test code below...

正在测试和计时一些计算(试图找到一个比我的处理器上的所有4个线程并行运行快4倍的for循环)当我发现尽管编译器报告它,它不会以100%cpu使用率运行是平行的。它只能以25%的CPU使用率运行。我的处理器上的每个核心都应该有自己的arr4副本,这是一个在堆栈上分配的C样式数组,每个核心应该重复修改该堆栈数组的每个值。最后,计时器打印以秒为单位的时间。如果并行化的时间需要40秒,我希望没有并行化的for循环的时间不到4 * 40秒或160秒。优化设置为最大速度,物理内存上的堆栈大小设置为8亿字节(以防止堆栈溢出)。无论如何,这是下面的测试代码......

#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <malloc.h>

int main (void)
{
    clock_t begin, end;
    double time_spent;

    begin = clock();
    {
        //int j;

        #pragma loop(hint_parallel(4))
        #pragma loop(ivdep)
        for (int j=0; j < 8; ++j)
        {
            int * __restrict const arr4 = (int *) _alloca(16000000*sizeof(int));

            for (int z = 0; z < 16000000; ++z)
            {
                arr4[z] = z;
            }

            #pragma loop(no_vector)
            for (int i = 0; i < 16000000; ++i)
            {
                for (int k = 0; k < 160; ++k)
                {
                    arr4[i] -= (7 - arr4[i] * 6 % (i+77) + 5 * 4 / 3 + 3 % 2 + 1 - (i+7));
                    arr4[i] += ((77 - 2 - (i+9)/2 + arr4[i]));
                    arr4[i] *= (8 - 2 + 6 % 3 / 2 + (i+6));
                }
            }
            printf(" %i ", arr4[((j+1)*666)%16]);
        }
    }
    end = clock();
    time_spent = (double)(end - begin) / ((double)CLOCKS_PER_SEC);
    printf("Test1: time as a floating point type is %f \n", time_spent);
    return 0;
}

This revised example also yields the same 25% CPU problem.

此修订示例也产生相同的25%CPU问题。

#ifdef _MSC_VER
#define _CRT_SECURE_NO_WARNINGS
#endif

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <malloc.h>

int main (void)
{
clock_t begin, end;
double time_spent;

begin = clock();
int * __restrict const arr4 = (int *) _alloca(16000000*sizeof(int));

#pragma loop(hint_parallel(4))
#pragma loop(ivdep)
for (int j=0; j < 8; ++j)
{
    for (int i = 0; i < 16000000; ++i)
    {
       int v = i;    // eliminate initialization pass (z loop)
       for (int k = 0; k < 160; ++k)
       {
           v -= (7 - v * 6 % (i+77) + 5 * 4 / 3 + 3 % 2 + 1 - (i+7));
           v += ((77 - 2 - (i+9)/2 + v));
           v *= (8 - 2 + 6 % 3 / 2 + (i+6));
       }
       arr4[i] = v;
    }
    //printf(" %i ", arr4[((j+1)*666)%16]);
}

end = clock();
//time_spent = (double)(end - begin) / ((double)CLOCKS_PER_SEC);
time_spent = (double)(end - begin);
printf(" %i ", arr4[666]);
printf("Test1: time as a floating point type is %f \n", time_spent);
return 0;
}

1 个解决方案

#1


4  

First of all, you should not expect linear speed improvements as you add processors. Doubling the number of cores available typically only improves execution by about 1.8 times under ideal conditions.

首先,您不应期望在添加处理器时提高线性速度。在理想条件下,将可用内核数量加倍通常只能将执行速度提高约1.8倍。

Think of it in people terms: does doubling a 10 person dev team to 20 people automatically allow you to get twice as much work done? No, because communication and coordination become a larger task as the number of participants increases.

从人们的角度来看:将10人开发团队加倍到20人会自动让你完成两倍的工作吗?不,因为随着参与者数量的增加,沟通和协调成为一项更大的任务。

Second, you have a lot of non-computational stuff going on inside your timer loop. You've got memory allocation and printf in your outer loop, and you've got multiple memory reads and writes in your inner loop. In particular, you are reading from a memory address, writing to it, reading from it again, etc. which may nullify some register variable compiler optimizations.

其次,你的定时器循环中有很多非计算的东西。你的外部循环中有内存分配和printf,你的内循环中有多个内存读写。特别是,您正在读取存储器地址,写入它,再次读取它等等,这可能会使某些寄存器变量编译器优化无效。

It could well be that your cpu is spending much of its time waiting for memory reads and writes to complete.

很可能你的cpu花了很多时间等待内存读写完成。

Since your modification of the data in the array does not appear to be visible to external observers, you should consider pulling the arr4[i] value into a local int variable and perform all the operations on that local int variable, then write the local int variable back to the arr4[i] memory address. This should reduce your memory load from 5 reads, 3 writes to 1 read, 1 write per inner loop iteration and eliminate costly read-after-write pipeline stalls.

由于您对数组中数据的修改似乎对外部观察者不可见,因此您应该考虑将arr4 [i]值拉入本地int变量并对该局部int变量执行所有操作,然后编写本地int变量回到arr4 [i]内存地址。这样可以减少5次读取的内存负载,3次写入1次读取,1次内部循环迭代写入,并消除代价高昂的读写后流水线停顿。

Since these memory writes are taking place inside of the k loop, moving the initial load and final store out of the k loop will reduce your memory load from (5+3)*160 = 1280 memory I/Os per iteration of the i loop to 2 memory I/Os per iteration of the i loop. Oh, and the entire initialization loop (the z loop) can be eliminated as well, since the initial value is the iteration count. So, we can reduce memory I/Os to 1 per i iteration.

由于这些存储器写入发生在k循环内部,因此将初始加载和最终存储移出k循环将减少每次迭代i循环的(5 + 3)* 160 = 1280内存I / O的内存负载每次循环i循环到2个内存I / O.哦,整个初始化循环(z循环)也可以被消除,因为初始值是迭代计数。因此,我们可以将每次迭代的内存I / O减少到1。

Something like this:

像这样的东西:

for (int j=0; j < 8; ++j)
{
    int * __restrict const arr4 = (int *) _alloca(16000000*sizeof(int));

    for (int i = 0; i < 16000000; ++i)
    {
       int v = i;    // eliminate initialization pass (z loop)
       for (int k = 0; k < 160; ++k)
       {
           v -= (7 - v * 6 % (i+77) + 5 * 4 / 3 + 3 % 2 + 1 - (i+7));
           v += ((77 - 2 - (i+9)/2 + v));
           v *= (8 - 2 + 6 % 3 / 2 + (i+6));
       }
       arr4[i] = v;
    }
    printf(" %i ", arr4[((j+1)*666)%16]);
}

The compiler cannot always make this optimization because memory writes are usually considered sacred because they can be observed by unknown parties outside of the current context. If you know more than the compiler about the situation, you can write better code than the compiler can.

编译器不能总是进行这种优化,因为内存写入通常被认为是神圣的,因为它们可以被当前上下文之外的未知方观察到。如果您不仅仅了解编译器的情况,那么您可以编写比编译器更好的代码。

#1


4  

First of all, you should not expect linear speed improvements as you add processors. Doubling the number of cores available typically only improves execution by about 1.8 times under ideal conditions.

首先,您不应期望在添加处理器时提高线性速度。在理想条件下,将可用内核数量加倍通常只能将执行速度提高约1.8倍。

Think of it in people terms: does doubling a 10 person dev team to 20 people automatically allow you to get twice as much work done? No, because communication and coordination become a larger task as the number of participants increases.

从人们的角度来看:将10人开发团队加倍到20人会自动让你完成两倍的工作吗?不,因为随着参与者数量的增加,沟通和协调成为一项更大的任务。

Second, you have a lot of non-computational stuff going on inside your timer loop. You've got memory allocation and printf in your outer loop, and you've got multiple memory reads and writes in your inner loop. In particular, you are reading from a memory address, writing to it, reading from it again, etc. which may nullify some register variable compiler optimizations.

其次,你的定时器循环中有很多非计算的东西。你的外部循环中有内存分配和printf,你的内循环中有多个内存读写。特别是,您正在读取存储器地址,写入它,再次读取它等等,这可能会使某些寄存器变量编译器优化无效。

It could well be that your cpu is spending much of its time waiting for memory reads and writes to complete.

很可能你的cpu花了很多时间等待内存读写完成。

Since your modification of the data in the array does not appear to be visible to external observers, you should consider pulling the arr4[i] value into a local int variable and perform all the operations on that local int variable, then write the local int variable back to the arr4[i] memory address. This should reduce your memory load from 5 reads, 3 writes to 1 read, 1 write per inner loop iteration and eliminate costly read-after-write pipeline stalls.

由于您对数组中数据的修改似乎对外部观察者不可见,因此您应该考虑将arr4 [i]值拉入本地int变量并对该局部int变量执行所有操作,然后编写本地int变量回到arr4 [i]内存地址。这样可以减少5次读取的内存负载,3次写入1次读取,1次内部循环迭代写入,并消除代价高昂的读写后流水线停顿。

Since these memory writes are taking place inside of the k loop, moving the initial load and final store out of the k loop will reduce your memory load from (5+3)*160 = 1280 memory I/Os per iteration of the i loop to 2 memory I/Os per iteration of the i loop. Oh, and the entire initialization loop (the z loop) can be eliminated as well, since the initial value is the iteration count. So, we can reduce memory I/Os to 1 per i iteration.

由于这些存储器写入发生在k循环内部,因此将初始加载和最终存储移出k循环将减少每次迭代i循环的(5 + 3)* 160 = 1280内存I / O的内存负载每次循环i循环到2个内存I / O.哦,整个初始化循环(z循环)也可以被消除,因为初始值是迭代计数。因此,我们可以将每次迭代的内存I / O减少到1。

Something like this:

像这样的东西:

for (int j=0; j < 8; ++j)
{
    int * __restrict const arr4 = (int *) _alloca(16000000*sizeof(int));

    for (int i = 0; i < 16000000; ++i)
    {
       int v = i;    // eliminate initialization pass (z loop)
       for (int k = 0; k < 160; ++k)
       {
           v -= (7 - v * 6 % (i+77) + 5 * 4 / 3 + 3 % 2 + 1 - (i+7));
           v += ((77 - 2 - (i+9)/2 + v));
           v *= (8 - 2 + 6 % 3 / 2 + (i+6));
       }
       arr4[i] = v;
    }
    printf(" %i ", arr4[((j+1)*666)%16]);
}

The compiler cannot always make this optimization because memory writes are usually considered sacred because they can be observed by unknown parties outside of the current context. If you know more than the compiler about the situation, you can write better code than the compiler can.

编译器不能总是进行这种优化,因为内存写入通常被认为是神圣的,因为它们可以被当前上下文之外的未知方观察到。如果您不仅仅了解编译器的情况,那么您可以编写比编译器更好的代码。