Performance depends on the number of threads in OpenMP

Date: 2021-03-06 13:51:18

I have written a small matrix multiplication program using OpenMP. I get the best performance when I use 2 threads and the worst performance when I use 1000 threads. I have 64 processors in total; performance is best with 1 or 2 threads.

    ~/openmp/mat_mul>  cat /proc/cpuinfo | grep processor | wc -l
    64
    ~/openmp/mat_mul> export OMP_NUM_THREADS=2
    ~/openmp/mat_mul> time ./main 
    Total threads : 2
    Master thread initializing

    real    0m1.536s
    user    0m2.728s
    sys     0m0.200s
    ~/openmp/mat_mul> export OMP_NUM_THREADS=64
    ~/openmp/mat_mul> time ./main 
    Total threads : 64
    Master thread initializing

    real    0m25.755s
    user    4m34.665s
    sys     21m5.595s

This is my code for matrix multiplication.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>   /* for time() in srand() */

#define ROW_SIZE_A 100
#define COL_SIZE_A 5000
#define COL_SIZE_B 300

int get_random();

int main(int argc, char* argv[])
{
        int a[ROW_SIZE_A][COL_SIZE_A];
        int b[COL_SIZE_A][COL_SIZE_B];
        int c[ROW_SIZE_A][COL_SIZE_B];
        int i,j,k, tid, thread_cnt;

        srand(time(NULL));

        #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
        {
                tid = omp_get_thread_num();
                if(tid == 0)
                {
                        thread_cnt = omp_get_num_threads();
                        printf("Total threads : %d\n", thread_cnt);
                        printf("Master thread initializing\n");
                }
                #pragma omp parallel for schedule(static) 
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_A; j++)
                        {
                                a[i][j] = get_random();
                        }
                }
               #pragma omp parallel for schedule(static) 
                for(i=0; i<COL_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                b[i][j] = get_random();
                        }
                }
                #pragma omp parallel for schedule(static)
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                c[i][j] = 0;
                        }
                }

                #pragma omp barrier

                #pragma omp parallel for schedule(static) 
                for(i=0; i<ROW_SIZE_A; i++)
                {
                        for(j=0; j<COL_SIZE_B; j++)
                        {
                                c[i][j] = 0;
                                for(k=0; k<COL_SIZE_A; k++)
                                {
                                        c[i][j] += a[i][k] * b[k][j];
                                }
                        }
                }

        }

        return 0;
}

/* Simple helper so the listing compiles as-is; returns a small pseudo-random value. */
int get_random()
{
        return rand() % 100;
}

Can somebody tell me why this is happening ?

2 Answers

#1


2  

Your for-loops are not properly parallelised because you are using the wrong OpenMP construct. parallel for is a combined directive: it both creates a new parallel region and embeds a for worksharing construct in it. The iterations of the loop are then distributed among the threads of that inner region. As a result, each of the 64 outer threads runs every loop in its entirety, and they all write simultaneously over c. Besides producing a wrong answer, this has the catastrophic performance consequences you observed. Also, nested regions execute serially by default, unless nested parallelism is explicitly enabled by calling omp_set_nested(1) or by setting the OMP_NESTED environment variable appropriately.

Remove the parallel keyword from all for-loops within the parallel region:

    #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
    {
        ...
        #pragma omp parallel for schedule(static)
                    ^^^^^^^^ 
        for(i=0; i<ROW_SIZE_A; i++)
        {
           ...
        }
        ...
    }

should become

    #pragma omp parallel shared(a,b,c,thread_cnt) private(i,j,k,tid)
    {
        ...
        #pragma omp for schedule(static) 
        for(i=0; i<ROW_SIZE_A; i++)
        {
           ...
        }
        ...
    }

This will enable worksharing of the loop iterations between the threads of the outer region as expected.

#2


0  

In general, your processor can only run a fixed number of threads in parallel. Increasing the thread count beyond that number is not going to speed up your program. On the contrary, the very high thread count causes considerable scheduling overhead that slows your computation to a crawl (note the 21 minutes of sys time in the 64-thread run).

Also remember Amdahl's law: parallelism can only improve your performance so much.
