Do virtual cores help performance when parallelizing matrix multiplication?

Time: 2021-12-07 03:02:57

I have an O(n^3) matrix multiplication function in C.

#include <stdio.h>
#include <omp.h>

void matrixMultiplication(int N, double **A, double **B, double **C, int threadCount) {
  int i, j, k, tid;

  #pragma omp parallel num_threads(threadCount) shared(N, A, B, C) private(i, j, k, tid)
  {
    tid = omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < N; i++)
    {
      printf("Thread %d starting row %d\n", tid, i);
      for (j = 0; j < N; j++)
      {
        for (k = 0; k < N; k++)
        {
          C[i][j] += A[i][k] * B[k][j];
        }
      }
    }
  }
}

I am using OpenMP to parallelize this function by splitting up the multiplications. I am performing this computation on square matrices of size N = 3000 with a 1.8 GHz Intel Core i5 processor.

This processor has two physical cores and two virtual (hyper-threaded) cores, i.e. four logical cores in total. I observed the following timings for my computation:

  • 1 thread: 526.06 s

  • 2 threads: 264.531 s

  • 3 threads: 285.195 s

  • 4 threads: 279.914 s

I had expected the gains to continue as I increased the number of threads up to four. However, this obviously did not occur.

Why did this happen? Is it because a physical core and its virtual core together perform no better than the physical core alone?

2 Answers

#1 (2 votes)

Using more than one hardware thread per core can help or hurt, depending on circumstances.

It can help if one hardware thread stalls because of a cache miss, and the other hardware thread can keep going and keep the ALU busy.

It can hurt if each hardware thread forces evictions of data needed by the other thread. That is, the threads destructively interfere with each other.

One way to address the problem is to write the kernel in a way such that each thread needs only half the cache. For example, blocked matrix multiplication can be used to minimize the cache footprint of a matrix multiplication.

Another way is to write the algorithm in a way such that both threads operate on the same data at the same time, so they help each other bring data into cache (constructive interference). This approach is admittedly hard to do with OpenMP unless the implementation has good support for nested parallelism.

#2 (1 vote)

I guess that the bottleneck is the memory (or L3 CPU cache) bandwidth. Arithmetic is quite cheap these days.

If you can afford it, try benchmarking the same code with the same data on a more powerful processor (e.g. some socket 2013 i7).

Remember that on today's processors, a cache miss can last as long as several hundred instructions (or cycles): RAM is very slow relative to cache or the CPU.

BTW, if you have a GPGPU you could play with OpenCL.

Also, it is probable that linear-algebra packages like LAPACK (or other numerical libraries) are more efficient than your naive matrix multiplication.

You could also consider using __builtin_prefetch (see this).

BTW, numerical computation is hard. I am not an expert at all, but I have met people who worked in it for dozens of years (often after a PhD in the field).
