UPDATE
Unfortunately, due to an oversight on my part, I had an older version of MKL (11.1) linked against numpy. The newer version of MKL (11.3.1) gives the same performance in C and when called from Python.
What was obscuring things: even when linking the compiled shared libraries explicitly against the newer MKL, and pointing the LD_* variables to them, doing import numpy in Python somehow still made Python call the old MKL libraries. Only by replacing all libmkl_*.so files in the Python lib folder with the newer MKL was I able to match the performance of the Python and C calls.
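For anyone debugging a similar mix-up, here is a minimal, Linux-only sketch for listing which MKL shared objects the running Python process has actually mapped (the /proc/self/maps parsing is just illustrative):

import numpy  # imported for the side effect of loading numpy's BLAS backend

# Linux-only: scan this process's memory map for MKL shared objects.
with open('/proc/self/maps') as f:
    mkl_paths = sorted({line.split()[-1] for line in f if 'libmkl' in line})
for p in mkl_paths:
    print(p)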
Background / library info.
Matrix multiplication was done via sgemm (single-precision) and dgemm (double-precision) Intel MKL library calls, through the numpy.dot function. The actual calls to the library functions can be verified with e.g. oprof.
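As a quicker first check than oprof, numpy can also print which BLAS/LAPACK it was built against (a minimal check; MKL-backed builds typically list mkl_rt or other mkl_* libraries in the output):

import numpy as np
# Prints the BLAS/LAPACK build configuration numpy was compiled against.
np.__config__.show()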
Using here a 2x18-core CPU (E5-2699 v3), hence a total of 36 physical cores. KMP_AFFINITY=scatter. Running on Linux.
TL;DR
1) Why is numpy.dot, even though it is calling the same MKL library functions, at best twice as slow compared to C-compiled code?
2) Why does performance via numpy.dot decrease with an increasing number of cores, whereas the same effect is not observed in C code (calling the same library functions)?
The problem
I've observed that doing matrix multiplication of single/double-precision floats in numpy.dot, as well as calling cblas_sgemm/dgemm directly from a compiled C shared library, gives noticeably worse performance compared to calling the same MKL cblas_sgemm/dgemm functions from inside pure C code.
import numpy as np
import mkl
n = 10000
A = np.random.randn(n,n).astype('float32')
B = np.random.randn(n,n).astype('float32')
C = np.zeros((n,n)).astype('float32')
mkl.set_num_threads(3); %time np.dot(A, B, out=C)
11.5 seconds
mkl.set_num_threads(6); %time np.dot(A, B, out=C)
6 seconds
mkl.set_num_threads(12); %time np.dot(A, B, out=C)
3 seconds
mkl.set_num_threads(18); %time np.dot(A, B, out=C)
2.4 seconds
mkl.set_num_threads(24); %time np.dot(A, B, out=C)
3.6 seconds
mkl.set_num_threads(30); %time np.dot(A, B, out=C)
5 seconds
mkl.set_num_threads(36); %time np.dot(A, B, out=C)
5.5 seconds
Doing exactly the same as above, but with double precision A, B and C, you get: 3 cores: 20s, 6 cores: 10s, 12 cores: 5s, 18 cores: 4.3s, 24 cores: 3s, 30 cores: 2.8s, 36 cores: 2.8s.
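The sweeps above can also be scripted; here is a minimal sketch using timeit instead of the %time magic, covering both precisions (it assumes the same mkl module, i.e. mkl-service, used above):

import timeit
import numpy as np
import mkl

n = 10000
for dtype in ('float32', 'float64'):
    A = np.random.randn(n, n).astype(dtype)
    B = np.random.randn(n, n).astype(dtype)
    C = np.zeros((n, n), dtype=dtype)
    for t in (3, 6, 12, 18, 24, 30, 36):
        mkl.set_num_threads(t)
        np.dot(A, B, out=C)  # warm-up run, excluded from the timing
        dt = timeit.timeit(lambda: np.dot(A, B, out=C), number=3) / 3
        print('%s, %2d threads: %.2f s' % (dtype, t, dt))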
The drop-off in speed for single-precision floats seems to be associated with cache misses. For the 28-core run, here is the output of perf. For single precision:
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./ptestf.py
631,301,854 cache-misses # 31.478 % of all cache refs
And double precision:
93,087,703 cache-misses # 5.164 % of all cache refs
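For context when reading these miss rates, a quick footprint calculation (the 45 MB L3-per-socket figure for the E5-2699 v3 is from Intel's specs):

n = 10000
for dtype_name, nbytes in (('float32', 4), ('float64', 8)):
    total_gb = 3 * n**2 * nbytes / 1e9   # A, B and C together
    print('%s: %.1f GB' % (dtype_name, total_gb))
# float32: 1.2 GB, float64: 2.4 GB - both vastly exceed the 45 MB L3 per
# socket, so low miss rates depend on MKL's cache blocking, not on the
# matrices fitting in cache.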
C shared library, compiled with:
/opt/intel/bin/icc -o comp_sgemm_mkl.so -openmp -mkl sgem_lib.c -lm -lirc -O3 -fPIC -shared -std=c99 -vec-report1 -xhost -I/opt/intel/composer/mkl/include
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

void comp_sgemm_mkl(int m, int n, int k, float *A, float *B, float *C)
{
    /* C = alpha * A * B + beta * C; row-major, no transposes */
    const float alpha = 1.0f, beta = 0.0f;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, k, B, n, beta, C, n);
}
Python wrapper function, calling the above compiled library:
from ctypes import CDLL, c_int, c_void_p
import numpy as np

omplib = './comp_sgemm_mkl.so'  # path to the shared library compiled above

def comp_sgemm_mkl(A, B, out=None):
    lib = CDLL(omplib)
    lib.comp_sgemm_mkl.argtypes = [c_int, c_int, c_int,
        np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
        np.ctypeslib.ndpointer(dtype=np.float32, ndim=2),
        np.ctypeslib.ndpointer(dtype=np.float32, ndim=2)]
    lib.comp_sgemm_mkl.restype = c_void_p
    m = A.shape[0]   # rows of A and of C
    k = A.shape[1]   # cols of A == rows of B
    n = B.shape[1]   # cols of B and of C
    if np.isfortran(A):
        raise ValueError('Fortran array')
    if k != B.shape[0]:
        raise ValueError('Wrong matrix dimensions')
    if out is None:
        out = np.empty((m, n), np.float32)
    lib.comp_sgemm_mkl(m, n, k, A, B, out)
    return out
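A quick usage sketch for the wrapper (assuming the shared library above is on the given path; tolerances are loose because float32 GEMM results differ slightly between backends):

n = 1000
A = np.random.randn(n, n).astype('float32')
B = np.random.randn(n, n).astype('float32')
C = comp_sgemm_mkl(A, B)
# Compare against numpy's own GEMM; bitwise equality is not expected since
# summation order differs with threading/blocking.
assert np.allclose(C, np.dot(A, B), rtol=1e-3, atol=1e-4)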
However, explicit calls from a C-compiled binary calling MKL's cblas_sgemm / cblas_dgemm, with arrays allocated through malloc in C, give almost 2x better performance compared to the Python code, i.e. the numpy.dot call. Also, the effect of performance degradation with an increasing number of cores is NOT observed. The best performance was 900 ms for single-precision matrix multiplication, achieved when using all 36 physical cores via mkl_set_num_threads and running the C code with numactl --interleave=all.
Any fancy tools or advice for profiling/inspecting/understanding this situation further? Any reading material is much appreciated as well.
UPDATE: Following @Hristo Iliev's advice, running numactl --interleave=all ./ipython did not change the timings (within noise), but it does improve the pure C binary runtimes.
1 Answer
#1
7
I suspect this is due to unfortunate thread scheduling. I was able to reproduce an effect similar to yours. Python was running at ~2.2 s, while the C version was showing huge variations, from 1.4 to 2.2 s.
Applying KMP_AFFINITY=scatter,granularity=thread ensures that each of the 28 threads always runs on the same processor thread.
This reduces both runtimes to more stable values: ~1.24 s for C and ~1.26 s for Python.
This is on a 28-core dual-socket Xeon E5-2680 v3 system.
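If you drive this from the Python side, note that the variable must be set before the OpenMP runtime initializes, i.e. before numpy/MKL are imported; a minimal sketch:

import os
# Must happen before importing numpy/MKL - the OpenMP runtime reads its
# environment once at initialization. Append ",verbose" to log the placement.
os.environ['KMP_AFFINITY'] = 'scatter,granularity=thread'

import numpy as np  # BLAS threads will now be pinned as requested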
Interestingly, on a very similar 24-core dual-socket Haswell system, Python and C perform almost identically even without thread affinity/pinning.
Why does Python affect the scheduling? Well, I assume there is more of a runtime environment around it. The bottom line is that, without pinning, your performance results will be non-deterministic.
Also you need to consider that the Intel OpenMP runtime spawns an extra management thread that can confuse the scheduler. There are more options for pinning, for instance KMP_AFFINITY=compact - but for some reason that is totally messed up on my system. You can add ,verbose to the variable to see how the runtime is pinning your threads.
likwid-pin is a useful alternative providing more convenient control.
In general single precision should be at least as fast as double precision. Double precision can be slower because:
- You need more memory/cache bandwidth for double precision.
- You can build ALUs that have higher throughput for single precision, but that usually doesn't apply to CPUs; it rather applies to GPUs.
I would think that once you get rid of the performance anomaly, this will be reflected in your numbers.
When you scale up the number of threads for MKL/*gemm, consider:
- Memory/shared cache bandwidth may become a bottleneck, limiting scalability.
- Turbo mode will effectively decrease the core frequency with increasing utilization. This applies even when you run at nominal frequency: on Haswell-EP processors, AVX instructions impose a lower "AVX base frequency" - but the processor is allowed to exceed it when fewer cores are utilized / thermal headroom is available, and in general even more so for short periods. If you want perfectly neutral results, you would have to use the AVX base frequency, which is 1.9 GHz for you. It is documented here, and explained in one picture. (A quick sanity check against that limit follows below.)
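As a back-of-the-envelope sanity check against the AVX base frequency (a sketch; the 32 FLOP/cycle/core figure assumes Haswell's two 256-bit FMA units):

cores = 36                 # 2 sockets x 18 cores (E5-2699 v3)
avx_base_hz = 1.9e9        # AVX base frequency quoted above
flop_per_cycle = 32        # single precision: 2 FMA units x 8 lanes x 2 FLOP
peak = cores * avx_base_hz * flop_per_cycle   # ~2.19e12 FLOP/s
work = 2 * 10000**3                           # FLOPs in an n=10000 GEMM (2*n^3)
print('compute-bound lower bound: %.2f s' % (work / peak))  # ~0.91 s
# Consistent with the ~900 ms best single-precision C result reported above.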
I don't think there is a really simple way to measure how your application is affected by bad scheduling. You can expose it with perf trace -e sched:sched_switch, and there is some software to visualize the result, but it comes with a high learning curve. And then again: for parallel performance analysis you should have the threads pinned anyway.