Inefficient multiprocessing of numpy-based calculations

I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:

import time
import numpy

from multiprocessing import Pool

def test_func(i):

    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)

    for i in range(2000):
        a = a + b
        b = a - b
        a = a - b

    return 1

t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

n_par = 4
pool = Pool()

t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1

print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)

When I execute it, the multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace numpy calculations with just time.sleep(10), this is what I get — perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?

Some additional info which may be useful:

I'm using OS X 10.9.5 and Python 3.4.2. The CPU is a Core i7 with 4 cores as reported by the system info (although the above program only takes 50% of CPU time in total, so the system info may not be taking hyperthreading into account).

When I run this, I see n_par processes in top, each working at 100% CPU.

If I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4).

3 Answers

#1

It looks like the test function you're using is memory bound. That means the run time you're seeing is limited by how fast the computer can pull the arrays from memory into cache. For example, the line a = a + b actually uses 3 arrays: a, b, and a new array that will replace a. These three arrays are about 8MB each (1e6 floats * 8 bytes per float). I believe the different i7s have something like 3MB - 8MB of shared L3 cache, so you cannot fit all 3 arrays in cache at once. Your CPU adds the floats faster than the arrays can be loaded into cache, so most of the time is spent waiting for the arrays to be read from memory. Because the cache is shared between the cores, you don't see any speedup by spreading the work onto multiple cores.
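
The size arithmetic above can be checked directly. This is a minimal sketch using the array sizes from the question; the 8 MiB L3 figure is an assumed value for illustration, not a measurement of any particular CPU.

```python
import numpy

# Sizes taken from the question: two float64 arrays of one million elements.
a = numpy.random.normal(size=1000000)
b = numpy.random.normal(size=1000000)

MIB = 1024 * 1024

# `a = a + b` reads a and b and writes a freshly allocated result array,
# so three arrays are live while the statement runs.
bytes_per_array = a.nbytes
working_set = 3 * bytes_per_array

# Assumed value for illustration only; the real L3 size depends on the CPU model.
assumed_l3_bytes = 8 * MIB

print("One array:   %.1f MiB" % (bytes_per_array / MIB))
print("Working set: %.1f MiB per process" % (working_set / MIB))
print("Fits in assumed L3:", working_set <= assumed_l3_bytes)
```

With several worker processes, each one has its own working set of this size, so the total competing for the shared cache grows with n_par.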

Memory-bound operations are an issue for numpy in general, and the only way I know to deal with them is to use something like Cython or Numba.
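
As a rough sketch of what the Numba route might look like, the three whole-array statements from the question can be fused into one pass per round, with no temporary arrays. The function names swap_rounds and swap_rounds_numpy are hypothetical, the fallback decorator exists only so the sketch runs without Numba installed, and whether this actually restores multi-core scaling on a given machine is not something this example demonstrates.

```python
import numpy

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        # Fallback so the sketch still runs (slowly) without Numba.
        return func


@njit
def swap_rounds(a, b, rounds):
    """Apply a = a + b; b = a - b; a = a - b element by element, in place."""
    for _ in range(rounds):
        for k in range(a.shape[0]):
            s = a[k] + b[k]   # a + b
            t = s - b[k]      # new value of b
            a[k] = s - t      # new value of a
            b[k] = t


def swap_rounds_numpy(a, b, rounds):
    """Reference version using whole-array expressions, as in the question."""
    for _ in range(rounds):
        a = a + b
        b = a - b
        a = a - b
    return a, b


# Small arrays keep the pure-Python fallback fast.
x = numpy.random.normal(size=1000)
y = numpy.random.normal(size=1000)
ref_a, ref_b = swap_rounds_numpy(x.copy(), y.copy(), 4)
swap_rounds(x, y, 4)
print(numpy.allclose(x, ref_a), numpy.allclose(y, ref_b))
```

The fused loop touches each element of a and b once per round, whereas the whole-array version streams through memory three times per round and allocates a new array each time.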

#2

One easy thing that should bump efficiency up is to do in-place array operations where possible: numpy.add(a, b, out=a) will not create a new array, while a = a + b will. If your for loop over numpy arrays could be rewritten as vector operations, that should be more efficient as well. Another possibility would be to use numpy.ctypeslib to enable shared memory numpy arrays (see: https://*.com/a/5550156/2379433).
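
The difference between the two forms can be observed by comparing the address of the data buffer before and after each operation. This is a small sketch; data_address is a hypothetical helper built on the array's __array_interface__ attribute.

```python
import numpy


def data_address(arr):
    """Return the address of the array's underlying data buffer."""
    return arr.__array_interface__["data"][0]


a = numpy.random.normal(size=1000000)
b = numpy.random.normal(size=1000000)

# Out-of-place: the right-hand side allocates a new array, then `a` is rebound.
old_a = a
a = a + b
allocated_new = data_address(a) != data_address(old_a)

# In-place: the result is written into the buffer that `a` already owns.
address_before = data_address(a)
numpy.add(a, b, out=a)        # same effect as `a += b`
numpy.subtract(a, b, out=a)   # same effect as `a -= b`
reused_buffer = data_address(a) == address_before

print("a = a + b allocated a new buffer:", allocated_new)
print("numpy.add(..., out=a) reused the buffer:", reused_buffer)
```

In-place operations reduce the number of arrays touched per statement from three to two, which shrinks the working set but does not remove the memory traffic entirely.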

#3

I have been programming numerical methods for mathematics and ran into the same problem: I wasn't seeing any speed-up for a supposedly CPU-bound problem. It turns out my problem was reaching the CPU cache memory limit.

I have been using Intel PCM (Intel® Performance Counter Monitor) to see how the cpu cache memory was behaving (displaying it inside Linux ksysguard). I also disabled 2 of my processors to have clearer results (2 are active).

Here is what I have found out with this code:

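import time
import multiprocessing as mp
import numpy as np

def somethinglong(b):
    n = 200000   # vector length
    m = 5000     # number of repetitions
    shared = np.arange(n)
    for i in np.arange(m):
        0.01*shared   # result is discarded, but a temporary array is still created

if __name__ == "__main__":
    pool = mp.Pool(2)
    jobs = [() for i in range(8)]
    for i in range(5):
        timei = time.time()
        pool.map(somethinglong, jobs, chunksize=1)
        #for job in jobs:
            #somethinglong(job)
        print(time.time()-timei)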

Example that doesn't reach the cache memory limit:

n=10000
m=100000

Sequential execution: 15s

2-process pool (no cache memory limit): 8s

It can be seen that there are no cache misses (all cache hits), so the speed-up is almost perfect: 15/8.

[Figure: memory cache hits, 2-process pool]

Example that reaches the cache memory limit:

n=200000
m=5000

Sequential execution: 14s

2-process pool (cache memory limit reached): 14s

In this case, I increased the size of the vector we operate on (and decreased the loop count, to keep execution times reasonable). Now the cache fills up and the processes constantly miss it, so there is no speedup: 14/14.

[Figure: memory cache misses, 2-process pool]
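
A back-of-the-envelope estimate shows why the two configurations behave so differently. This sketch assumes 8-byte elements (64-bit integers from np.arange and float64 for the product, which is typical on 64-bit Linux but platform-dependent), and working_set_bytes is a hypothetical helper, not part of the measurements above.

```python
ITEM_BYTES = 8  # assumed: 64-bit integers in `shared`, float64 in the product


def working_set_bytes(n, processes):
    # Each process holds `shared` plus the temporary created by 0.01*shared.
    return processes * 2 * n * ITEM_BYTES


for n in (10000, 200000):
    kib = working_set_bytes(n, processes=2) / 1024
    print("n=%d: about %.0f KiB across 2 processes" % (n, kib))
```

Under these assumptions the small case needs a few hundred KiB in total, while the large case needs several MiB, which is in the range of a typical shared last-level cache.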

Observation: assigning the result of the operation to a variable (aux = 0.01*shared) also uses cache memory and can make the problem memory bound (without increasing any vector size).
