Multiprocessing from joblib not parallelizing?

Time: 2021-10-01 01:05:09

Since I moved from Python 3.5 to 3.6, the parallel computation using joblib is not reducing the computation time. Here are the installed library versions: - python: 3.6.3 - joblib: 0.11 - numpy: 1.14.0

Based on a well-known example, I give below some sample code to reproduce the problem:

import time
import numpy as np
from joblib import Parallel, delayed

def square_int(i):
    return i * i

ndata = 1000000 
ti = time.time()
results = []    
for i in range(ndata):
    results.append(square_int(i))

duration = np.round(time.time() - ti,4)
print(f"standard computation: {duration} s" )

for njobs in [1,2,3,4] :
    ti = time.time()  
    results = []
    results = Parallel(n_jobs=njobs, backend="multiprocessing")\
        (delayed(square_int)(i) for i in range(ndata))
    duration = np.round(time.time() - ti,4)
    print(f"{njobs} jobs computation: {duration} s" )

I got the following output:

  • standard computation: 0.2672 s

  • 1 jobs computation: 352.3113 s

  • 2 jobs computation: 6.9662 s

  • 3 jobs computation: 7.2556 s

  • 4 jobs computation: 7.097 s

When I increase ndata by a factor of 10 and remove the 1-core computation, I get these results:

  • standard computation: 2.4739 s

  • 2 jobs computation: 77.8861 s

  • 3 jobs computation: 79.9909 s

  • 4 jobs computation: 83.1523 s

Does anyone have an idea in which direction I should investigate?


1 Answer

#1



I think the primary reason is that the overhead of parallelization outweighs its benefits. In other words, square_int is too simple to gain any performance improvement from running in parallel: it is so cheap that passing inputs and outputs between processes may take more time than executing square_int itself.

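You can get a rough feel for that per-task cost by timing a function that does essentially nothing; a minimal sketch (the noop function and the task count are only illustrative):

import time
from joblib import Parallel, delayed

def noop(i):
    # does essentially no work, so the measured time is almost entirely
    # dispatch and inter-process communication overhead
    return i

n_tasks = 10000
ti = time.time()
Parallel(n_jobs=2, backend="multiprocessing")(
    delayed(noop)(i) for i in range(n_tasks))
per_task = (time.time() - ti) / n_tasks
print(f"approx. overhead per dispatched task: {per_task * 1e6:.0f} microseconds")

On the numbers in the question (352 s for one million calls with n_jobs=1), that overhead is on the order of a few hundred microseconds per call, while i * i itself takes well under a microsecond.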

I modified your code by creating a square_int_batch function, so that each dispatched task processes a whole range of values. It reduces the computation time a lot, although it is still slower than the serial implementation.

import time
import numpy as np
from joblib import Parallel, delayed

def square_int(i):
    return i * i

def square_int_batch(a, b):
    # Each task processes a whole range of integers, so the inter-process
    # overhead is paid once per batch instead of once per element.
    results = []
    for i in range(a, b):
        results.append(square_int(i))
    return results

ndata = 1000000
ti = time.time()
results = []
for i in range(ndata):
    results.append(square_int(i))

# results = [square_int(i) for i in range(ndata)]

duration = np.round(time.time() - ti, 4)
print(f"standard computation: {duration} s")

batch_num = 3
batch_size = int(ndata / batch_num)

for njobs in [2, 3, 4]:
    ti = time.time()
    # results = Parallel(n_jobs=njobs)(delayed(square_int)(i) for i in range(ndata))
    # results = Parallel(n_jobs=njobs, backend="multiprocessing")(...)
    # The last batch runs up to ndata, so no elements are dropped when
    # ndata is not an exact multiple of batch_num.
    results = Parallel(n_jobs=njobs)(delayed(square_int_batch)(
        i * batch_size, (i + 1) * batch_size if i < batch_num - 1 else ndata)
        for i in range(batch_num))
    # Note: results is now a list of batch_num sub-lists; flatten it if you
    # need the same flat list as the serial version.
    duration = np.round(time.time() - ti, 4)
    print(f"{njobs} jobs computation: {duration} s")

And the computation timings are

standard computation: 0.3184 s
2 jobs computation: 0.5079 s
3 jobs computation: 0.6466 s
4 jobs computation: 0.4836 s

A few other suggestions that will help reduce the time:

  1. Use a list comprehension, results = [square_int(i) for i in range(ndata)], to replace the for loop in your specific case; it is faster. I tested it.

  2. Set batch_num to a reasonable size. The larger this value is, the more overhead; it started to get significantly slower when batch_num exceeded 1000 in my case. (joblib can also do this batching for you; see the sketch after this list.)

  3. I used the default backend loky instead of multiprocessing. It is slightly faster, at least in my case.
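
As an alternative to a hand-written batching function, Parallel also accepts a batch_size argument (it defaults to 'auto'), which groups many small tasks into a single dispatch for you. A minimal sketch, assuming your joblib version supports it; the concrete value of 10000 is only an illustration:

import time
from joblib import Parallel, delayed

def square_int(i):
    return i * i

ndata = 1000000
ti = time.time()
# batch_size groups thousands of tiny tasks per dispatch, amortizing the
# inter-process communication cost without changing the call site.
results = Parallel(n_jobs=4, batch_size=10000)(
    delayed(square_int)(i) for i in range(ndata))
print(f"4 jobs, batch_size=10000: {time.time() - ti:.4f} s")

Unlike the manual square_int_batch approach, this returns a flat list of results.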

From a few other SO questions, I read that multiprocessing is good for CPU-heavy tasks, for which I don't have an official definition. You can explore that yourself; see the sketch below.

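For a rough illustration of what "CPU-heavy" means here, a sketch with a deliberately expensive function (slow_square below is hypothetical and only exists to burn CPU time) does show the expected speed-up from extra processes:

import time
from joblib import Parallel, delayed

def slow_square(i):
    # artificially CPU-heavy: the work per call dwarfs the dispatch overhead
    total = 0
    for _ in range(200000):
        total += i * i
    return total

n_tasks = 200
for njobs in [1, 2, 4]:
    ti = time.time()
    Parallel(n_jobs=njobs, backend="multiprocessing")(
        delayed(slow_square)(i) for i in range(n_tasks))
    print(f"{njobs} jobs: {time.time() - ti:.2f} s")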
