Best practices for using the "multiprocessing" package in Python

Posted: 2021-12-04 20:41:12

I am experimenting with the multiprocessing module in Python. The sample code below executes without any errors in an IPython notebook, but I see additional Python processes spawned in the background with each execution of the code block.

import multiprocessing as mp

def f(x):
    print("Hello World", mp.current_process())
    return 1

pool = mp.Pool(3)

data = range(0,10)
pool.map(f, data)

However, when I save the same code in a normal .py file and execute it, I encounter errors and have to kill the terminal to stop the program from running.

I corrected this by adding an if __name__ == '__main__': guard, creating the pool under it, and calling pool.close() to close the pool.

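With those two changes, the script version might look like this (a sketch of the fix described above; the main() name is illustrative):

```python
import multiprocessing as mp

def f(x):
    # Runs in a worker process; printing shows which process handled x.
    print("Hello World", mp.current_process())
    return 1

def main():
    data = range(0, 10)
    # Creating the pool under the __main__ guard keeps child processes
    # that re-import this module from spawning pools of their own.
    pool = mp.Pool(3)
    try:
        return pool.map(f, data)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    results = main()
    print(results)
```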

I am curious to know what best practices one should follow when using multiprocessing and the associated functions such as map, apply, apply_async, etc. I plan to use this module for reading files in parallel and hope to apply it to a few ML algorithms to speed up the process.

3 Answers

#1


3  

The reason you have to put it in if __name__ ... is that when Python spawns a new process, it effectively imports this module - thus trying to run any code that is not inside the if __name__ block again and again.

Best practice is to keep things in sensibly named, small, testable functions. Have a 'main()' function, which you then call from your if __name__ block.

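A minimal sketch of that layout (the square and main names are illustrative): the worker function is small and pure, so it can be unit-tested without any pool involved.

```python
import multiprocessing as mp

def square(x):
    # Small, pure, and easy to test on its own, with no pool involved.
    return x * x

def main():
    # Pool as a context manager (Python 3.3+) handles cleanup on exit.
    with mp.Pool(3) as pool:
        return pool.map(square, range(5))

if __name__ == '__main__':
    print(main())
```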

Avoid global state (and module level variables). It just makes things complicated. Instead, think of passing things to and from your processes. This can be slow, so thinking first about how to send as little data as possible is useful. For instance, if you have a large config object, rather than send the whole config object to each process, split your process functions into only requiring the one or two attributes that they actually use, and just send those.

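For example (a hypothetical Config class; the attribute names are made up), the workers receive only the two values they use instead of the whole object:

```python
import multiprocessing as mp

class Config:
    # Imagine many large attributes; the workers only need two of them.
    def __init__(self):
        self.scale = 2.0
        self.offset = 1.0
        self.huge_lookup_table = list(range(1_000_000))  # not needed by workers

def transform(args):
    # Receives just the two values it uses, not the whole Config.
    x, scale, offset = args
    return x * scale + offset

def main():
    cfg = Config()
    data = [0, 1, 2, 3]
    # Send only cfg.scale and cfg.offset, avoiding pickling the big table
    # once per task.
    tasks = [(x, cfg.scale, cfg.offset) for x in data]
    with mp.Pool(2) as pool:
        return pool.map(transform, tasks)

if __name__ == '__main__':
    print(main())
```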

It's a lot easier to test things when they happen serially, so write your code in such a way that it's easy to run sequentially (for instance with the built-in map) rather than only through the pool.

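One way to keep a serial path available is a flag that swaps pool.map for the built-in map (a sketch; the parallel flag and function names are illustrative):

```python
import multiprocessing as mp

def work(x):
    return x + 1

def run(data, parallel=False):
    # Same code path either way; flip the flag off when debugging or testing.
    if parallel:
        with mp.Pool(2) as pool:
            return pool.map(work, data)
    return list(map(work, data))

if __name__ == '__main__':
    assert run([1, 2, 3]) == run([1, 2, 3], parallel=True)
```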

It's a good idea to profile things, as the overhead of spawning new processes can sometimes make the parallel version slower than doing everything in one thread. The gevent module is pretty cool too - if your program is network-bound, then gevent can sometimes be a lot quicker at doing things in parallel than multiprocessing.

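A quick way to check whether the pool actually helps is to time both paths (a rough sketch using time.perf_counter; for a tiny task like this one, the serial version often wins because of process startup and pickling overhead):

```python
import multiprocessing as mp
import time

def tiny_task(x):
    return x * x

def timed(label, fn):
    # Run fn once and report how long it took.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

def main():
    data = list(range(1000))
    serial = timed("serial", lambda: list(map(tiny_task, data)))

    def parallel_run():
        with mp.Pool(2) as pool:
            return pool.map(tiny_task, data)

    parallel = timed("pool", parallel_run)
    # Both paths must agree; only the timing differs.
    assert serial == parallel

if __name__ == '__main__':
    main()
```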

#2


2  

The Python docs mentioned are good - also check out the question "Using Python's multiprocessing.Process class", which has some similar ideas. I would also recommend checking out https://www.ibm.com/developerworks/aix/library/au-multiprocessing/. It is in Python and highlights some nice Pythonic approaches to multiprocessing.

#3


1  

The official Python documentation has lots of usage examples. It's probably the best way to learn best practices: http://docs.python.org/2/library/multiprocessing.html
