
Time: 2021-12-04 20:41:12

I am experimenting with Python's multiprocessing module. The sample code below runs without errors in an IPython notebook, but each time I execute the code block, additional Python processes are spawned in the background.


import multiprocessing as mp

def f(x):
    print("Hello World %s" % mp.current_process())
    return 1

pool = mp.Pool(3)

data = range(0,10)
pool.map(f, data)

When I save the same code in a regular .py file and run it, however, I get errors and have to kill the terminal to stop the program.


I fixed this by creating the pool under if __name__ == '__main__': and calling pool.close() to close the pool when done.


What best practices should one follow when using multiprocessing and its functions such as map, apply, and apply_async? I plan to use this module to read files in parallel and, hopefully, to speed up a few ML algorithms.


3 Answers



You have to put it under if __name__ ... because when Python starts a new process without fork (as it does on Windows), it effectively imports this module, so any code outside the if __name__ block runs again in every new process.


Best practice is to keep things in sensibly named, small, testable functions. Have a 'main()' function, which you then call from your if __name__ block.
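
As a minimal sketch of that layout (assuming Python 3; the names square and main are only illustrative, not from the question), the worker lives at module level, the pool is created inside main(), and main() is called only from the if __name__ block:

```python
import multiprocessing as mp


def square(x):
    """Worker function; defined at module level so child processes can find it."""
    return x * x


def main():
    # The pool is created only when main() runs, never at import time.
    pool = mp.Pool(processes=3)
    try:
        results = pool.map(square, range(10))
    finally:
        # close() stops new work from being submitted; join() waits for the workers to exit.
        pool.close()
        pool.join()
    print(results)
    return results


if __name__ == "__main__":
    main()
```

Because square() does not depend on the pool, it can be unit-tested on its own without starting any processes.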


Avoid global state (and module-level variables); it just makes things complicated. Instead, pass data to and from your processes explicitly. This can be slow, so first think about how to send as little data as possible. For instance, if you have a large config object, rather than sending the whole object to each process, write your worker functions so that they require only the one or two attributes they actually use, and send just those.
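
A rough illustration of the config example, assuming Python 3 (Pool.starmap needs 3.3 or later). The Config class, its attributes, and is_above are hypothetical names made up for this sketch:

```python
import multiprocessing as mp


class Config:
    """Hypothetical config object: mostly data the workers never look at."""

    def __init__(self):
        self.threshold = 5
        self.lookup_table = list(range(100000))  # large, and unused by the worker


def is_above(value, threshold):
    """The worker needs one integer from the config, not the whole object."""
    return value > threshold


def main():
    config = Config()
    values = [1, 4, 6, 9]
    # Only these small (value, threshold) tuples are pickled and sent to
    # the workers; config.lookup_table never leaves the parent process.
    tasks = [(v, config.threshold) for v in values]
    pool = mp.Pool(processes=2)
    try:
        flags = pool.starmap(is_above, tasks)
    finally:
        pool.close()
        pool.join()
    print(flags)
    return flags


if __name__ == "__main__":
    main()
```

Passing the whole config to each task would work too, but every task would then pay to pickle and unpickle the large lookup table.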


It's a lot easier to test things when they run serially, so write the code in a way that makes it easy to run sequentially instead of through map or similar.
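
One possible way to keep a serial path available is a small helper that falls back to a plain loop unless a process count is given. This is only a sketch assuming Python 3; run_all and word_count are hypothetical names:

```python
import multiprocessing as mp


def word_count(line):
    """Example worker: count whitespace-separated words in a line."""
    return len(line.split())


def run_all(func, items, processes=None):
    """Apply func to every item; run serially when processes is None."""
    if processes is None:
        # Serial path: easy to step through in a debugger or a unit test.
        return [func(item) for item in items]
    pool = mp.Pool(processes=processes)
    try:
        return pool.map(func, items)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    lines = ["a b c", "hello world", ""]
    serial = run_all(word_count, lines)
    parallel = run_all(word_count, lines, processes=2)
    # Both paths should produce the same list, in the same order.
    print(serial == parallel)
```

Tests can call run_all(func, items) without the processes argument and exercise the same worker logic with no child processes involved.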


It's a good idea to profile, since spawning new processes can sometimes end up slower than doing everything in a single process. The gevent module is worth a look too: if your program is network bound, gevent can sometimes be a lot quicker at doing things concurrently than multiprocessing.
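
A simple way to check whether a pool actually helps is to time both versions on the same input. The sketch below assumes Python 3 and uses hypothetical names (cheap_task, timed); the task is deliberately tiny, the kind of work where process startup and pickling overhead may well dominate. Actual timings depend on the machine, so none are shown:

```python
import multiprocessing as mp
import time


def cheap_task(x):
    """A deliberately tiny unit of work."""
    return x + 1


def timed(label, fn):
    """Call fn(), print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print("%s: %.4f s" % (label, elapsed))
    return result, elapsed


def main():
    data = list(range(1000))

    def in_pool():
        # Pool startup and shutdown are included in the measurement on purpose.
        pool = mp.Pool(processes=3)
        try:
            return pool.map(cheap_task, data)
        finally:
            pool.close()
            pool.join()

    serial, _ = timed("serial", lambda: [cheap_task(x) for x in data])
    parallel, _ = timed("pool", in_pool)
    assert serial == parallel


if __name__ == "__main__":
    main()
```

For a more detailed picture than wall-clock timing, the standard library's cProfile module can be run over the serial version.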




The Python docs mentioned are good. Also check out the question "Using Python's multiprocessing.Process class", which has some similar ideas. I would also recommend https://www.ibm.com/developerworks/aix/library/au-multiprocessing/. It is in Python and highlights some nice Pythonic approaches to multiprocessing.




The official Python documentation has lots of usage examples. It's probably the best way to learn best practices: http://docs.python.org/2/library/multiprocessing.html




