What is the overhead of the different forms of parallelism in Julia v0.5?

Time: 2022-10-31 13:52:12

As the title states, what is the overhead of the different forms of parallelism, at least in the current implementation of Julia (v0.5, in case the implementation changes drastically in the future)? I am looking for some "practical measures", some general heuristics or ballparks to keep in my head for when it can be useful. For example, it's pretty obvious that multiprocessing won't give you gains in a loop like:


addprocs(4)                  # add 4 worker processes
@parallel (+) for i = 1:4    # parallel (+) reduction: each iteration contributes one rand()
  rand()
end

because each process is only generating a single random number. But is there a general heuristic for knowing when parallelism will be worthwhile? Also, what about a heuristic for threading? It surely has lower overhead than multiprocessing, but, for example, with 4 threads, for what N is it a good idea to multithread:


A = rand(4)
Base.@threads (+) for i = 1:N   # hypothetical threaded (+) reduction; @threads doesn't actually take a reducer
  A[i%4+1]
end

(I know there isn't a threaded reduction right now, but let's act like there is, or edit in a better example.) Sure, I can benchmark every example, but some good rules to keep in mind would go a long way.

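For reference, the kind of benchmark I have in mind could look something like the sketch below. Since there is no threaded reduction yet, it reduces into per-thread accumulators by hand; threaded_sum is just an illustrative helper name, and it assumes Julia was started with JULIA_NUM_THREADS=4:

function threaded_sum(A)
    partials = zeros(Threads.nthreads())       # one accumulator per thread avoids a data race
    Threads.@threads for i = 1:length(A)
        partials[Threads.threadid()] += A[i]   # adjacent slots may suffer false sharing, so this is only a rough probe
    end
    return sum(partials)
end

threaded_sum(rand(10))                         # warm up so compilation isn't timed

for N in (10^3, 10^4, 10^5, 10^6)
    A = rand(N)
    println("N = $N: serial $(@elapsed sum(A)) s, threaded $(@elapsed threaded_sum(A)) s")
end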

In more concrete terms: what are some good rules of thumb?


  • How many numbers do you need to be adding/multiplying before threading gives performance enhancements, or before multiprocessing gives performance enhancements?
  • How much does this depend on Julia's current implementation?
  • How much does it depend on the number of threads/processes?
  • How much does this depend on the architecture? Are there good rules for knowing when the threshold should be higher or lower on a particular system?
  • What kinds of applications violate these heuristics?

Again, I'm not looking for hard rules, just general guidelines to guide development.


1 Answer

#1



A few caveats: 1. I'm speaking from experience with version 0.4.6 (and prior) and haven't played with 0.5 yet (but, as I hope my answer below demonstrates, I don't think this matters much for the response I give). 2. This isn't a fully comprehensive answer.


Nevertheless, in my experience the overhead of multiple processes itself is very small, provided that you aren't dealing with data movement issues. In other words, any time you find yourself wishing something were faster than a single process on your CPU can manage, you're already well past the point at which parallelism becomes beneficial. For instance, in the sum-of-random-numbers example that you gave, I found through testing just now that the break-even point was somewhere around 10,000 random numbers; anything more and parallelism was the clear winner. Generating 10,000 random numbers is trivial for modern computers, taking a tiny fraction of a second, and is well below the threshold where I'd start getting frustrated by the slowness of my scripts and want parallelism to speed them up.

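For concreteness, a test along those lines can be written roughly as below (serial_sum and parallel_sum are just illustrative helper names, and the exact break-even N will of course vary from machine to machine):

addprocs(4)                        # start 4 worker processes (skip if already added)

function serial_sum(N)
    s = 0.0
    for i = 1:N
        s += rand()
    end
    return s
end

function parallel_sum(N)
    @parallel (+) for i = 1:N      # parallel (+) reduction across the workers
        rand()
    end
end

serial_sum(10); parallel_sum(10)   # warm up so compilation isn't timed

for N in (10^3, 10^4, 10^5, 10^6)
    println("N = $N: serial $(@elapsed serial_sum(N)) s, parallel $(@elapsed parallel_sum(N)) s")
end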

Thus, I at least am of the opinion that, although there are probably further things the Julia developers could do to cut the overhead down even more, at this point anything pertinent to Julia isn't going to be much of a limiting factor for you, at least in terms of the computational aspects of parallelism. I think there are still improvements to be made in enhancing both the ease and the efficiency of parallel data movement (I like the package that you've started on that topic as a good step; you and I would probably both agree there's still a ways to go). But the big limiting factors will be:


  1. How much data do you need to be moving around between processes?
  2. How much reading from and writing to memory do you need to do during your computations (e.g. flops per read/write)?

Aspect 1 might at times weigh against using parallelism at all. Aspect 2 is more likely just to mean that you won't get as much benefit from it. And, at least as I interpret "overhead," neither of these really falls directly under that specific consideration. Both of them, I believe, are going to be determined far more heavily by your system hardware than by Julia.

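To give a feel for point 1, a quick timing along the lines of the sketch below shows how shipping a sizable array to a worker can easily swamp the computation itself (the 10^7 size is arbitrary, and the exact numbers depend heavily on your hardware):

addprocs(1)                                            # one extra worker is enough for the illustration
A = rand(10^7)

sum(A); remotecall_fetch(sum, workers()[1], A)         # warm up so compilation isn't timed

t_local  = @elapsed sum(A)
t_remote = @elapsed remotecall_fetch(sum, workers()[1], A)   # serializes and ships A to the worker on every call
println("local: $t_local s, remote (includes moving A): $t_remote s")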
