Migrating a single-threaded application to multi-threaded, parallel execution (Monte Carlo simulation)

Time: 2022-04-01 21:01:18

I've been tasked with taking an existing single-threaded Monte Carlo simulation and optimising it. This is a C# console app with no DB access; it loads data once from a CSV file and writes it out at the end, so it's pretty much just CPU bound. It also only uses about 50 MB of memory.

I've run it through the JetBrains dotTrace profiler. Of the total execution time, about 30% is spent generating uniform random numbers and 24% translating uniform random numbers into normally distributed random numbers.

The basic algorithm is a whole lot of nested for loops, with random number calls and matrix multiplication at the centre. Each iteration returns a double which is added to a results list; this list is periodically sorted and tested against some convergence criteria (at checkpoints every 5% of the total iteration count). If the criteria are met, the program breaks out of the loops and writes the results; otherwise it proceeds to the end.

I'd like developers to weigh in on:

  • should I use new Thread vs. the ThreadPool?

  • should I look at the Microsoft Parallel Extensions library?

  • should I look at AForge.Net Parallel.For (http://code.google.com/p/aforge/) or any other libraries?

Some links to tutorials on the above would be most welcome as I've never written any parallel or multi-threaded code.

  • best strategies for generating normally distributed random numbers en masse, and then consuming them. Uniform random numbers are never used in that state by the app; they are always translated to normally distributed numbers and then consumed.

  • good, fast libraries (parallel?) for random number generation

  • memory considerations as I take this parallel: how much extra memory will I require?

The current app takes 2 hours for 500,000 iterations; the business needs this to scale to 3,000,000 iterations and be called multiple times a day, so it needs some heavy optimisation.

I'd particularly like to hear from people who have used the Microsoft Parallel Extensions or AForge.Net Parallel.

This needs to be productionised fairly quickly, so the .NET 4 beta is out even though I know it has concurrency libraries baked in; we can look at migrating to .NET 4 later down the track once it's released. For the moment the server has .NET 2; I've submitted for review an upgrade to .NET 3.5 SP1, which my dev box already has.

Thanks

Update

I've just tried the Parallel.For implementation but it comes up with some weird results. Single threaded:

IRandomGenerator rnd = new MersenneTwister();
IDistribution dist = new DiscreteNormalDistribution(discreteNormalDistributionSize);
List<double> results = new List<double>();

// Run the simulation once per checkpoint and collect the returned doubles
for (int i = 0; i < CHECKPOINTS; i++)
{
    results.AddRange(Oblist.Simulate(rnd, dist, n));
}

To:

Parallel.For(0, CHECKPOINTS, i =>
{
    results.AddRange(Oblist.Simulate(rnd, dist, n));
});

Inside Simulate there are many calls to rnd.nextUniform(). I think I am getting many values that are the same; is this likely to happen because this is now parallel?

Also, maybe there are issues with the List AddRange call not being thread safe?

I see that System.Threading.Collections.BlockingCollection might be worth using, but it only has an Add method and no AddRange, so I'd have to iterate over the results and add them in a thread-safe manner. Any insight from someone who has used Parallel.For would be much appreciated. I temporarily switched to System.Random for my calls, as I was getting an exception when calling nextUniform with my Mersenne Twister implementation; perhaps it wasn't thread safe, since a certain array was getting an index out of bounds....

3 Answers

#1


First you need to understand why you think that using multiple threads is an optimization - when it is, in fact, not. Using multiple threads will make your workload complete faster only if you have multiple processors, and then at most as many times faster as you have CPUs available (this is called the speed-up). The work is not "optimized" in the traditional sense of the word (i.e. the amount of work isn't reduced - in fact, with multithreading, the total amount of work typically grows because of the threading overhead).

So in designing your application, you have to find pieces of work that can be done in a parallel or overlapping fashion. It may be possible to generate random numbers in parallel (by having multiple RNGs run on different CPUs), but that would also change the results, as you get different random numbers. Another option is to have the random numbers generated on one CPU and everything else on different CPUs. This can give you a maximum speedup of 3, as the RNG will still run sequentially and still takes 30% of the load.

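(As a rough Amdahl-style check of that bound: if the RNG stage stays sequential at about 30% of the total work, the overall run can be at most about 1 / 0.3 ≈ 3.3 times faster, which is rounded down here to 3.)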

So if you go for this parallelization, you end up with 3 threads: thread 1 runs the RNG, thread 2 produces normal distribution, and thread 3 does the rest of the simulation.

For this architecture, a producer-consumer architecture is most appropriate. Each thread will read its input from a queue, and produce its output into another queue. Each queue should be blocking, so if the RNG thread falls behind, the normalization thread will automatically block until new random numbers are available. For efficiency, I would pass the random numbers across threads in arrays of, say, 100 (or larger), to avoid synchronizing on every random number.

For this approach, you don't need any advanced threading. Just use the regular Thread class; no pool, no library. The only thing you need that is (unfortunately) not in the standard library is a blocking queue class (the Queue class in System.Collections is no good). CodeProject provides a reasonable-looking implementation of one; there are probably others.

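As a rough sketch of what such a blocking queue could look like on .NET 2.0/3.5 (this is an illustration built on Monitor.Wait/Pulse, not the CodeProject class; capacity bounding is omitted for brevity):

using System.Collections.Generic;
using System.Threading;

// Minimal blocking queue: consumers block in TryDequeue until an item arrives or Close() is called.
public class BlockingQueue<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private bool closed;

    public void Enqueue(T item)
    {
        lock (queue)
        {
            queue.Enqueue(item);
            Monitor.Pulse(queue);      // wake one waiting consumer
        }
    }

    // Returns false once the queue has been closed and drained.
    public bool TryDequeue(out T item)
    {
        lock (queue)
        {
            while (queue.Count == 0 && !closed)
                Monitor.Wait(queue);   // releases the lock and blocks until pulsed
            if (queue.Count == 0) { item = default(T); return false; }
            item = queue.Dequeue();
            return true;
        }
    }

    public void Close()
    {
        lock (queue)
        {
            closed = true;
            Monitor.PulseAll(queue);   // let every blocked consumer observe the close
        }
    }
}

The RNG thread would enqueue double[] batches of uniform numbers onto one BlockingQueue<double[]>; the normalisation thread dequeues a batch, transforms it, and enqueues it onto a second queue for the simulation thread; each thread calls Close() on its output queue when it finishes.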

#2


List<double> is definitely not thread-safe. See the section "thread safety" in the System.Collections.Generic.List documentation. The reason is performance: adding thread safety is not free.

Your random number implementation also isn't thread-safe; getting the same numbers multiple times is exactly what you'd expect in this case. Let's use the following simplified model of rnd.NextUniform() to understand what is happening:

  1. calculate a pseudo-random number from the current state of the object

  2. update the state of the object so the next call yields a different number

  3. return the pseudo-random number

Now, if two threads execute this method in parallel, something like this may happen:

  • Thread A calculates a random number as in step 1.

  • Thread B calculates a random number as in step 1. Thread A has not yet updated the state of the object, so the result is the same.

  • Thread A updates the state of the object as in step 2.

  • Thread B updates the state of the object as in step 2, trampling over A's state changes or maybe giving the same result.

As you can see, any reasoning you can do to prove that rnd.NextUniform() works is no longer valid because two threads are interfering with each other. Worse, bugs like this depend on timing and may appear only rarely as "glitches" under certain workloads or on certain systems. Debugging nightmare!

One possible solution is to eliminate the state sharing: give each task its own random number generator, initialized with a different seed (assuming that the instances are not sharing state through static fields in some way).

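To make that concrete, here is a sketch using plain threads with one RNG per worker. The MersenneTwister(seed) constructor and the even split of CHECKPOINTS across workers are assumptions made for illustration; Oblist, n, CHECKPOINTS and discreteNormalDistributionSize are the identifiers from the question and are assumed to be in scope:

int workerCount = Environment.ProcessorCount;
int chunksPerWorker = CHECKPOINTS / workerCount;   // assumes CHECKPOINTS divides evenly
List<double> results = new List<double>();
object resultsLock = new object();
List<Thread> workers = new List<Thread>();

for (int w = 0; w < workerCount; w++)
{
    int seed = 12345 + w;   // any scheme that gives each worker a distinct seed
    Thread t = new Thread(() =>
    {
        // Each worker owns its RNG, distribution and result buffer: no shared mutable state.
        IRandomGenerator localRnd = new MersenneTwister(seed);   // seeded ctor assumed
        IDistribution localDist = new DiscreteNormalDistribution(discreteNormalDistributionSize);
        List<double> localResults = new List<double>();

        for (int i = 0; i < chunksPerWorker; i++)
            localResults.AddRange(Oblist.Simulate(localRnd, localDist, n));

        lock (resultsLock)   // only the final merge is synchronised
            results.AddRange(localResults);
    });
    workers.Add(t);
    t.Start();
}
workers.ForEach(t => t.Join());

Note that the seeds must differ per worker, and that using different seeds (or a different number of streams) will produce a different, but equally valid, sequence of draws than the single-threaded run.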

Another (inferior) solution is to create a field holding a lock object in your MersenneTwister class like this:

private object lockObject = new object();

Then use this lock in your MersenneTwister.NextUniform() implementation:

public double NextUniform()
{
   lock(lockObject)
   {
      // original code here
   }
}

This will prevent two threads from executing the NextUniform() method in parallel. The problem with the list in your Parallel.For can be addressed in a similar manner: separate the Simulate call and the AddRange call, and then add locking around the AddRange call.

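For example (a sketch only; syncLock is a new object introduced here for illustration, and this still relies on the locked NextUniform() above because rnd and dist remain shared):

object syncLock = new object();

Parallel.For(0, CHECKPOINTS, i =>
{
    // Do the expensive work outside any lock...
    var chunk = Oblist.Simulate(rnd, dist, n);

    // ...and hold the lock only for the brief merge into the shared list.
    lock (syncLock)
    {
        results.AddRange(chunk);
    }
});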

My recommendation: avoid sharing any mutable state (like the RNG state) between parallel tasks if at all possible. If no mutable state is shared, no threading issues occur. This also avoids locking bottlenecks: you don't want your "parallel" tasks to wait on a single random number generator that doesn't work in parallel at all, especially if 30% of the time is spent acquiring random numbers.

Limit state sharing and locking to places where you can't avoid it, like when aggregating the results of parallel execution (as in your AddRange calls).

#3


Threading is going to be complicated. You will have to break your program into logical units that can each be run on their own threads, and you will have to deal with any concurrency issues that emerge.

The Parallel Extensions Library should allow you to parallelize your program by changing some of your for loops into Parallel.For loops. If you want to see how this works, Anders Hejlsberg and Joe Duffy provide a good introduction in their 30-minute video here:

http://channel9.msdn.com/shows/Going+Deep/Programming-in-the-Age-of-Concurrency-Anders-Hejlsberg-and-Joe-Duffy-Concurrent-Programming-with/

Threading vs. ThreadPool

The ThreadPool, as its name implies, is a pool of threads. Using the ThreadPool to obtain your threads has some advantages. Thread pooling enables you to use threads more efficiently by providing your application with a pool of worker threads that are managed by the system.

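A minimal example of handing work to the pool (the work-item body here is just a placeholder; in this app it could be one chunk of simulation iterations):

// Queue a unit of work; a system-managed worker thread picks it up when one is free.
ThreadPool.QueueUserWorkItem(state =>
{
    Console.WriteLine("Running on pool thread " + Thread.CurrentThread.ManagedThreadId);
});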
