Suppose I want to price a call option using a finite difference method and repa; then the following does the job:
import Data.Array.Repa as Repa
r, sigma, k, t, xMax, deltaX, deltaT :: Double
m, n, p :: Int
r = 0.05
sigma = 0.2
k = 50.0
t = 3.0
m = 3
p = 1
xMax = 150
deltaX = xMax / (fromIntegral m)
n = 800
deltaT = t / (fromIntegral n)
singleUpdater a = traverse a id f
  where
    Z :. m = extent a
    f _get (Z :. ix) | ix == 0   = 0.0
    f _get (Z :. ix) | ix == m-1 = xMax - k
    f  get (Z :. ix)             = a * get (Z :. ix-1) +
                                   b * get (Z :. ix) +
                                   c * get (Z :. ix+1)
      where
        -- explicit finite-difference coefficients for the Black-Scholes PDE
        a = deltaT * (sigma^2 * (fromIntegral ix)^2 - r * (fromIntegral ix)) / 2
        b = 1 - deltaT * (r + sigma^2 * (fromIntegral ix)^2)
        c = deltaT * (sigma^2 * (fromIntegral ix)^2 + r * (fromIntegral ix)) / 2
priceAtT :: Array U DIM1 Double
priceAtT = fromListUnboxed (Z :. m+1) [max 0 (deltaX * (fromIntegral j) - k) | j <- [0..m]]
testSingle :: IO (Array U DIM1 Double)
testSingle = computeP $ singleUpdater priceAtT
But now suppose I want to price options in parallel (say, to do a spot ladder); then I can do this:
multiUpdater a = fromFunction (extent a) f
  where
    f :: DIM2 -> Double
    f (Z :. ix :. jx) = (singleUpdater x)!(Z :. ix)
      where
        x :: Array D DIM1 Double
        x = slice a (Any :. jx)
priceAtTMulti :: Array U DIM2 Double
priceAtTMulti = fromListUnboxed (Z :. m+1 :. p+1)
                  [max 0 (deltaX * (fromIntegral j) - k) | j <- [0..m], _l <- [0..p]]
testMulti :: IO (Array U DIM2 Double)
testMulti = computeP $ multiUpdater priceAtTMulti
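For reference, a minimal main to drive the parallel version (the main itself is a sketch added here, not part of the original module):

main :: IO ()
main = do
  -- computeP runs in a monad; binding the result here forces the
  -- parallel evaluation before the grid is printed
  prices <- testMulti
  print prices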
Questions:
- Is there a more idiomatic way of doing this in repa?
- Will the above method actually price in parallel?
- How can I determine whether my code really is generating something that will be executed in parallel?
I tried this for question 3 but sadly encountered a bug in ghc:
bash-3.2$ ghc -fext-core --make Test.hs
[1 of 1] Compiling Main ( Test.hs, Test.o )
ghc: panic! (the 'impossible' happened)
(GHC version 7.4.1 for x86_64-apple-darwin):
MkExternalCore died: make_lit
1 Answer
#1
Your bug is unrelated to your code -- it is your use of -fext-core to print the output of compilation in the External Core format. Just don't do that (to see the core, I use ghc-core).
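For example (a sketch, assuming ghc-core is installed from Hackage; GHC's -ddump-simpl flag is the built-in alternative):

$ cabal install ghc-core
$ ghc-core A.hs               # page through the optimized Core
$ ghc -O2 -ddump-simpl A.hs   # or dump the Core directly with GHC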
Compile with -O2 and -threaded:
$ ghc -O2 -rtsopts --make A.hs -threaded
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
Then run with +RTS -N4, for example, to use 4 threads:
$ time ./A +RTS -N4
[0.0,0.0,8.4375e-3,8.4375e-3,50.009375,50.009375,100.0,100.0]
./A 0.00s user 0.00s system 85% cpu 0.008 total
So it completes too quickly to see a result. I'll increase your m and p parameters to 1k and 3k:
$ time ./A +RTS -N2
./A +RTS -N2 3.03s user 1.33s system 159% cpu 2.735 total
So yes, it does run in parallel: 1.6x on a 2-core machine, at a first attempt. Whether or not it is efficient is another question. Use +RTS -s and you can see the runtime stats:
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
So we had 3 threads running in parallel (2 presumably for the algo, one for the IO manager).
You can reduce running time by adjusting the GC settings. E.g. by using -A we can reduce GC overhead, and yield genuine parallel speedups.
$ time ./A +RTS -N1 -A100M
./A +RTS -N1 -A100M 1.99s user 0.29s system 99% cpu 2.287 total
$ time ./A +RTS -N2 -A100M
./A +RTS -N2 -A100M 2.30s user 0.86s system 147% cpu 2.145 total
You can sometimes improve numeric performance by using the LLVM backend. This seems to be the case here as well:
$ ghc -O2 -rtsopts --make A.hs -threaded -fforce-recomp -fllvm
[1 of 1] Compiling Main ( A.hs, A.o )
Linking A ...
$ time ./A +RTS -N2 -A100M
./A +RTS -N2 -A100M 2.09s user 0.95s system 147% cpu 2.065 total
Nothing spectacular, but you are improving running time over your single threaded version, and I've not modified your original code in any way. To really improve things, you'll need to profile and optimize.
Revisiting the -A flag, we can bring the time down yet further by using a tighter bound on the initial thread allocation area.
$ ghc -Odph -rtsopts --make A.hs -threaded -fforce-recomp -fllvm
$ time ./A +RTS -N2 -A60M -s
./A +RTS -N2 -A60M 1.99s user 0.73s system 144% cpu 1.880 total
So we have brought it down from 2.7s to 1.8s (a 30% improvement) by using the parallel runtime, the LLVM backend, and being careful with GC flags. You can look at the GC flag surface to find the optimum, with the trough around -A64 -N2 being ideal for the data set size.
I would also strongly consider using manual common subexpression elimination in your inner kernel, to avoid recomputing things excessively.
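For instance, a sketch of the kernel with the repeated subexpressions bound once (the names i, s2i2 and ri are mine):

f get (Z :. ix) = a * get (Z :. ix-1) +
                  b * get (Z :. ix) +
                  c * get (Z :. ix+1)
  where
    -- compute each repeated subexpression exactly once
    i    = fromIntegral ix
    s2i2 = sigma^2 * i^2
    ri   = r * i
    a    = deltaT * (s2i2 - ri) / 2
    b    = 1 - deltaT * (r + s2i2)
    c    = deltaT * (s2i2 + ri) / 2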
As Alp suggests, to see the runtime behavior of the program, compile threadscope (from Hackage) and run as follows:
$ ghc -O2 -fllvm -rtsopts -threaded -eventlog --make A.hs
$ ./A +RTS -ls -N2 -A60M
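Then open the resulting log in the viewer (the eventlog file takes the binary's name):

$ threadscope A.eventlog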
And you get an event trace for your two cores like so:
So what's going on here? You have an initial period (0.8s) of setup time -- allocating your big list and encoding it into a repa array -- as you can see by the single threaded interleaving of GC and execution. Then there's another 0.8s of something on a single core, before your actual parallel work occurs for the last 300ms.
So while your actual algorithm may be parallelizing nicely, all the surrounding test setup basically swamps the result. If we serialize your dataset, and then just load it back from disk, we can get better behavior:
$ time ./A +RTS -N2 -A60M
./A +RTS -N2 -A60M 1.76s user 0.25s system 186% cpu 1.073 total
and now your profile looks healthier:
This looks great! Very little GC (98.9% productivity), and my two cores running in parallel happily.
So, finally, we can see you get good parallelism:
With 1 core, 1.855s
$ time ./A +RTS -N1 -A25M
./A +RTS -N1 -A25M 1.75s user 0.11s system 100% cpu 1.855 total
and with 2 cores, 1.014s
$ time ./A +RTS -N2 -A25M
./A +RTS -N2 -A25M 1.78s user 0.13s system 188% cpu 1.014 total
Now, to answer your questions specifically:
- Is there a more idiomatic way of doing this in repa?
In general, repa code should consist of parallel traversals, consumers and producers, and inlinable kernel functions. So as long as you are doing that, the code is probably idiomatic. If in doubt, look at the tutorial. I'd in general mark your worker kernels (like f) as INLINE.
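For example, a minimal sketch (not the module above) of attaching an INLINE pragma to a where-bound kernel:

doubleIt :: Array U DIM1 Double -> Array D DIM1 Double
doubleIt arr = traverse arr id f
  where
    {-# INLINE f #-}          -- ask GHC to inline the kernel into the loop
    f get ix = 2 * get ix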
- Will the above method actually price in parallel?
The code will execute in parallel if you use parallel combinators like computeP or the various maps and folds. So yes, it should and does run in parallel.
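For instance, a minimal sketch of that pattern (Repa.map builds a delayed array; computeP then evaluates it in parallel across the capabilities given by +RTS -N):

scaleP :: Array U DIM1 Double -> IO (Array U DIM1 Double)
scaleP = computeP . Repa.map (* 2)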
- How can I determine whether my code really is generating something that will be executed in parallel?
Generally, you will know a priori because you use parallel operations. If in doubt, run the code and observe the speedup. You may then need to optimize the code.