I find the array library Repa for Haskell very interesting, and wanted to write a simple program to try to understand how to use it. I also made a simple implementation using lists, which proved to be much faster. My main question is how I could improve the Repa code below to make it the most efficient (and hopefully also very readable). I am quite new to Haskell, and I couldn't find any easily understandable tutorial on Repa [edit: there is one on the Haskell Wiki, which I somehow forgot about when I wrote this], so don't assume I know anything. :) For example, I'm not sure when to use force or deepSeqArray.
The program is used to approximately calculate the volume of a sphere in the following way:
- The center point and radius of the sphere are specified, as well as equally spaced coordinates within a cuboid that is known to encompass the sphere.
- The program takes each coordinate, calculates its distance to the center of the sphere, and if that distance is smaller than the radius, the coordinate contributes to the total (approximate) volume of the sphere.
Two versions are shown below, one using lists and one using repa. I know the code is inefficient, especially for this use case, but the idea is to make it more complicated later on.
For the values below, compiling with "ghc -Odph -fllvm -fforce-recomp -rtsopts -threaded", the list version takes 1.4 s, while the repa version takes 12 s with +RTS -N1 and 10 s with +RTS -N2, though no sparks are converted (I have a dual-core Intel machine (Core 2 Duo E7400 @ 2.8 GHz) running Windows 7 64, GHC 7.0.2 and llvm 2.8). (Comment out the appropriate line in main below to run just one of the versions.)
Thank you for any help!
import Data.Array.Repa as R
import qualified Data.Vector.Unboxed as V
import Prelude as P
-- Calculate the volume of a sphere by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.
particles = [(0,0,0)] -- used for the list alternative --[(0,0,0),(0,2,0)]
particles_repa = [0,0,0::Double] -- used for the repa alternative, can currently just be one coordinate
-- Radius of the particle
a = 4
-- Generate the coordinates. Could this be done more efficiently, and at the same time simple? In Matlab I would use ndgrid.
step = 0.1 --0.05
xrange = [-10,-10+step..10] :: [Double]
yrange = [-10,-10+step..10]
zrange = [-10,-10+step..10]
-- All coordinates as triples. These are used directly in the list version below.
coords = [(x,y,z) | x <- xrange, y <- yrange, z <- zrange]
---- List code ----
volumeIndividuals = fromIntegral (length particles) * 4*pi*a**3/3
volumeInside = step**3 * fromIntegral (numberInsideParticles particles coords)
numberInsideParticles particles coords = length $ filter (==True) $ P.map (insideParticles particles) coords
insideParticles particles coord = any (==True) $ P.map (insideParticle coord) particles
insideParticle (xc,yc,zc) (xp,yp,zp) = ((xc-xp)^2+(yc-yp)^2+(zc-zp)^2) < a**2
---- End list code ----
---- Repa code ----
-- Put the coordinates in a Nx3 array.
xcoords = P.map (\(x,_,_) -> x) coords
ycoords = P.map (\(_,y,_) -> y) coords
zcoords = P.map (\(_,_,z) -> z) coords
-- Total number of coordinates
num_coords = (length xcoords) ::Int
xcoords_r = fromList (Z :. num_coords :. (1::Int)) xcoords
ycoords_r = fromList (Z :. num_coords :. (1::Int)) ycoords
zcoords_r = fromList (Z :. num_coords :. (1::Int)) zcoords
rcoords = xcoords_r R.++ ycoords_r R.++ zcoords_r
-- Put the particle coordinates in an array, then extend (replicate) this array so that its size becomes the same as that of rcoords
particle = fromList (Z :. (1::Int) :. (3::Int)) particles_repa
particle_slice = slice particle (Z :. (0::Int) :. All)
particle_extended = extend (Z :. num_coords :. All) particle_slice
-- Calculate the squared difference between the (x,y,z) coordinates of the particle and the coordinates of the cuboid.
squared_diff = deepSeqArrays [rcoords,particle_extended] ((force2 rcoords) -^ (force2 particle_extended)) **^ 2
(**^) arr pow = R.map (**pow) arr
xslice = slice squared_diff (Z :. All :. (0::Int))
yslice = slice squared_diff (Z :. All :. (1::Int))
zslice = slice squared_diff (Z :. All :. (2::Int))
-- Calculate the distance between each coordinate and the particle center
sum_squared_diff = [xslice,yslice,zslice] `deepSeqArrays` xslice +^ yslice +^ zslice
-- Do the rest using vector, since I didn't get the repa variant working.
ssd_vec = toVector sum_squared_diff
-- Determine the number of the coordinates that are within the particle (instead of taking the square root to get the distances above, I compare to the square of the radius here, to improve performance)
total_within = fromIntegral (V.length $ V.filter (<a**2) ssd_vec)
--total_within = foldAll (\x acc -> if x < a**2 then acc+1 else acc) 0 sum_squared_diff
-- Finally, calculate an approximation of the volume of the sphere by taking the volume of the cubes with side step, multiplied with the number of coordinates within the sphere.
volumeInside_repa = step**3 * total_within
-- Helper function that shows the size of a 2-D array.
rsize = reverse . listOfShape . (extent :: Array DIM2 Double -> DIM2)
---- End repa code ----
-- Comment out the list or the repa version if you want to time the calculations separately.
main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volumeIndividuals
    putStrLn $ "Volume of cubes inside particles (list) = " P.++ show volumeInside
    putStrLn $ "Volume of cubes inside particles (repa) = " P.++ show volumeInside_repa
Edit: Some background that explains why I have written the code as it is above:
I mostly write code in Matlab, and my experience of performance improvement comes mostly from that area. In Matlab, you usually want to make your calculations using functions operating on matrices directly, to improve performance. My implementation of the problem above, in Matlab R2010b, takes 0.9 seconds using the matrix version shown below, and 15 seconds using nested loops. Although I know Haskell is very different from Matlab, my hope was that going from lists to Repa arrays in Haskell would improve the performance of the code. The conversions from lists -> Repa arrays -> vectors are there because I'm not skilled enough to replace them with something better. This is why I ask for input. :) The timing numbers above are thus subjective, since they may measure my performance more than the abilities of the languages, but they are a valid metric for me right now, since what I end up using depends on whether I can make it work or not.
tl;dr: I understand that my Repa code above may be stupid or pathological, but it's the best I can do right now. I would love to be able to write better Haskell code, and I hope that you can help me in that direction (dons already did). :)
function archimedes_simple()
particles = [0 0 0]';
a = 4;
step = 0.1;
xrange = [-10:step:10];
yrange = [-10:step:10];
zrange = [-10:step:10];
[X,Y,Z] = ndgrid(xrange,yrange,zrange);
dists2 = bsxfun(@minus,X,particles(1)).^2+ ...
bsxfun(@minus,Y,particles(2)).^2+ ...
bsxfun(@minus,Z,particles(3)).^2;
inside = dists2 < a^2;
num_inside = sum(inside(:));
disp('');
disp(['Step = ' num2str(step)]);
disp(['Volume of individual particles = ' num2str(size(particles,2)*4*pi*a^3/3)]);
disp(['Volume of cubes inside particles = ' num2str(step^3*num_inside)]);
end
Edit 2: New, faster and simpler version of the Repa code
I have now read up a bit more on Repa, and thought a bit. Below is a new Repa version. In this case, I create the x, y and z coordinates as 3-D arrays, using the Repa extend function, from a list of values (similar to how ndgrid works in Matlab). I then map over these arrays to calculate the distance to the spherical particle. Finally, I fold over the resulting 3-D distance array, count how many coordinates are within the sphere, and multiply by a constant factor to get the approximate volume. My implementation of the algorithm is now much more similar to the Matlab version above, and there are no longer any conversions to vectors.
The new version runs in about 5 seconds on my computer, a considerable improvement over the above. The timing is the same whether or not I compile with "-threaded" and run with "+RTS -N2", though the threaded version does max out both cores of my computer. I did, however, see a few runs of the "-N2" version drop to 3.1 seconds, but couldn't reproduce them later. Maybe it is very sensitive to other processes running at the same time? I have shut down most programs on my computer when benchmarking, but there are still some programs running, such as background processes.
If we use "-N2" and add the runtime switch to turn off parallel GC (-qg), the time consistently goes down to ~4.1 seconds, and using -qa to "use the OS to set thread affinity (experimental)", the time is shaved down to ~3.5 seconds. Looking at the output from running the program with "+RTS -s", much less GC is performed with -qg.
This afternoon I will see if I can run the code on an 8-core computer, just for fun. :)
import Data.Array.Repa as R
import Prelude as P
import qualified Data.List as L
-- Calculate the volume of a spherical particle by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.
particles :: [(Double,Double,Double)]
particles = [(0,0,0)]
-- Radius of the spherical particle
a = 4
volume_individuals = fromIntegral (length particles) * 4*pi*a^3/3
-- Generate the coordinates.
step = 0.1
coords_list = [-10,-10+step..10] :: [Double]
num_coords = (length coords_list) :: Int
coords :: Array DIM1 Double
coords = fromList (Z :. (num_coords ::Int)) coords_list
coords_slice :: Array DIM1 Double
coords_slice = slice coords (Z :. All)
-- x, y and z are 3-D arrays, where the same index into each array can be used to find a single coordinate, e.g. (x(i,j,k),y(i,j,k),z(i,j,k)).
x,y,z :: Array DIM3 Double
x = extend (Z :. All :. num_coords :. num_coords) coords_slice
y = extend (Z :. num_coords :. All :. num_coords) coords_slice
z = extend (Z :. num_coords :. num_coords :. All) coords_slice
-- Calculate the squared distance from each coordinate to the center of the spherical particle.
dist2 :: (Double, Double, Double) -> Array DIM3 Double
dist2 particle = ((R.map (squared_diff xp) x) + (R.map (squared_diff yp) y) + (R.map ( squared_diff zp) z))
where
(xp,yp,zp) = particle
squared_diff xi xa = (xa-xi)^2
-- Count how many of the coordinates are within the spherical particle.
num_inside_particle :: (Double,Double,Double) -> Double
num_inside_particle particle = foldAll (\acc x -> if x<a^2 then acc+1 else acc) 0 (force $ dist2 particle)
-- Calculate the approximate volume covered by the spherical particle.
volume_inside :: [Double]
volume_inside = P.map ((*step^3) . num_inside_particle) particles
main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volume_individuals
    putStrLn $ "Volume of cubes inside each particle (repa) = " P.++ (P.concat . (L.intersperse ", ") . P.map show) volume_inside
-- As an alternative, y and z could be generated from x, but this was slightly slower in my tests (~0.4 s).
--y = permute_dims_3D x
--z = permute_dims_3D y
-- Permute the dimensions in a 3-D array, (forward, cyclically)
permute_dims_3D a = backpermute (swap e) swap a
    where
        e = extent a
        swap (Z :. i :. j :. k) = Z :. k :. i :. j
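The cyclic swap used by permute_dims_3D can be sanity-checked in isolation; a minimal sketch on plain index triples (no Repa dependency; swapIdx and unswapIdx are illustrative names, not from the code above): applying the forward cyclic shift three times gives back the original index.

```haskell
-- Plain-tuple version of the shape permutation used in permute_dims_3D:
-- (i, j, k) -> (k, i, j), a forward cyclic shift of the dimensions.
swapIdx :: (Int, Int, Int) -> (Int, Int, Int)
swapIdx (i, j, k) = (k, i, j)

-- Its inverse is the backward shift, useful if you need to go the other way.
unswapIdx :: (Int, Int, Int) -> (Int, Int, Int)
unswapIdx (i, j, k) = (j, k, i)
```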
Space profiling for the new code
The same types of profiles as Don Stewart made below, but for the new Repa code.
3 Answers
#1
64
Code Review Notes
- 47.8% of your time is spent in GC.
- 1.5G is allocated on the heap (!)
- The repa code looks a lot more complicated than the list code.
- Lots of parallel GC is occurring.
- I can get up to 300% efficiency on a -N4 machine.
- Putting in more type signatures will make it easier to analyze...
- rsize isn't used (looks expensive!)
- You convert repa arrays to vectors, why?
- All your uses of (**) could be replaced by the cheaper (^) on Int.
- There's a suspicious number of large, constant lists. Those all have to be converted to arrays -- that seems expensive.
- any (==True) is the same as or.
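The last two points can be applied mechanically to the list code; a minimal standalone sketch (reusing the list version's names for illustration):

```haskell
-- Sketch of the two micro-fixes suggested above, applied to the list version.
-- Comparing against a * a (or a ^ (2 :: Int)) avoids the floating-point (**),
-- and `or` replaces `any (==True)`.
a :: Double
a = 4

insideParticle :: (Double, Double, Double) -> (Double, Double, Double) -> Bool
insideParticle (xc, yc, zc) (xp, yp, zp) =
    (xc - xp) ^ (2 :: Int) + (yc - yp) ^ (2 :: Int) + (zc - zp) ^ (2 :: Int) < a * a

insideParticles :: [(Double, Double, Double)] -> (Double, Double, Double) -> Bool
insideParticles particles coord = or (map (insideParticle coord) particles)
```

(^) with an Int exponent compiles down to repeated multiplication, while (**) goes through the floating-point pow routine.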
Time profiling
COST CENTRE MODULE %time %alloc
squared_diff Main 25.0 27.3
insideParticle Main 13.8 15.3
sum_squared_diff Main 9.8 5.6
rcoords Main 7.4 5.6
particle_extended Main 6.8 9.0
particle_slice Main 5.0 7.6
insideParticles Main 5.0 4.4
yslice Main 3.6 3.0
xslice Main 3.0 3.0
ssd_vec Main 2.8 2.1
**^ Main 2.6 1.4
shows that your function squared_diff is a bit suspicious:
squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
((force2 rcoords) -^ (force2 particle_extended)) **^ 2
though I don't see any obvious fix.
Space profiling
Nothing too amazing in the space profile: you clearly see the list phase, then the vector phase. The list phase allocates a lot, which gets reclaimed.
Breaking down the heap by type, we see initially a lot of lists and tuples being allocated (on demand), then a big chunk of arrays are allocated and held:
Again, kinda what we expected to see... the array stuff isn't allocating especially more than the list code (in fact, a bit less overall), but it is just taking a lot longer to run.
Checking for space leaks with retainer profiling:
There are a few interesting things there, but nothing startling. zcoords gets retained for the length of the list program execution, then some arrays (SYSTEM) are allocated for the repa run.
Inspecting the Core
So at this point I'm firstly assuming that you really did implement the same algorithms in lists and arrays (i.e. no extra work is being done in the array case), and there's no obvious space leak. So my suspicion is badly-optimized repa code. Let's look at the core (with ghc-core).
- The list-based code looks fine.
- The array code looks reasonable (i.e. unboxed primitives appear), but very complex, and a lot of it.
Inlining all the CAFs
I added inline pragmas to all the top level array definitions, in the hope of removing some of the CAFs and getting GHC to optimize the array code a bit harder. This really made GHC struggle to compile the module (allocating up to 4.3G and taking 10 minutes while working on it). This is a clue to me that GHC wasn't able to optimize this program well before, since there's new stuff for it to do when I increase the thresholds.
Actions
- Using -H can decrease the time spent in GC.
- Try to eliminate the conversions from lists to repa arrays to vectors.
- All those CAFs (top level constant data structures) are kinda weird -- a real program wouldn't be a list of top level constants -- in fact, this module is pathologically so, causing lots of values to be retained over long periods, instead of being optimized away. Float local definitions inwards.
- Ask for help from Ben Lippmeier, the author of Repa, particularly since there's some funky optimization stuff happening.
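As a hedged sketch of the "float local definitions inwards" advice (the names and structure here are illustrative, not from the code above): instead of top-level CAFs for the grid and the intermediate lists, take the inputs as function arguments, so the intermediates are local and can be consumed and discarded rather than retained.

```haskell
import Data.List (foldl')

-- Illustrative only: the grid and the counting are local to the function,
-- parameterized by step and radius, so nothing is retained as a top-level CAF.
approxVolume :: Double -> Double -> (Double, Double, Double) -> Double
approxVolume step radius (xp, yp, zp) = step ^ (3 :: Int) * fromIntegral nInside
  where
    grid = [-10, -10 + step .. 10] :: [Double]
    nInside :: Int
    nInside = foldl' count 0
        [ (x - xp) ^ (2 :: Int) + (y - yp) ^ (2 :: Int) + (z - zp) ^ (2 :: Int)
        | x <- grid, y <- grid, z <- grid ]
    count acc d2 = if d2 < radius * radius then acc + 1 else acc
```

For a sphere of radius 4 centered at the origin, approxVolume should land close to 4/3 * pi * 4^3 even for fairly coarse steps, with nothing held alive between calls.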
#2
7
I changed the code to force rcoords and particle_extended, and discovered we were losing the lion's share of time within them directly:
COST CENTRE MODULE %time %alloc
rcoords Main 32.6 34.4
particle_extended Main 21.5 27.2
**^ Main 9.8 12.7
The biggest single improvement to this code would clearly be to generate those two constant inputs in a better fashion.
Note that this is basically a lazy, streaming algorithm, and where you're losing time is the sunk cost of allocating at least two 24361803-element arrays all in one go, and then probably allocating at least once or twice more or giving up sharing. The very best case for this code, I think, with a very good optimizer and a zillion rewrite rules, will be to roughly match the list version (which can also parallelize very easily).
I think dons is right that Ben & co. will be interested in this benchmark, but my overwhelming suspicion is that this is not a good use case for a strict array library, and my suspicion is that matlab is hiding some clever optimizations behind its ndgrid function (optimizations, I'll grant, which it might be useful to port to repa).
Edit:
Here's a quick and dirty way to parallelize the list code. Import Control.Parallel.Strategies and then write numberInsideParticles as:
numberInsideParticles particles coords = length $ filter id $
    withStrategy (parListChunk 2000 rseq) $ P.map (insideParticles particles) coords
This shows good speedup as we scale up cores (12 s at one core down to 3.7 s at 8), but the overhead of spark creation means that even at 8 cores we only match the single-core non-parallel version. I tried a few alternate strategies and got similar results. Again, I'm not sure how much better we can possibly do than a single-threaded list version here. Since the computations on each individual particle are so cheap, we're mainly stressing allocation, not computation. The big win on something like this, I imagine, would be vectorized computation more than anything else, and as far as I know that pretty much requires hand-coding.
Also note that the parallel version spends roughly 70% of its time in GC, while the one-core version spends 1% of its time there (i.e. the allocation is, to the extent possible, effectively fused away).
#3
7
I've added some advice about how to optimise Repa programs to the Haskell wiki: http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Repa_Tutorial#Optimising_Repa_programs