Repa and Accelerate API Similarity
The Haskell repa library provides automatically parallel array computation on CPUs, while the accelerate library provides automatic data parallelism on GPUs. The two APIs are quite similar, with identical representations of N-dimensional arrays. One can even convert between accelerate and repa arrays with fromRepa and toRepa in Data.Array.Accelerate.IO:
fromRepa :: (Shapes sh sh', Elt e) => Array A sh e -> Array sh' e
toRepa :: Shapes sh sh' => Array sh' e -> Array A sh e
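For illustration only, here is a minimal sketch of the two conversion directions. It assumes the A representation tag is exported from the same module as fromRepa and toRepa, and the concrete type signatures are there to keep the Shapes constraint unambiguous:

import qualified Data.Array.Repa          as R
import qualified Data.Array.Accelerate    as A
import qualified Data.Array.Accelerate.IO as A

-- Hypothetical wrappers: a repa array held in accelerate-io's shared 'A'
-- representation can be handed to accelerate (and back) without copying.
toAccelerate :: R.Array A.A R.DIM2 Float -> A.Array A.DIM2 Float
toAccelerate = A.fromRepa

fromAccelerate :: A.Array A.DIM2 Float -> R.Array A.A R.DIM2 Float
fromAccelerate = A.toRepa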
There are multiple backends for accelerate, including LLVM, CUDA and FPGA (see Figure 2 of http://www.cse.unsw.edu.au/~keller/Papers/acc-cuda.pdf). I've spotted a repa backend for accelerate, though that library doesn't appear to be maintained. Given that the repa and accelerate programming models are similar, I am hopeful that there is an elegant way of switching between them, i.e. that functions written once can be executed either with repa's R.computeP or with one of accelerate's backends, e.g. with the CUDA run function.
Two very similar functions: Repa and Accelerate on a Pumpkin
Take a simple image processing thresholding function. If a grayscale pixel value is less than 50, then it is set to 0, otherwise it retains its value. Here's what it does to a pumpkin:
The following code presents repa and accelerate implementations:
module Main where
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.IO.BMP as R
import qualified Data.Array.Accelerate as A
import qualified Data.Array.Accelerate.IO as A
import qualified Data.Array.Accelerate.Interpreter as A
import Data.Word
-- Apply threshold over image using accelerate (interpreter backend)
thresholdAccelerate :: IO ()
thresholdAccelerate = do
    img <- either (error . show) id `fmap` A.readImageFromBMP "pumpkin-in.bmp"
    let newImg = A.run $ A.map evalPixel (A.use img)
    A.writeImageToBMP "pumpkin-out.bmp" newImg
  where
    -- A plain 'if p > 50 then p else 0' fails at run time with
    -- "Prelude.Ord.compare applied to EDSL types": Exp values must be
    -- compared and branched on with accelerate's own operators.
    evalPixel :: A.Exp A.Word32 -> A.Exp A.Word32
    evalPixel p = (p A.>* 50) A.? (p, 0)
-- Apply threshold over image using repa
thresholdRepa :: IO ()
thresholdRepa = do
    let arr :: IO (R.Array R.U R.DIM2 (Word8,Word8,Word8))
        arr = either (error . show) id `fmap` R.readImageFromBMP "pumpkin-in.bmp"
    img <- arr
    newImg <- R.computeP (R.map applyAtPoint img)
    R.writeImageToBMP "pumpkin-out.bmp" newImg
  where
    applyAtPoint :: (Word8,Word8,Word8) -> (Word8,Word8,Word8)
    applyAtPoint (r,g,b) =
      let [r',g',b'] = map applyThresholdOnPixel [r,g,b]
      in (r',g',b')
    applyThresholdOnPixel x = if x > 50 then x else 0
data BackendChoice = Repa | Accelerate
main :: IO ()
main = do
    let userChoice = Repa -- pretend this is a command line flag
    case userChoice of
      Repa       -> thresholdRepa
      Accelerate -> thresholdAccelerate
Question: can I write this only once?
The implementations of thresholdAccelerate and thresholdRepa are very similar. Is there an elegant way to write array processing functions once, and then programmatically switch between multicore CPUs (repa) and GPUs (accelerate)? I can think of choosing my import according to whether I want the CPU or the GPU, i.e. importing either Data.Array.Accelerate.CUDA or Data.Array.Repa and executing an action of type Acc a with:
run :: Arrays a => Acc a -> a
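A minimal sketch of that import-switching idea (the USE_CUDA CPP flag and module name are hypothetical, and note this can only choose between Accelerate backends, since repa provides no evaluator for Acc):

{-# LANGUAGE CPP #-}
module Threshold where

import Data.Array.Accelerate (Acc, Arrays)
#ifdef USE_CUDA
import Data.Array.Accelerate.CUDA (run)         -- GPU backend
#else
import Data.Array.Accelerate.Interpreter (run)  -- reference backend on the CPU
#endif

-- Every function written against Acc is executed by whichever backend
-- the flag selected at compile time.
execute :: Arrays a => Acc a -> a
execute = run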
Or, to use a type class, e.g. something roughly like:
main :: IO ()
main = do
    let userChoice = Repa -- pretend this is a command line flag
    action <- case userChoice of
      Repa       -> applyThreshold :: RepaBackend ()
      Accelerate -> applyThreshold :: CudaBackend ()
    action
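For comparison, a value-level switch is already possible when the choice is restricted to Accelerate backends alone (AccBackend and runWith are invented names for this sketch; a repa path would still need a separate implementation):

import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.Interpreter as Interp
import qualified Data.Array.Accelerate.CUDA        as CUDA

-- Hypothetical: pick an Accelerate evaluator at run time.
data AccBackend = Interpreter | Cuda

runWith :: A.Arrays a => AccBackend -> A.Acc a -> a
runWith Interpreter = Interp.run
runWith Cuda        = CUDA.run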
Or is it the case that, for each parallel array function I wish to express for both CPUs and GPUs, I must implement it twice --- once with the repa library and again with the accelerate library?
2 Answers
#1 (score 9)
The short answer is that, at the moment, you unfortunately need to write both versions.
However, we are working on CPU support for Accelerate, which will obviate the need for the Repa version of the code. In particular, Accelerate very recently gained a new LLVM-based backend that targets both GPUs and CPUs: https://github.com/AccelerateHS/accelerate-llvm
This new backend is still incomplete, buggy, and experimental, but we are planning to make it into a viable alternative to the current CUDA backend.
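To illustrate where this is heading: with the accelerate-llvm backends, the same Acc program can be handed to either target (module names as in the released accelerate-llvm-native and accelerate-llvm-ptx packages, so treat this as a sketch rather than a description of the code at the time of writing):

import qualified Data.Array.Accelerate             as A
import qualified Data.Array.Accelerate.LLVM.Native as CPU  -- multicore CPU backend
import qualified Data.Array.Accelerate.LLVM.PTX    as GPU  -- NVIDIA GPU backend

-- One definition, two execution targets.
dotp :: A.Vector Float -> A.Vector Float -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) (A.use xs) (A.use ys))

-- CPU.run (dotp xs ys)  evaluates on the host cores
-- GPU.run (dotp xs ys)  evaluates on the GPU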
#2 (score 3)
I thought about this a year and a few months ago while designing yarr. At that time there were serious issues with type family inference (or something like that, I don't remember exactly) which prevented implementing such a unifying wrapper over vector, repa, yarr, accelerate, etc. both efficiently and without requiring too many explicit type signatures, or perhaps prevented implementing it at all (I don't remember).
That was GHC 7.6. I don't know whether GHC 7.8 brings meaningful improvements in this area. Theoretically I didn't see any problems, so we can expect such a wrapper someday, sooner or later, when GHC is ready.