Suspiciously low times when benchmarking Data.Vector [duplicate]

This question already has an answer here:

Global / local environment affects Haskell's Criterion benchmarks results 1 answer

I am benchmarking Haskell's array libraries (the array and vector packages) to come up with the best way of storing large data for my use case. I am using criterion as the benchmarking tool.

Long story short: my code simply allocates a vector and proceeds to fill it with simple structs (1M, 10M, and 100M elements, respectively). When I compare the Haskell benchmark times with a simple reference implementation I wrote in C, Haskell is a few times faster and I find it suspicious: the C code is a simple loop filling the structs in the array.

The question: is it possible for Haskell's vector library to beat C in terms of performance? Or does it mean my benchmarks are flawed/something is not actually evaluated/there's some 'gotcha'?

Another question: how can I make sure that the Haskell vectors are actually evaluated?

Longer explanation: The task at hand is to fill a vector with a large number of structs. They have Storable instances and the vector used is Data.Vector.Storable.

The data type is the following:

data Foo = Foo Int Int deriving (Show, Eq, Generic, NFData)

And the Storable instances look like this:

chunkSize :: Int
chunkSize = sizeOf (undefined :: Int)
{-# INLINE chunkSize #-}

instance Storable Foo where
    sizeOf    _ = 2 * chunkSize ; {-# INLINE sizeOf    #-}
    alignment _ = chunkSize     ; {-# INLINE alignment #-}
    peek ptr = Foo
        <$> peekByteOff ptr 0
        <*> peekByteOff ptr chunkSize
    {-# INLINE peek #-}
    poke ptr (Foo a b) = do
        pokeByteOff ptr 0 a
        pokeByteOff ptr chunkSize b
    {-# INLINE poke #-}

The serialization itself seems to work fine. The vector is then allocated:

mkFooVec :: Int -> IO (Vector Foo)
mkFooVec !i = unsafeFreeze =<< new (i + 1)

And populated with the structs:

populateFooVec :: Int -> Vector Foo -> IO (Vector Foo)
populateFooVec !i !v = do
    v' <- unsafeThaw v
    let go 0 = return ()
        go j = unsafeWrite v' j (Foo j $ j + 1) >> go (j - 1)
    go i
    unsafeFreeze v'

The benchmark is a standard criterion one:

    defaultMain [
      bgroup "Storable vector (mutable)"
        $ (\(i :: Int) -> env (mkFooVec (10 ^ i))
        $ \v -> bench ("10e" <> show i)
        $ nfIO (populateFooVec (10 ^ i) v))  <$> [6..8]
    ]

The gist contains other benchmarks, trying to force evaluation in different ways.

Reference C code doing more or less the same can be found here (gist). The main logic is the following:

Foo *allocFoos(long n) {
    return (Foo *) malloc(n * sizeof(Foo));
}

// populate the array with structs:
void createFoos(Foo *v, long n) {
    for (long i = 0; i < n; ++i) {
        v[i].name = i;
        v[i].id = i + 1;
    }
}

And the command used to run it: gcc -O2 -o bench benchmark.c && ./bench

Now when I run the benchmarks, the C code takes about 50ms, while Criterion reports results around 800 picoseconds (!). This makes me wonder: maybe I'm interpreting the results wrong? Maybe the vector isn't actually evaluated (although if you look at the Haskell gist, I try to force the evaluation in different ways). What am I doing wrong? If nothing -- how does vector beat a simple for loop in C (that GCC further unrolls, btw)?

Please pardon my terribly long question, I was trying to give the whole context ;)

1 Answer

#1

While I don't trust the benchmarking code, I also cannot reproduce the issue. I modified the Haskell gist (I just removed the last two benchmarks) and the C benchmark (I made it perform the operation 1000 times and then divided the times by 1000).

EDIT: I don't trust the code because:

You are using unsafe* calls that have implicit contracts you violate.
The code doesn't even compile - you have a typo and a missing language extension. This is usually an indication of other shenanigans.
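
To make the first point concrete: as I read the vector documentation, a vector passed to unsafeThaw must not be used again afterwards, yet the question's benchmark thaws the same env-provided vector on every iteration. One way to avoid the unsafe calls altogether is to allocate, fill and freeze in a single step with create. This is only a sketch, not the code from the gist; the name mkPopulatedFooVec is made up for illustration, and the Storable instance is restated from the question so that the example is self-contained.

```haskell
module Main (main) where

import Control.Monad (forM_)
import qualified Data.Vector.Storable as V
import qualified Data.Vector.Storable.Mutable as MV
import Foreign.Storable (Storable (..))

data Foo = Foo !Int !Int deriving (Show, Eq)

chunkSize :: Int
chunkSize = sizeOf (undefined :: Int)

-- Same layout as in the question: two Ints stored back to back.
instance Storable Foo where
  sizeOf _ = 2 * chunkSize
  alignment _ = chunkSize
  peek ptr = Foo <$> peekByteOff ptr 0 <*> peekByteOff ptr chunkSize
  poke ptr (Foo a b) = do
    pokeByteOff ptr 0 a
    pokeByteOff ptr chunkSize b

-- Hypothetical helper: the mutable vector never escapes the ST block,
-- so the freeze performed by `create` cannot be observed as unsafe.
mkPopulatedFooVec :: Int -> V.Vector Foo
mkPopulatedFooVec n = V.create $ do
  mv <- MV.new n
  forM_ [0 .. n - 1] $ \j -> MV.write mv j (Foo j (j + 1))
  return mv

main :: IO ()
main = do
  let v = mkPopulatedFooVec 10
  print (V.length v) -- prints 10
  print (v V.! 3)    -- prints Foo 3 4
```

Unlike the question's loop, which stops at index 1, this fills every index from 0 to n - 1. The bounds-checked write could be swapped for unsafeWrite once the indices are known to be in range.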

My Results

What is the result? Spot on, no oddities here.

% gcc bench.c -O3 && ./a.out
Starting the benchmark
[[ Malloced-array-[10000000] ]]Time taken: 11.904249 ms (cpu) 11.904249 ms (wall)
Done
./a.out  11.78s user 0.14s system 98% cpu 12.131 total

i.e. roughly 12 ms for C at 10^7 elements.

and

% ghc -O2 bench.hs && ./bench
benchmarking Storable vector (FAKE mutable)/10e6
time                 2.362 ms   (2.236 ms .. 2.561 ms)
                     0.953 R²   (0.909 R² .. 0.989 R²)
mean                 2.344 ms   (2.268 ms .. 2.482 ms)
std dev              305.0 μs   (169.1 μs .. 477.1 μs)
variance introduced by outliers: 79% (severely inflated)

benchmarking Storable vector (FAKE mutable)/10e7
time                 23.37 ms   (22.13 ms .. 24.73 ms)
                     0.989 R²   (0.979 R² .. 0.996 R²)
mean                 23.19 ms   (22.63 ms .. 23.76 ms)
std dev              1.287 ms   (1.015 ms .. 1.713 ms)
variance introduced by outliers: 19% (moderately inflated)

benchmarking Storable vector (FAKE mutable)/10e8
time                 232.2 ms   (215.1 ms .. 247.3 ms)
                     0.994 R²   (0.974 R² .. 1.000 R²)
mean                 223.5 ms   (215.9 ms .. 231.5 ms)
std dev              10.41 ms   (7.887 ms .. 13.06 ms)
variance introduced by outliers: 14% (moderately inflated)

i.e. 23 ms for Haskell at 10^7 elements.

This is on a moderately new MacBook with GHC 8.2.
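
As a sanity check on these numbers, the totals can be converted to a per-element cost. The snippet below is just arithmetic on the figures reported above; the helper name nsPerElement is made up for illustration.

```haskell
module Main (main) where

import Text.Printf (printf)

-- Nanoseconds spent per element, given a total time in milliseconds.
nsPerElement :: Double -> Double -> Double
nsPerElement totalMs elements = totalMs * 1e6 / elements

main :: IO ()
main = do
  -- Figures taken from the two runs above, both at 10^7 elements.
  printf "C:       %.2f ns/element\n" (nsPerElement 11.904249 1e7) -- 1.19
  printf "Haskell: %.2f ns/element\n" (nsPerElement 23.37 1e7)     -- 2.34
```

Both come out at one to a few nanoseconds per element, which is plausible for a loop that writes two machine words. The 800 picoseconds from the question is 0.8 ns for the whole run, less than the cost of writing a single element here, so that figure most likely does not include the fill at all.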
