This question already has an answer here:
- Global / local environment affects Haskell's Criterion benchmarks results (1 answer)
I am benchmarking Haskell's array libraries (the array and vector packages) to find the best way of storing large data for my use case. I am using criterion as the benchmarking tool.
Long story short: my code simply allocates a vector and proceeds to fill it with simple structs (1M, 10M, and 100M elements, respectively). When I compare the Haskell benchmark times with a simple reference implementation I wrote in C, Haskell is a few times faster and I find it suspicious: the C code is a simple loop filling the structs in the array.
The question: is it possible for Haskell's vector library to beat C in terms of performance? Or does it mean my benchmarks are flawed, something is not actually evaluated, or there's some other 'gotcha'?
Another question: how can I make sure that the Haskell vectors are actually evaluated?
Longer explanation: the task at hand is to fill a vector with a large number of structs. They have Storable instances, and the vector used is Data.Vector.Storable.
The data type is the following:
data Foo = Foo Int Int deriving (Show, Eq, Generic, NFData)
And the Storable instance looks like this:
chunkSize :: Int
chunkSize = sizeOf (undefined :: Int)
{-# INLINE chunkSize #-}

instance Storable Foo where
  sizeOf _ = 2 * chunkSize ; {-# INLINE sizeOf #-}
  alignment _ = chunkSize ; {-# INLINE alignment #-}
  peek ptr = Foo
    <$> peekByteOff ptr 0
    <*> peekByteOff ptr chunkSize
  {-# INLINE peek #-}
  poke ptr (Foo a b) = do
    pokeByteOff ptr 0 a
    pokeByteOff ptr chunkSize b
  {-# INLINE poke #-}
The serialization itself seems to work fine. The vector is then allocated:
mkFooVec :: Int -> IO (Vector Foo)
mkFooVec !i = unsafeFreeze =<< new (i + 1)
And populated with the structs:
populateFooVec :: Int -> Vector Foo -> IO (Vector Foo)
populateFooVec !i !v = do
  v' <- unsafeThaw v
  let go 0 = return ()
      go j = unsafeWrite v' j (Foo j $ j + 1) >> go (j - 1)
  go i
  unsafeFreeze v'
The benchmark is a standard criterion one:
defaultMain [
    bgroup "Storable vector (mutable)"
      $ (\(i :: Int) -> env (mkFooVec (10 ^ i))
      $ \v -> bench ("10e" <> show i)
      $ nfIO (populateFooVec (10 ^ i) v)) <$> [6..8]
  ]
The gist contains other benchmarks, trying to force evaluation in different ways.
Reference C code doing more or less the same can be found here (gist). The main logic is the following:
Foo *allocFoos(long n) {
    return (Foo *) malloc(n * sizeof(Foo));
}

// populate the array with structs:
void createFoos(Foo *v, long n) {
    for (long i = 0; i < n; ++i) {
        v[i].name = i;
        v[i].id = i + 1;
    }
}
And the command used to run it: gcc -O2 -o bench benchmark.c && ./bench
Now when I run the benchmarks, the C code takes about 50ms, while Criterion reports results around 800 picoseconds (!). This makes me wonder: maybe I'm interpreting the results wrong? Maybe the vector isn't actually evaluated (although if you look at the Haskell gist, I try to force the evaluation in different ways). What am I doing wrong? If nothing, how does vector beat a simple for loop in C (that GCC further unrolls, btw)?
Please pardon my terribly long question, I was trying to give the whole context ;)
1 Answer
While I don't trust the benchmarking code, I also cannot reproduce the issue. I modified the Haskell gist (just removed the last two benchmarks) and the C benchmark (made it perform the operation 1000 times, then divided the total time by 1000).
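The same repeat-and-divide idea can be applied on the Haskell side as a sanity check that does not depend on criterion. The following is only a sketch under stated assumptions: it is not the modified C benchmark from this answer, the names fillPairs and meanMillis are made up for illustration, and it writes into a raw malloc'd buffer using only base, so the vector package is not involved at all.

```haskell
module Main (main) where

import Control.Monad (forM_, replicateM_)
import Foreign.Marshal.Alloc (free, mallocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff, sizeOf)
import System.CPUTime (getCPUTime)

-- | Fill a raw buffer with n pairs laid out as two Ints per element,
-- mirroring what createFoos does in the C reference (hypothetical helper).
fillPairs :: Ptr Int -> Int -> IO ()
fillPairs buf n = forM_ [0 .. n - 1] $ \i -> do
  pokeElemOff buf (2 * i) i            -- "name" field
  pokeElemOff buf (2 * i + 1) (i + 1)  -- "id" field

-- | Run an action `reps` times and return the mean CPU time per run in
-- milliseconds. getCPUTime reports picoseconds, hence the division by 1e9.
meanMillis :: Int -> IO () -> IO Double
meanMillis reps act = do
  start <- getCPUTime
  replicateM_ reps act
  end <- getCPUTime
  pure (fromIntegral (end - start) / 1e9 / fromIntegral reps)

main :: IO ()
main = do
  let n = 1000000 :: Int
  buf <- mallocBytes (2 * n * sizeOf (0 :: Int)) :: IO (Ptr Int)
  ms <- meanMillis 100 (fillPairs buf n)
  -- Read something back so the writes are observably used.
  lastId <- peekElemOff buf (2 * n - 1)
  putStrLn ("mean per run: " ++ show ms ++ " ms, last id = " ++ show lastId)
  free buf
```

If a hand-rolled loop like this lands in the millisecond range while criterion reports picoseconds for the same amount of work, the criterion number is most likely not measuring the fill at all.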
EDIT: I don't trust the code because:
- You are using unsafe* calls that have implicit contracts you violate.
- The code doesn't even compile - you have a typo and a missing language extension. This is usually an indication of other shenanigans.
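On the first point: unsafeFreeze and unsafeThaw are documented as safe only if the vector you hand in is never used again afterwards, whereas the benchmark passes the same frozen vector from env into every iteration and thaws and writes to it each time. A minimal sketch of a construction that avoids the unsafe* calls altogether, assuming the vector package's Data.Vector.Storable API (create, generate) and the Foo type from the question; mkFoos and mkFoos' are hypothetical names, and this is one possible rewrite rather than the code I actually ran:

```haskell
module Main (main) where

import Control.Monad (forM_)
import qualified Data.Vector.Storable as VS
import qualified Data.Vector.Storable.Mutable as VSM
import Foreign.Storable (Storable (..))

data Foo = Foo Int Int deriving (Show, Eq)

intSize :: Int
intSize = sizeOf (undefined :: Int)

-- Same layout as in the question: two Ints stored back to back.
instance Storable Foo where
  sizeOf _ = 2 * intSize
  alignment _ = alignment (undefined :: Int)
  peek ptr = Foo <$> peekByteOff ptr 0 <*> peekByteOff ptr intSize
  poke ptr (Foo a b) = pokeByteOff ptr 0 a >> pokeByteOff ptr intSize b

-- | Allocate, fill, and freeze in one step. VS.create runs the ST action and
-- freezes the result; the mutable vector never escapes, so nothing aliases it.
mkFoos :: Int -> VS.Vector Foo
mkFoos n = VS.create $ do
  mv <- VSM.new n
  forM_ [0 .. n - 1] $ \j -> VSM.write mv j (Foo j (j + 1))
  pure mv

-- | The same vector built without any explicit mutation.
mkFoos' :: Int -> VS.Vector Foo
mkFoos' n = VS.generate n (\j -> Foo j (j + 1))

main :: IO ()
main = do
  let v = mkFoos 5
  print (VS.toList v)
  print (v == mkFoos' 5)
```

With a pure function like this, the benchmark no longer needs env or a shared pre-allocated vector; each iteration allocates and fills its own buffer, so the timing also includes allocation.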
My Results
What is the result? Spot on, no oddities here.
% gcc bench.c -O3 && ./a.out
Starting the benchmark
[[ Malloced-array-[10000000] ]]Time taken: 11.904249 ms (cpu) 11.904249 ms (wall)
Done
./a.out 11.78s user 0.14s system 98% cpu 12.131 total
i.e. about 12 ms per run for C at 10^7 elements.
and
% ghc -O2 bench.hs && ./bench
benchmarking Storable vector (FAKE mutable)/10e6
time 2.362 ms (2.236 ms .. 2.561 ms)
0.953 R² (0.909 R² .. 0.989 R²)
mean 2.344 ms (2.268 ms .. 2.482 ms)
std dev 305.0 μs (169.1 μs .. 477.1 μs)
variance introduced by outliers: 79% (severely inflated)
benchmarking Storable vector (FAKE mutable)/10e7
time 23.37 ms (22.13 ms .. 24.73 ms)
0.989 R² (0.979 R² .. 0.996 R²)
mean 23.19 ms (22.63 ms .. 23.76 ms)
std dev 1.287 ms (1.015 ms .. 1.713 ms)
variance introduced by outliers: 19% (moderately inflated)
benchmarking Storable vector (FAKE mutable)/10e8
time 232.2 ms (215.1 ms .. 247.3 ms)
0.994 R² (0.974 R² .. 1.000 R²)
mean 223.5 ms (215.9 ms .. 231.5 ms)
std dev 10.41 ms (7.887 ms .. 13.06 ms)
variance introduced by outliers: 14% (moderately inflated)
i.e. about 23 ms for Haskell at 10^7 elements.
This is on a moderately new MacBook with GHC 8.2.
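To put these numbers on a common scale, it helps to convert them to time per element. The sketch below only does that arithmetic on the figures quoted above; nsPerElement is a made-up helper name, and applying the 800 ps figure to the 10^6 case is my assumption, since the question does not say which size it was reported for.

```haskell
module Main (main) where

-- | Nanoseconds spent per element, given a total run time in milliseconds.
nsPerElement :: Double -> Int -> Double
nsPerElement totalMs n = totalMs * 1e6 / fromIntegral n

main :: IO ()
main = do
  let n = 10 ^ (7 :: Int) :: Int
      cNs = nsPerElement 11.904249 n   -- C, from the output above
      hsNs = nsPerElement 23.37 n      -- Haskell, criterion's time estimate
      -- 800 ps expressed in ms is 8e-7; assumed here to be the 10^6 case.
      bogusNs = nsPerElement 8.0e-7 (10 ^ (6 :: Int))
  putStrLn ("C:        " ++ show cNs ++ " ns/element")
  putStrLn ("Haskell:  " ++ show hsNs ++ " ns/element")
  putStrLn ("ratio:    " ++ show (hsNs / cNs))
  putStrLn ("'800 ps': " ++ show bogusNs ++ " ns/element")
```

Both measured runs come out at a nanosecond or two per 16-byte element, with Haskell roughly twice as slow as C. A total of 800 ps for a million elements would mean far less than a femtosecond per element, which no hardware can do, so that result can only mean the work was not performed inside the timed region.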