为什么Haskell代码使用-O运行得更慢?

时间:2022-01-20 17:04:19

This piece of Haskell code runs much slower with -O, but -O should be non-dangerous. Can anyone tell me what happened? If it matters, it is an attempt to solve this problem, and it uses binary search and persistent segment tree:

这段Haskell代码的运行速度要比-O慢得多,但是-O应该是不危险的。谁能告诉我发生了什么事吗?如果重要的话,这是解决这个问题的一种尝试,它使用了二叉搜索和持久化段树:

import Control.Monad
import Data.Array

data Node =
      Leaf   Int           -- value
    | Branch Int Node Node -- sum, left child, right child
type NodeArray = Array Int Node

-- create an empty node with range [l, r)
create :: Int -> Int -> Node
create l r
    | l + 1 == r = Leaf 0
    | otherwise  = Branch 0 (create l m) (create m r)
    where m = (l + r) `div` 2

-- Get the sum in range [0, r). The range of the node is [nl, nr)
sumof :: Node -> Int -> Int -> Int -> Int
sumof (Leaf val) r nl nr
    | nr <= r   = val
    | otherwise = 0
sumof (Branch sum lc rc) r nl nr
    | nr <= r   = sum
    | r  > nl   = (sumof lc r nl m) + (sumof rc r m nr)
    | otherwise = 0
    where m = (nl + nr) `div` 2

-- Increase the value at x by 1. The range of the node is [nl, nr)
increase :: Node -> Int -> Int -> Int -> Node
increase (Leaf val) x nl nr = Leaf (val + 1)
increase (Branch sum lc rc) x nl nr
    | x < m     = Branch (sum + 1) (increase lc x nl m) rc
    | otherwise = Branch (sum + 1) lc (increase rc x m nr)
    where m = (nl + nr) `div` 2

-- signature said it all
tonodes :: Int -> [Int] -> [Node]
tonodes n = reverse . tonodes' . reverse
    where
        tonodes' :: [Int] -> [Node]
        tonodes' (h:t) = increase h' h 0 n : s' where s'@(h':_) = tonodes' t
        tonodes' _ = [create 0 n]

-- find the minimum m in [l, r] such that (predicate m) is True
binarysearch :: (Int -> Bool) -> Int -> Int -> Int
binarysearch predicate l r
    | l == r      = r
    | predicate m = binarysearch predicate l m
    | otherwise   = binarysearch predicate (m+1) r
    where m = (l + r) `div` 2

-- main, literally
main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    nodes <- fmap (listArray (0, n) . tonodes n . map (subtract 1) . map read . words) getLine
    replicateM_ m $ query n nodes
    where
        query :: Int -> NodeArray -> IO ()
        query n nodes = do
            [p, k] <- fmap (map read . words) getLine
            print $ binarysearch (ok nodes n p k) 0 n
            where
                ok :: NodeArray -> Int -> Int -> Int -> Int -> Bool
                ok nodes n p k s = (sumof (nodes ! min (p + s + 1) n) s 0 n) - (sumof (nodes ! max (p - s) 0) s 0 n) >= k

(This is exactly the same code with code review but this question addresses another problem.)

(这与代码检查的代码完全相同,但这个问题解决了另一个问题。)

This is my input generator in C++:

这是我在c++的输入生成器:

#include <cstdio>
#include <cstdlib>
using namespace std;
int main (int argc, char * argv[]) {
    srand(1827);
    int n = 100000;
    if(argc > 1)
        sscanf(argv[1], "%d", &n);
    printf("%d %d\n", n, n);
    for(int i = 0; i < n; i++)
        printf("%d%c", rand() % n + 1, i == n - 1 ? '\n' : ' ');
    for(int i = 0; i < n; i++) {
        int p = rand() % n;
        int k = rand() % n + 1;
        printf("%d %d\n", p, k);
    }
}

In case you don't have a C++ compiler available, this is the result of ./gen.exe 1000.

如果您没有可用的c++编译器,这是/gen的结果。exe 1000。

This is the execution result on my computer:

这是我电脑上的执行结果:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.3
$ ghc -fforce-recomp 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m0.088s
user    0m0.015s
sys     0m0.015s
$ ghc -fforce-recomp -O 1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ time ./gen.exe 1000 | ./1827.exe > /dev/null
real    0m2.969s
user    0m0.000s
sys     0m0.045s

And this is the heap profile summary:

这是堆概要文件:

$ ghc -fforce-recomp -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
      70,207,096 bytes allocated in the heap
       2,112,416 bytes copied during GC
         613,368 bytes maximum residency (3 sample(s))
          28,816 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       132 colls,     0 par    0.00s    0.00s     0.0000s    0.0004s
  Gen  1         3 colls,     0 par    0.00s    0.00s     0.0006s    0.0010s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    0.03s  (  0.03s elapsed)
  GC      time    0.00s  (  0.01s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    0.03s  (  0.04s elapsed)
  %GC     time       0.0%  (14.7% elapsed)
  Alloc rate    2,250,213,011 bytes per MUT second
  Productivity 100.0% of total user, 83.1% of total elapsed
$ ghc -fforce-recomp -O -rtsopts ./1827.hs
[1 of 1] Compiling Main             ( 1827.hs, 1827.o )
Linking 1827.exe ...
$ ./gen.exe 1000 | ./1827.exe +RTS -s > /dev/null
   6,009,233,608 bytes allocated in the heap
     622,682,200 bytes copied during GC
         443,240 bytes maximum residency (505 sample(s))
          48,256 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)
                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     10945 colls,     0 par    0.72s    0.63s     0.0001s    0.0004s
  Gen  1       505 colls,     0 par    0.16s    0.13s     0.0003s    0.0005s
  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    2.00s  (  2.13s elapsed)
  GC      time    0.87s  (  0.76s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.89s  (  2.90s elapsed)
  %GC     time      30.3%  (26.4% elapsed)
  Alloc rate    3,009,412,603 bytes per MUT second
  Productivity  69.7% of total user, 69.4% of total elapsed

1 个解决方案

#1


38  

I guess it is time this question gets a proper answer.

我想该是这个问题得到正确答案的时候了。

What happened to your code with -O

Let me zoom in your main function, and rewrite it slightly:

我放大你的主函数,稍微重写一下:

main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
    replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query.

显然,这里的意图是,NodeArray创建一次,然后在查询的每一个m调用中使用。

Unfortunately, GHC transforms this code to, effectively,

不幸的是,GHC将这段代码有效地转换为,

main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    replicateM_ m $ do
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        query n nodes

and you can immediately see the problem here.

你马上就能看到问题了。

What is the state hack, and why does it destroy my programs performance

The reason is the state hack, which says (roughly): “When something is of type IO a, assume it is called only once.”. The official documentation is not much more elaborate:

原因是state hack(大致):“当某个东西是IO a类型时,假设它只被调用一次。”官方文件并没有详细说明:

-fno-state-hack

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

关闭“state hack”,使任何带有状态#令牌作为参数的lambda都被认为是单条目,因此内联在其中是可以的。这可以提高IO和ST monad代码的性能,但是会降低共享的风险。

Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g.

大致来说,思路是这样的:如果定义一个具有IO类型和where子句的函数,例如。

foo x = do
    putStrLn y
    putStrLn y
  where y = ...x...

Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly)

IO a类型的东西可以被看作是RealWord ->类型的东西(a, RealWorld)。在这个视图中,上面变成(粗略地)

foo x = 
   let y = ...x... in 
   \world1 ->
     let (world2, ()) = putStrLn y world1
     let (world3, ()) = putStrLn y world2
     in  (world3, ())

A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this:

对foo的调用(通常)看起来像这个foo参数世界。但是foo的定义只接受一个参数,而另一个参数只被一个本地lambda表达式消耗!对foo的调用很慢。如果代码是这样的话会快得多:

foo x world1 = 
   let y = ...x... in 
   let (world2, ()) = putStrLn y world1
   let (world3, ()) = putStrLn y world2
   in  (world3, ())

This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics).

这被称为etc - expand,并基于各种理由(例如,通过分析函数的定义,检查如何调用它,以及——在本例中——定向启发式)来实现。

Unfortunately it is unsound if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes.

不幸的是,如果对foo的调用实际上是让fooArgument = foo参数的形式,即带一个参数,但是还没有传递任何世界(还),那么它就不可靠。在原始代码中,如果fooArgument被多次使用,那么y仍然只被计算一次,并被共享。在修改后的代码中,y将每次都重新计算——确切地说,你的节点发生了什么。

Can things be fixed?

Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.

可能。参见第9388条,尝试这样做。修复它的问题是,在很多情况下,即使编译器不能肯定地知道转换是否成功,它也会降低性能。在技术上可能有一些情况是不允许的,比如共享丢失了,但是它仍然是有益的,因为更快的调用带来的加速超过了重新计算的额外成本。所以现在还不清楚从哪里开始。

#1


38  

I guess it is time this question gets a proper answer.

我想该是这个问题得到正确答案的时候了。

What happened to your code with -O

Let me zoom in your main function, and rewrite it slightly:

我放大你的主函数,稍微重写一下:

main :: IO ()
main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
    replicateM_ m $ query n nodes

Clearly, the intention here is that the NodeArray is created once, and then used in every of the m invocations of query.

显然,这里的意图是,NodeArray创建一次,然后在查询的每一个m调用中使用。

Unfortunately, GHC transforms this code to, effectively,

不幸的是,GHC将这段代码有效地转换为,

main = do
    [n, m] <- fmap (map read . words) getLine
    line <- getLine
    replicateM_ m $ do
        let nodes = listArray (0, n) . tonodes n . map (subtract 1) . map read . words $ line
        query n nodes

and you can immediately see the problem here.

你马上就能看到问题了。

What is the state hack, and why does it destroy my programs performance

The reason is the state hack, which says (roughly): “When something is of type IO a, assume it is called only once.”. The official documentation is not much more elaborate:

原因是state hack(大致):“当某个东西是IO a类型时,假设它只被调用一次。”官方文件并没有详细说明:

-fno-state-hack

-fno-state-hack

Turn off the "state hack" whereby any lambda with a State# token as argument is considered to be single-entry, hence it is considered OK to inline things inside it. This can improve performance of IO and ST monad code, but it runs the risk of reducing sharing.

关闭“state hack”,使任何带有状态#令牌作为参数的lambda都被认为是单条目,因此内联在其中是可以的。这可以提高IO和ST monad代码的性能,但是会降低共享的风险。

Roughly, the idea is as follows: If you define a function with an IO type and a where clause, e.g.

大致来说,思路是这样的:如果定义一个具有IO类型和where子句的函数,例如。

foo x = do
    putStrLn y
    putStrLn y
  where y = ...x...

Something of type IO a can be viewed as something of type RealWord -> (a, RealWorld). In that view, the above becomes (roughly)

IO a类型的东西可以被看作是RealWord ->类型的东西(a, RealWorld)。在这个视图中,上面变成(粗略地)

foo x = 
   let y = ...x... in 
   \world1 ->
     let (world2, ()) = putStrLn y world1
     let (world3, ()) = putStrLn y world2
     in  (world3, ())

A call to foo would (typically) look like this foo argument world. But the definition of foo only takes one argument, and the other one is only consumed later by a local lambda expression! That is going to be a very slow call to foo. It would be much faster if the code would look like this:

对foo的调用(通常)看起来像这个foo参数世界。但是foo的定义只接受一个参数,而另一个参数只被一个本地lambda表达式消耗!对foo的调用很慢。如果代码是这样的话会快得多:

foo x world1 = 
   let y = ...x... in 
   let (world2, ()) = putStrLn y world1
   let (world3, ()) = putStrLn y world2
   in  (world3, ())

This is called eta-expansion and done on various grounds (e.g. by analyzing the function’s definition, by checking how it is being called, and – in this case – type directed heuristics).

这被称为etc - expand,并基于各种理由(例如,通过分析函数的定义,检查如何调用它,以及——在本例中——定向启发式)来实现。

Unfortunately it is unsound if the call to foo is actually of the form let fooArgument = foo argument, i.e. with an argument, but no world passed (yet). In the original code, if fooArgument is then used several times, y will still be calculated only once, and shared. In the modified code, y will be re-calculated every time – precisely what has happened to your nodes.

不幸的是,如果对foo的调用实际上是让fooArgument = foo参数的形式,即带一个参数,但是还没有传递任何世界(还),那么它就不可靠。在原始代码中,如果fooArgument被多次使用,那么y仍然只被计算一次,并被共享。在修改后的代码中,y将每次都重新计算——确切地说,你的节点发生了什么。

Can things be fixed?

Possibly. See #9388 for an attempt at doing so. The problem with fixing it is that it will cost performance in a lot of cases where the transformation happens to ok, even though the compiler cannot possibly know that for sure. And there are probably cases where it is technically not ok, i.e. sharing is lost, but it is still beneficial because the speedups from the faster calling outweigh the extra cost of the recalculation. So it is not clear where to go from here.

可能。参见第9388条,尝试这样做。修复它的问题是,在很多情况下,即使编译器不能肯定地知道转换是否成功,它也会降低性能。在技术上可能有一些情况是不允许的,比如共享丢失了,但是它仍然是有益的,因为更快的调用带来的加速超过了重新计算的额外成本。所以现在还不清楚从哪里开始。