gcc的asm volatile是否与gfortran默认的递归设置等价?

I was just playing around with recursive functions in C++ and Fortran and I realised that a simple recursive function in Fortran is almost twice as fast as its equivalent C++ function. Now, before getting into this, I know there are similar questions here, specifically:

我只是在c++和Fortran中使用递归函数，我意识到Fortran中的一个简单递归函数的速度几乎是它的等效c++函数的两倍。在开始之前，我知道这里有类似的问题，尤其是:

Why does adding assembly comments cause such radical change in generated code?
为什么添加汇编注释会对生成的代码造成如此彻底的改变?
Working of asm volatile (“” : : : “memory”)
asm volatile的工作(“:”:“内存”)
Equivalent to asm volatile in gfortran
相当于gfortran中的asm挥发物

However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile in gcc. To give you some context let's consider the following recursive Fibonacci number implementation:

但是，我比Fortran编译器更具体、更困惑，因为它似乎正在做您可以在gcc中使用asm来实现的事情。为了给您提供一些上下文，让我们考虑以下递归斐波纳契数实现:

Fortran Code:

Fortran代码:

module test
implicit none
private
public fib

contains

! Fibonacci function
integer recursive function fib(n) result(r)
    integer, intent(in) :: n
    if (n < 2) then
        r = n
    else
        r = fib(n-1) + fib(n-2)
    end if
end function  ! end of Fibonacci function
end module

program fibonacci
use test, only: fib
implicit none
integer :: r,i 
integer :: n = 1e09
real(8) :: start, finish, cum_time

cum_time=0
do i= 1,n 
    call cpu_time(start)
    r = fib(20)
    call cpu_time(finish) 
    cum_time = cum_time + (finish - start)
    if (cum_time >0.5) exit
enddo  

print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us' 
end program

Compiled with:

编译:

gfortran -O3 -march=native

C++ Code:

c++代码:

#include <iostream>
#include <chrono>
using namespace std;

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    return r;
} // end of fib

template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
    double counter = 1.0;
    double mean_time = 0.0;
    for (auto iter=0; iter<1e09; ++iter){
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();

        func(args...);

        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;

        mean_time += elapsed_seconds.count();
        counter++;

        if (mean_time > 0.5){
            mean_time /= counter;
            std::cout << static_cast<long int>(counter)
            << " runs, average elapsed time is "
            << mean_time/1.0e-06 << " \xC2\xB5s" << std::endl; 
            break;
        }
    }
    return mean_time;
}

int main(){
    timeit(fib,20);
    return 0;
}

Compiled with:

编译:

g++ -O3 -march=native

Timing:

时间:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 12355 runs, average elapsed time is 40.471 µs

So gfortran is twice as fast there compared to gcc. Looking at the assembly code, I get

所以gfortran比gcc快一倍。看一下汇编代码，我得到

Assembly (Fortran):

组装(Fortran):

.L28:
    cmpl    $1, %r13d
    jle .L29
    leal    -8(%rbx), %eax
    movl    %ecx, 12(%rsp)
    movl    %eax, 48(%rsp)
    leaq    48(%rsp), %rdi
    leal    -9(%rbx), %eax
    movl    %eax, 16(%rsp)
    call    __bench_MOD_fib
    leaq    16(%rsp), %rdi
    movl    %eax, %r13d
    call    __bench_MOD_fib
    movl    12(%rsp), %ecx
    addl    %eax, %r13d

Assembly (C++):

组装(c++):

.L28:
    movl    72(%rsp), %edx
    cmpl    $1, %edx
    movl    %edx, %eax
    jle .L33
    subl    $3, %eax
    movl    $0, 52(%rsp)
    movl    %eax, %esi
    movl    %eax, 96(%rsp)
    movl    92(%rsp), %eax
    shrl    %eax
    movl    %eax, 128(%rsp)
    addl    %eax, %eax
    subl    %eax, %esi
    movl    %edx, %eax
    subl    $1, %eax
    movl    %esi, 124(%rsp)
    movl    %eax, 76(%rsp)

Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib function whereas in the C++ assembly, gcc has probably unrolled all the recursive calls which probably requires more stack push/pop and tail jumps.

这两个汇编代码都是由几乎类似的块/标签组成的，重复一遍又一遍。正如您所看到的，Fortran程序集对fib函数进行了两次调用，而在c++程序集中，gcc可能已经展开了所有的递归调用，这可能需要更多的栈push/pop和tail跳转。

Now if I just put one inline assembly comment in the C++ code like so

现在，如果我在c++代码中加入一个内联程序集注释

Modified C++ Code:

修改过的c++代码:

// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    asm("");
    return r;
} // end of fib

The generated assembly code, changes to

生成的程序集代码将更改为

Assembly (C++ Modified):

组装(c++修改):

.L7:
    cmpl    $1, %edx
    jle .L17
    leal    -4(%rbx), %r13d
    leal    -5(%rbx), %edx
    cmpl    $1, %r13d
    jle .L19
    leal    -5(%rbx), %r14d
    cmpl    $1, %r14d
    jle .L55
    leal    -6(%rbx), %r13d
    movl    %r13d, %edi
    call    _Z3fibi
    leal    -7(%rbx), %edi
    movl    %eax, %r15d
    call    _Z3fibi
    movl    %r13d, %edi
    addl    %eax, %r15d

You can now see two calls to fib function. Timing them gives me

现在可以看到两个调用fib函数的调用。时间他们给我

Timing:

时间:

Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 25757 runs, average elapsed time is 19.412 µs

I know the effect of asm with no output and asm volatile is to suppress aggressive compiler optimisations but in this case, gcc thought it was too smart but ended up generating a less efficient code in the first place.

我知道asm没有输出和asm volatile的影响是抑制激进的编译器优化，但在这种情况下，gcc认为它太聪明了，但最终生成的代码效率更低。

So the question is:

所以问题是:

Why can gcc not see this "optimisation", when gfortan clearly can?
为什么gcc不能看到这种“优化”，而gfortan显然可以看到呢?
The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
内联装配线必须在返回语句之前。把它放在别的地方，就不会有任何效果。为什么?
Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
这个行为编译器是特定的吗?例如，你能模仿clang/MSVC的相同行为吗?
Are there safer ways to make recursions faster in C or C++ (without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?
在C或c++中是否有更安全的方法使递归更快(不依赖内联汇编或迭代式编码)?也许可变模板?

UPDATE:

更新:

The result shown above are all with gcc 4.8.4. I have also tried compiling it with gcc 4.9.2 and gcc 5.2 and I get identical results.
上面显示的结果都是gcc 4.8.4。我也尝试过用gcc 4.9.2和gcc 5.2编译它，得到了相同的结果。
The problem can also be replicated (fixed?) if instead of putting asm I declare the input argument as volatile i.e. (volatile int n) instead of (const int n), although this will result in a tad bit slower run-time on my machine.
如果我声明输入参数为volatile int (volatile int n)而不是const int n (const int n)，而不是asm(固定?)，那么这个问题也可以被复制(尽管这会导致我的机器运行时稍微慢一点)。
As Michael Karcher has mentioned, we can instead pass the -fno-optimize-sibling-calls flag to fix this problem. Since this flag is activated at -O2 level and beyond, even compiling with -O1 fixes this problem.
正如Michael Karcher所提到的，我们可以通过-fno-优化-sibling-call标志来解决这个问题。因为这个标志被激活在-O2级别和之后，甚至用-O1编译就解决了这个问题。
I have run the same example with clang 3.5.1 with -O3 -march=native and though the situation is not identical, clang also appears to generate faster code with asm.
我在clang 3.5.1中运行了相同的例子，使用-O3 -march=native，虽然情况并不相同，但clang似乎也生成了更快的asm代码。

Clang Timing:

铿锵声时间:

clang++ w/o asm    :  8846 runs, average elapsed time is 56.4555 µs
clang++ with asm   : 10427 runs, average elapsed time is 47.8991 µs

1 个解决方案

#1

See the bold print near the end of this answer on how to get a fast program generated by gcc. Read the answer for replies to the four questions.

关于如何获得由gcc生成的快速程序，请参阅本文末尾的粗体字。请阅读以下四个问题的答案。

Your first question assumes that gfortran is able to see an optimization possibility that gcc failed to see. In fact, the opposite is true. gcc identified something that it considered to be an optimization possibility, while gfortran missed it. Alas, gcc was wrong and the optimization it applied turns out to be a 100% speed loss on your system (comparable on mine).

第一个问题假设gfortran能够看到gcc没有看到的优化可能性。事实上，恰恰相反。gcc识别了一些它认为是优化可能性的东西，而gfortran忽略了它。唉，gcc错了，它所应用的优化结果是系统100%的速度损失(与我的系统相当)。

To address your second question: The asm statement prevented an internal transformation that made gcc see the false optimization possiblity. Without the asm statement, your code got (validly) transformed to:

要解决第二个问题:asm语句阻止了内部转换，使gcc看到了错误的优化可能性。没有asm语句，您的代码(有效地)被转换为:

int fib(const int n)
{
    if (n < 2)
        return n;
    else
        return fib(n-1) + fib(n-2);
}

The return statement containing the recursive call triggers the "sibling calls optimization" that pessimizes your code. Including the asm statement prevents moving the return instruction across it.

包含递归调用的返回语句触发“兄弟调用优化”，使代码变得悲观。包含asm语句可以防止在其中移动返回指令。

Currently, I only have gcc at hand, so I can't try the behaviour of other compilers to answer your third question by evidence, but this seems definitely compiler dependent. You ran into a quirk (or bug, however you call it) of gcc that it generates bad code while trying to optimize it. The optimizers of different compilers are highly different, so it is quite likely that some other compiler doesn't mis-optimize your code like gcc does. On the other hand, code transformations for optimization is a well-researched topic, and most compilers are implementing similar approaches to optimization, so it is possible that another compiler steps into the same trap as gcc.

目前，我手头只有gcc，所以我不能尝试其他编译器的行为来通过证据回答您的第三个问题，但这显然是依赖于编译器的。您遇到了gcc的一个怪癖(或bug，无论您怎么称呼它)，它在试图优化它时生成了错误的代码。不同编译器的优化器是高度不同的，所以很可能其他编译器不会像gcc那样错误地优化代码。另一方面，用于优化的代码转换是一个经过深入研究的主题，大多数编译器都实现了类似的优化方法，因此有可能另一个编译器会步进与gcc相同的陷阱。

To address you last question: It is not a problem about C/C++ versus Fortan, but a problem about gcc that messes up this example program (and likely similar production programs). So there is no way to make recursion faster in C++, but there is a way to speed up this example in gcc, by disabling the problematic optimization: -fno-optimize-sibling-calls, which results (on my system, in a single test run) in even faster code than just inserting the asm statement.

最后一个问题:这不是一个关于C/ c++和Fortan的问题，而是一个关于gcc的问题，它使这个示例程序(以及可能类似的生产程序)混乱。因此，在c++中无法更快地实现递归，但是有一种方法可以在gcc中加速这个例子，通过禁用问题优化:-fno-优化-sibling-调用，结果(在我的系统中，在一个测试运行中)甚至比插入asm语句更快。

#1