内联函数是作为参数传递的，是否真的在C / C ++中内联执行？

I have a very long (in number of iterations) for loop, and I like to make it possible to personalize some of its parts. The code looks as following:

我有一个非常长的(迭代次数)循环,我喜欢使其个性化的一些部分成为可能。代码如下所示:

function expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(i=0; i<VeryLargeN; i++){
       element=elements[i]
       // long computation that produce a boolean condition
       if (condition){ 
         do_true(element); 
       }else{
         do_false(element);
       }
    }
}

Now, the problem is that every time do_true and do_false are called, there is an overhead due to the push/pop of the stack that ruins the high performance of the code.

现在,问题是每次调用do_true和do_false时,都会由于堆栈的推送/弹出而导致开销,这会破坏代码的高性能。

To solve this I could simply create several copies of the expensive_loop function, each with its own do_true and do_false implementation. This will make impossible the code to mantain.

为了解决这个问题,我可以简单地创建一些expensive_loop函数的副本,每个副本都有自己的do_true和do_false实现。这将使代码无法保留。

So, how does someone make the internal part of an iteration so it can be personalized, and still mantain high performance?

那么,有人如何制作迭代的内部部分,以便它可以个性化,并且仍然保持高性能?

2 个解决方案

#1

The problem is that the function address (what actually is set in do_true and do_false is not resolved until link time, where there are not many opportunities for optimization.

问题是函数地址(实际在do_true和do_false中设置的内容直到链接时才解决,其中优化机会不多。

If you are explicitly setting both functions in the code (i.e., the functions themselves don't come from an external library, etc.), you can declare your function with C++ templates, so that the compiler knows exactly which functions you want to call at that time.

如果您在代码中显式设置这两个函数(即,函数本身不是来自外部库等),您可以使用C ++模板声明您的函数,以便编译器确切地知道您要调用哪些函数那时候。

struct function_one {
  void operator()( int element ) {
  }
};

extern int elements[];
extern bool condition();

template < typename DoTrue, typename DoFalse >
void expensive_loop(){
  DoTrue do_true;
  DoFalse do_false;

  for(int i=0; i<50; i++){
    int element=elements[i];
    // long computation that produce a boolean condition
    if (condition()){ 
      do_true(element); // call DoTrue's operator()
    }else{
      do_false(element); // call DoFalse's operator()
    }
  }
}

int main( int argc, char* argv[] ) {
    expensive_loop<function_one,function_one>();

return 0;
}

The compiler will instantiate an expensive_loop function for each combination of DoTrue and DoFalse types you specify. It will increase the size of the executable if you use more than one combination, but each of them should do what you expect.

编译器将为您指定的DoTrue和DoFalse类型的每个组合实例化一个expensive_loop函数。如果您使用多个组合,它将增加可执行文件的大小,但每个组合都应该按预期执行。

For the example I shown, note how the function is empty. The compiler just strips away the function call and leaves the loop:

对于我显示的示例,请注意该函数是如何为空。编译器只是删除函数调用并离开循环:

main:
    push    rbx
    mov     ebx, 50
.L2:
    call    condition()
    sub     ebx, 1
    jne     .L2
    xor     eax, eax
    pop     rbx
    ret

See example in https://godbolt.org/g/hV52Nn

请参阅https://godbolt.org/g/hV52Nn中的示例

Using function pointers as in your example, may not inline the function calls. This is the produced assembler for main and expensive_loop in a program where expensive_loop

在示例中使用函数指针,可能不会内联函数调用。这是用于main_loop的程序中main和expensive_loop的生成汇编程序

// File A.cpp
void foo( int arg );
void bar( int arg );

extern bool condition();
extern int elements[];

void expensive_loop( void (*do_true)(int),  void (*do_false)(int)){
    for(int i=0; i<50; i++){
       int element=elements[i];
       // long computation that produce a boolean condition
       if (condition()){
         do_true(element);
       }else{
         do_false(element);
       }
    }
}

int main( int argc, char* argv[] ) {
    expensive_loop( foo, bar );

    return 0;
}

and the functions passed by argument

和参数传递的函数

// File B.cpp
#include <math.h>

int elements[50];

bool condition() {
    return elements[0] == 1;
}

inline int foo( int arg ) {
    return arg%3;
}

inline int bar( int arg ) {
    return 1234%arg;
}

are defined in different translation units.

以不同的翻译单位定义。

0000000000400620 <expensive_loop(void (*)(int), void (*)(int))>:
  400620:       41 55                   push   %r13
  400622:       49 89 fd                mov    %rdi,%r13
  400625:       41 54                   push   %r12
  400627:       49 89 f4                mov    %rsi,%r12
  40062a:       55                      push   %rbp
  40062b:       53                      push   %rbx
  40062c:       bb 60 10 60 00          mov    $0x601060,%ebx
  400631:       48 83 ec 08             sub    $0x8,%rsp
  400635:       eb 19                   jmp    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  400637:       66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40063e:       00 00
  400640:       48 83 c3 04             add    $0x4,%rbx
  400644:       41 ff d5                callq  *%r13
  400647:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40064e:       74 1d                   je     40066d <expensive_loop(void (*)(int), void (*)(int))+0x4d>
  400650:       8b 2b                   mov    (%rbx),%ebp
  400652:       e8 79 ff ff ff          callq  4005d0 <condition()>
  400657:       84 c0                   test   %al,%al
  400659:       89 ef                   mov    %ebp,%edi
  40065b:       75 e3                   jne    400640 <expensive_loop(void (*)(int), void (*)(int))+0x20>
  40065d:       48 83 c3 04             add    $0x4,%rbx
  400661:       41 ff d4                callq  *%r12
  400664:       48 81 fb 28 11 60 00    cmp    $0x601128,%rbx
  40066b:       75 e3                   jne    400650 <expensive_loop(void (*)(int), void (*)(int))+0x30>
  40066d:       48 83 c4 08             add    $0x8,%rsp
  400671:       5b                      pop    %rbx
  400672:       5d                      pop    %rbp
  400673:       41 5c                   pop    %r12
  400675:       41 5d                   pop    %r13
  400677:       c3                      retq
  400678:       0f 1f 84 00 00 00 00    nopl   0x0(%rax,%rax,1)
  40067f:       00

You can see how the calls are still performed even when using -O3 optimization level:

即使使用-O3优化级别,您也可以看到调用仍然是如何执行的:

400644:       41 ff d5                callq  *%r13

#2

Note that the function accepts pointers to functions, so those get called through a pointer. The optimizer may inline those calls through the function pointers if the definitions of expensive_loop and those functions are available and the compiler inlining limits have not been breached.

请注意,该函数接受函数指针,因此通过指针调用它们。如果expensive_loop和那些函数的定义可用且编译器内联限制未被破坏,优化器可以通过函数指针内联这些调用。

Another option is to make this algorithm a function template that accepts callable objects (function pointers, objects with a call operator, lambdas), just like standard algorithms do. This way the compiler may have more optimization opportunities. E.g.:

另一个选择是使这个算法成为一个函数模板,它接受可调用对象(函数指针,带调用运算符的对象,lambdas),就像标准算法一样。这样编译器可能有更多的优化机会。例如。:

template<class DoTrue, class DoFalse>
void expensive_loop(DoTrue do_true, DoFalse do_false) { 
    // Original function body here.
}

There is -Winline compiler switch for g++:

有用于g ++的-Winline编译器开关:

-Winline

Warn if a function can not be inlined and it was declared as inline. Even with this option, the compiler will not warn about failures to inline functions declared in system headers.

如果函数无法内联并且声明为内联,则发出警告。即使使用此选项,编译器也不会警告系统头中声明的内联函数失败。

The compiler uses a variety of heuristics to determine whether or not to inline a function. For example, the compiler takes into account the size of the function being inlined and the the amount of inlining that has already been done in the current function. Therefore, seemingly insignificant changes in the source program can cause the warnings produced by -Winline to appear or disappear.

编译器使用各种启发式方法来确定是否内联函数。例如,编译器会考虑内联函数的大小以及当前函数中已经完成的内联量。因此,源程序中看似无关紧要的更改可能会导致-Winline生成的警告出现或消失。

It probably does not warn about a function not being inlined when it is called through a pointer.

当通过指针调用函数时,它可能不会警告函数没有内联。

#1