A lot of literature talks about using inline functions to "avoid the overhead of a function call". However I haven't seen quantifiable data. What is the actual overhead of a function call i.e. what sort of performance increase do we achieve by inlining functions?
16 Answers
#1
42
On most architectures, the cost consists of saving all (or some, or none) of the registers to the stack, pushing the function arguments to the stack (or putting them in registers), adjusting the stack pointer and jumping to the beginning of the new code. Then when the function is done, you have to restore the registers from the stack. This webpage has a description of what's involved in the various calling conventions.
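As a concrete illustration (a sketch of my own, not from the answer), compiling a trivial function with optimizations disabled and inspecting the assembly (e.g. with gcc -S) shows exactly that sequence: argument setup, call, optional frame setup, and return.

long add_one(long x) {    /* the callee: may set up and tear down a stack frame */
    return x + 1;         /* the result is placed in the return register */
}

long caller(long a) {
    return add_one(a);    /* the argument is placed in a register or on the stack
                             (per the calling convention), then "call" pushes the
                             return address and jumps; "ret" jumps back */
}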
Most C++ compilers are smart enough now to inline functions for you. The inline keyword is just a hint to the compiler. Some will even do inlining across translation units where they decide it's helpful.
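For example (again a sketch of my own): even without the inline keyword, a modern compiler will typically inline a function like twice() below at -O2, and compiling with g++ -O2 -S and looking for a call to twice in the generated assembly is an easy way to check whether it actually did.

static int twice(int x) { return 2 * x; }

int four_times(int x) {
    return twice(twice(x));   // usually collapsed into a single multiply/shift at -O2
}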
#2
11
There's the technical answer and the practical answer. The practical answer is that it will almost never matter, and in the very rare case where it does, the only way you'll know is through actual profiled tests.
The technical answer, which your literature refers to, is generally not relevant due to compiler optimizations. But if you're still interested, it is well described by Josh.
As far as a "percentage" goes, you'd have to know how expensive the function itself was. Outside of the cost of the called function there is no percentage, because you are comparing against a zero-cost operation. For inlined code there is no cost; the processor just moves to the next instruction. The downside to inlining is a larger code size, which manifests its costs in a different way than the stack construction/tear-down costs.
#3
8
The amount of overhead will depend on the compiler, CPU, etc. The percentage overhead will depend on the code you're inlining. The only way to know is to take your code and profile it both ways - that's why there's no definitive answer.
#4
7
Your question is one of those questions that has no answer one could call the "absolute truth". The overhead of a normal function call depends on three factors:
- The CPU. The overhead of x86, PPC, and ARM CPUs varies a lot, and even if you just stay with one architecture, the overhead also varies quite a bit between an Intel Pentium 4, Intel Core 2 Duo and an Intel Core i7. The overhead might even vary noticeably between an Intel and an AMD CPU, even if both run at the same clock speed, since factors like cache sizes, caching algorithms, memory access patterns and the actual hardware implementation of the call opcode itself can have a huge influence on the overhead.
- The ABI (Application Binary Interface). Even with the same CPU, there often exist different ABIs that specify how function calls pass parameters (via registers, via the stack, or via a combination of both) and where and how stack frame initialization and clean-up takes place. All this has an influence on the overhead. Different operating systems may use different ABIs for the same CPU; e.g. Linux, Windows and Solaris may all three use a different ABI for the same CPU.
- The Compiler. Strictly following the ABI is only important if functions are called between independent code units, e.g. if an application calls a function of a system library, or a user library calls a function of another user library. As long as functions are "private", not visible outside a certain library or binary, the compiler may "cheat". It may not strictly follow the ABI but instead use shortcuts that lead to faster function calls. E.g. it may pass parameters in registers instead of using the stack, or it may skip stack frame setup and clean-up completely if not really necessary.
If you want to know the overhead for a specific combination of the three factors above, e.g. for an Intel Core i5 on Linux using GCC, your only way to get this information is to benchmark the difference between two implementations: one using function calls, and one where you copy the code directly into the caller. That way you force inlining for sure, since the inline keyword is only a hint and does not always lead to inlining.
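A minimal sketch of such a benchmark (my own illustration; the function name work and the iteration count are mine): time a loop that calls the function, then the same loop with the body copied in by hand. The caveat is that the compiler may inline work() anyway, or collapse either loop entirely, unless the function lives in a separate translation unit, as the benchmark in answer #5 below does; checking the assembly is the only way to be sure.

#include <chrono>
#include <cstdio>

long work(long x) { return x + 1; }   // the function under test

int main() {
    using clock = std::chrono::steady_clock;
    const long N = 1000000000L;
    long sum = 0;

    auto t0 = clock::now();
    for (long i = 0; i < N; ++i)
        sum += work(i);               // version A: via a function call
    auto t1 = clock::now();
    for (long i = 0; i < N; ++i)
        sum += i + 1;                 // version B: body copied into the caller
    auto t2 = clock::now();

    auto ns = [](auto d) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(d).count();
    };
    // printing sum keeps the loops from being optimized away as dead code
    std::printf("%ld: call %lld ns, manual inline %lld ns\n",
                sum, (long long)ns(t1 - t0), (long long)ns(t2 - t1));
}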
However, the real question here is: does the exact overhead really matter? One thing is for sure: a function call always has an overhead. It may be small, it may be big, but it certainly exists. And no matter how small it is, if a function is called often enough in a performance-critical section, the overhead will matter to some degree. Inlining rarely makes your code slower, unless you terribly overdo it; it will make the code bigger though. Today's compilers are pretty good at deciding themselves when to inline and when not, so you hardly ever have to rack your brain about it.
Personally, I ignore inlining during development completely, until I have a more or less usable product that I can profile, and only if profiling tells me that a certain function is called really often, and also within a performance-critical section of the application, will I consider "force-inlining" that function.
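When that point comes, a portable force-inline wrapper is the usual trick. A sketch (these spellings do exist: __forceinline in MSVC, the always_inline attribute in GCC and Clang), though even these are requests that can fail in corner cases:

#if defined(_MSC_VER)
    #define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    #define FORCE_INLINE inline __attribute__((always_inline))
#else
    #define FORCE_INLINE inline   /* fall back to the plain hint */
#endif

FORCE_INLINE static unsigned long inc(unsigned long x) { return x + 1; }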
So far my answer is very generic; it applies to C as much as it applies to C++ and Objective-C. As a closing word, let me say something about C++ in particular: methods that are virtual are doubly indirect function calls, which means they have a higher function call overhead than normal function calls, and they also cannot be inlined. Non-virtual methods may or may not be inlined by the compiler, but even if they are not inlined, they are still significantly faster than virtual ones, so you should not make methods virtual unless you really plan to override them or have them overridden.
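To illustrate that closing point, here is a sketch of the typical vtable-based implementation (the C++ standard does not mandate this layout, but mainstream compilers use it; the class and names are mine):

struct Shape {
    virtual double area() const = 0;  // virtual: the call site must load the vptr,
                                      // fetch the function address, then call it
    double side = 1.0;
    double perimeter4() const { return 4 * side; }  // non-virtual: direct call, inlinable
};

double report(const Shape& s) {
    return s.area()          // dynamic type unknown here -> cannot be inlined
         + s.perimeter4();   // typically inlined away entirely at -O2
}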
#5
6
I made a simple benchmark against a simple increment function:
inc.c:
typedef unsigned long ulong;

ulong inc(ulong x){
    return x+1;
}
main.c:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ulong;

#ifdef EXTERN
ulong inc(ulong);
#else
static inline ulong inc(ulong x){
    return x+1;
}
#endif

int main(int argc, char** argv){
    if (argc < 1+1)
        return 1;
    ulong i, sum = 0, cnt;
    cnt = atoi(argv[1]);
    for(i=0;i<cnt;i++){
        sum+=inc(i);
    }
    printf("%lu\n", sum);
    return 0;
}
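(The answer doesn't show the build commands; presumably something along these lines, where the optimization level chosen matters a great deal:

gcc -O2 main.c -o inlined                  # inc() is the static inline version
gcc -O2 -DEXTERN main.c inc.c -o linked    # inc() lives in a separate translation unit
)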
Running it with a billion iterations on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz gave me:
- 1.4 seconds for the inlined version
- 4.4 seconds for the regularly linked version
(It appears to fluctuate by up to 0.2 s, but I'm too lazy to calculate proper standard deviations, nor do I care about them.)
This suggests that the overhead of function calls on this computer is about 3 nanoseconds.
The fastest thing I measured on it took about 0.3 ns, so, to put it very simplistically, that would suggest a function call costs about 9 primitive ops.
This overhead increases by about another 2 ns per call (total call time about 6 ns) for functions called through a PLT (functions in a shared library).
#6
4
For very small functions inlining makes sense, because the (small) cost of the function call is significant relative to the (very small) cost of the function body. For most functions over a few lines it's not a big win.
#7
3
It's worth pointing out that an inlined function increases the size of the calling function, and anything that increases the size of a function may have a negative effect on caching. If you're right at a boundary, "just one more wafer-thin mint" of inlined code might have a dramatically negative effect on performance.
If you're reading literature that's warning about "the cost of a function call," I'd suggest it may be older material that doesn't reflect modern processors. Unless you're in the embedded world, the era in which C is a "portable assembly language" has essentially passed. A large amount of the ingenuity of the chip designers in the past decade (say) has gone into all sorts of low-level complexities that can differ radically from the way things worked "back in the day."
#8
1
There is a great concept called 'register shadowing', which allows values (up to 6?) to be passed through registers (on the CPU) instead of the stack (memory). Also, depending on the function and the variables used within it, the compiler may simply decide that frame management code is not required!
Also, even a C++ compiler may do a 'tail recursion optimization', i.e. if A() calls B(), and after calling B(), A just returns, the compiler will reuse the stack frame!
Of course, all this can only be done if the program sticks to the semantics of the standard (see pointer aliasing and its effect on optimizations).
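A tiny sketch of that tail-call case (my example; whether the frame is actually reused depends on the compiler and optimization level):

long B(long x);          /* defined elsewhere */

long A(long x) {
    /* tail position: nothing is left to do in A after B returns, so at -O2
       compilers typically emit a plain jmp to B, reusing A's stack frame,
       instead of a call followed by ret */
    return B(x + 1);
}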
#9
1
Modern CPUs are very fast (obviously!). Almost every operation involved in calls and argument passing is a full-speed instruction (indirect calls might be slightly more expensive, mostly the first time through a loop).
Function call overhead is so small that only loops that call functions can make call overhead relevant.
Therefore, when we talk about (and measure) function call overhead today, we are usually really talking about the overhead of not being able to hoist common subexpressions out of loops. If a function has to do a bunch of (identical) work every time it is called, the compiler would be able to "hoist" it out of the loop and do it once if the function were inlined. When not inlined, the code will probably just go ahead and repeat the work you told it to!
Inlined functions can seem impossibly faster, not because of call and argument overhead, but because of common subexpressions that can be hoisted out of the function.
Example:
// Foo, SomethingUnpredictible() and CalculatePi_1000_digits() are assumed
// to be defined elsewhere; CheckOverhead is declared before its first use.
Foo CheckOverhead(int i);

Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    for (auto i = 0; i < 1000; ++i)
        t += CheckOverhead(SomethingUnpredictible());
    return t.result();
}

Foo CheckOverhead(int i)
{
    auto n = CalculatePi_1000_digits();  // expensive, but the same result every call
    return i * n;
}
An optimizer can see through this foolishness and do:
Foo::result_type MakeMeFaster()
{
    Foo t = 0;
    auto _hidden_optimizer_tmp = CalculatePi_1000_digits();  // hoisted out of the loop
    for (auto i = 0; i < 1000; ++i)
        t += SomethingUnpredictible() * _hidden_optimizer_tmp;
    return t.result();
}
It seems like the call overhead is impossibly reduced, because the compiler really has hoisted a big chunk of the function out of the loop (the CalculatePi_1000_digits call). The compiler would need to be able to prove that CalculatePi_1000_digits always returns the same result, but good optimizers can do that.
#10
1
There is not much overhead at all, especially with small (inline-able) functions or even classes.
The following example has three different tests that are each run many, many times and timed. The results always agree to within a couple of 1000ths of a unit of time.
#include <boost/timer/timer.hpp>
#include <iostream>
#include <cmath>

double sum;
double a = 42, b = 53;

//#define ITERATIONS 1000000 // 1 million - for testing
//#define ITERATIONS 10000000000 // 10 billion ~ 10s per run
//#define WORK_UNIT sum += a + b
/* output
8.609619s wall, 8.611255s user + 0.000000s system = 8.611255s CPU(100.0%)
8.604478s wall, 8.611255s user + 0.000000s system = 8.611255s CPU(100.1%)
8.610679s wall, 8.595655s user + 0.000000s system = 8.595655s CPU(99.8%)
9.5e+011 9.5e+011 9.5e+011
*/

#define ITERATIONS 100000000 // 100 million ~ 10s per run
#define WORK_UNIT sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum)
/* output
8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
2.50001e+015 2.50001e+015 2.50001e+015
*/

// ------------------------------
double simple()
{
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        WORK_UNIT;
    }
    return sum;
}

// ------------------------------
void call6()
{
    WORK_UNIT;
}
void call5(){ call6(); }
void call4(){ call5(); }
void call3(){ call4(); }
void call2(){ call3(); }
void call1(){ call2(); }

double calls()
{
    sum = 0;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        call1();
    }
    return sum;
}

// ------------------------------
class Obj3{
public:
    void runIt(){
        WORK_UNIT;
    }
};

class Obj2{
public:
    Obj2(){it = new Obj3();}
    ~Obj2(){delete it;}
    void runIt(){it->runIt();}
    Obj3* it;
};

class Obj1{
public:
    void runIt(){it.runIt();}
    Obj2 it;
};

double objects()
{
    sum = 0;
    Obj1 obj;
    boost::timer::auto_cpu_timer t;
    for (unsigned long long i = 0; i < ITERATIONS; i++)
    {
        obj.runIt();
    }
    return sum;
}

// ------------------------------
int main(int argc, char** argv)
{
    double ssum = 0;
    double csum = 0;
    double osum = 0;

    ssum = simple();
    csum = calls();
    osum = objects();

    std::cout << ssum << " " << csum << " " << osum << std::endl;
}
The output for running 100,000,000 iterations (of each type: simple, six function calls, three object calls) with this semi-convoluted work payload:
sum += std::sqrt(a*a + b*b + sum) + std::sin(sum) + std::cos(sum)
was as follows:
8.485689s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (100.0%)
8.494153s wall, 8.486454s user + 0.000000s system = 8.486454s CPU (99.9%)
8.467291s wall, 8.470854s user + 0.000000s system = 8.470854s CPU (100.0%)
2.50001e+015 2.50001e+015 2.50001e+015
Using a simple work payload of
sum += a + b
gives the same results, except a couple of orders of magnitude faster in each case.
#11
0
Each new function call requires a new local stack frame to be set up. But the overhead of this would only be noticeable if you are calling a function on every iteration of a loop over a very large number of iterations.
#12
0
For most functions, there is no additional overhead for calling them in C++ vs. C (unless you count the "this" pointer as an unnecessary argument to every function; you have to pass state to a function somehow, though)...
For virtual functions, there is an additional level of indirection (equivalent to calling a function through a pointer in C)... But really, on today's hardware this is trivial.
#13
0
I don't have any numbers, either, but I'm glad you're asking. Too often I see people try to optimize their code starting with vague ideas of overhead, but not really knowing.
#14
0
There are a few issues here.
- If you have a smart enough compiler, it will do some automatic inlining for you even if you did not specify inline. On the other hand, there are many things that cannot be inlined.
- If the function is virtual, then of course you are going to pay the price that it cannot be inlined, because the target is determined at runtime. Conversely, in Java, you might be paying this price unless you indicate that the method is final.
- Depending on how your code is organized in memory, you may be paying a cost in cache misses and even page misses as the code is located elsewhere. That can end up having a huge impact in some applications.
#15
0
Depending on how you structure your code and divide it into units such as modules and libraries, it can in some cases matter profoundly.
- Using a dynamic library function with external linkage will most of the time impose full stack frame processing. That is why using qsort from the standard C library is one order of magnitude (10 times) slower than using STL code when the comparison operation is as simple as integer comparison (see the sketch after this list).
- Passing function pointers between modules will also be affected.
- The same penalty will most likely affect the usage of C++'s virtual functions, as well as other functions whose code is defined in separate modules.
- The good news is that whole-program optimization might resolve the issue for dependencies between static libraries and modules.
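The qsort-versus-STL point above can be seen in miniature below (a sketch; the 10x figure is the answer's claim, the code is mine): qsort must reach the comparator through a function pointer across a library boundary, while std::sort receives the comparison as a template parameter and can inline it into the sorting loop.

#include <algorithm>
#include <cstddef>
#include <cstdlib>

static int cmp_int(const void* a, const void* b) {
    int x = *(const int*)a, y = *(const int*)b;
    return (x > y) - (x < y);     /* avoids overflow of the naive x - y */
}

void sort_both(int* data, std::size_t n) {
    std::qsort(data, n, sizeof(int), cmp_int);  /* indirect call per comparison */
    std::sort(data, data + n);                  /* operator< inlined into the loop */
}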
#16
0
As others have said, you really don't have to worry too much about overhead unless you're going for ultimate performance or something akin to it. When you make a function call, the compiler has to emit code to:
- Save function parameters to the stack
- Save the return address to the stack
- Jump to the starting address of the function
- Allocate space for the function's local variables (stack)
- Run the body of the function
- Save the return value (stack)
- Free the space for the local variables
- Jump back to the saved return address
- Free the space used for the parameters, etc...
However, you also have to account for the lowered readability of your code, as well as how it will impact your testing strategies, maintenance plans, and the overall size of your source files.