void foo(const int constant)
{
for(int i = 0; i < 1000000; i++) {
// do stuff
if(constant < 10) { // Condition is tested million times :(
// inner loop stuff
}
}
}
For every execution of the outer loop the value of "constant" is checked. However, constant never changes so a lot of CPU time is being wasted to test the condition constant < 10? over and over again. A human would realize after the first few passes that constant never changes, and intelligently avoid checking it over and over again. Does the compiler notice this and intelligently optimize it, or is the repeated if loop unavoidable?
对于外部循环的每一个执行,都会检查“常量”的值。但是,常量不会改变,所以为了测试常量< 10,浪费了大量的CPU时间。一遍又一遍。一个人会意识到,在最初的几次传递之后,常数永远不会改变,聪明地避免反复检查它。编译器会注意到这一点并对其进行智能优化吗?还是循环重复是不可避免的?
Personally, I think the problem is unavoidable. Even if the compiler put the comparison before the outer loop and set some kind of boolean variable "skip_inner_stuff" this variable would still have to be checked for each pass of the outer for loop.
我个人认为这个问题是不可避免的。即使编译器将比较放在外部循环之前,并设置某种类型的布尔变量“skip_inner_stuff”,这个变量仍然需要在每次外部for循环传递时进行检查。
What are your thoughts on the matter? Is there a more efficient way to write the above code segment that would avoid the problem?
你对这件事有什么看法?是否有一种更有效的方法来编写上面的代码段,以避免出现问题?
7 个解决方案
#1
6
The optimization you describe is also called loop unswitching. It has been a standard part of optimizing compilers for many years - but if you want to make sure your compiler performs it, compile your sample code with some optimization level (e.g. -O2 in gcc) and check the generated code.
您所描述的优化也称为循环不切换。多年来,它一直是优化编译器的标准部分——但如果您希望确保编译器执行它,那么使用一些优化级别(例如,gcc中的- o2)编译示例代码,并检查生成的代码。
However, in cases where the compiler cannot prove that a piece of code is invariant throughout the loop - e.g. a call to an external function which is not available at compile time - then indeed, manually hoisting the code to be outside the loop can net a very big performance boost.
然而,在这种情况下,编译器无法证明一段代码是整个循环不变——例如调用外部函数在编译时不可用,那么实际上,手动提升循环外的代码可以净很大的性能提升。
#2
3
Compiler can optimize the code but you couldn't expect it does a magic tricks on your code.
编译器可以优化代码,但是你不能指望它会对你的代码起作用。
The optimization strongly depends on your code and the usage of your code. For example if you use foo
like this:
优化很大程度上取决于代码和代码的使用。例如,如果你像这样使用foo:
foo(12345);
Compiler can optimize the code very much. Even it can compute the result in compile time.
编译器可以很好地优化代码。甚至可以在编译时计算结果。
But if you use it like this:
但是如果你像这样使用它:
int k;
cin >> k;
foo(k);
In this case it can not get rid of inner if
(the value is provided in run-time).
在这种情况下,它不能删除内部if(值在运行时提供)。
I wrote a sample code with MinGW/GCC-4.8.0:
我用MinGW/GCC-4.8.0编写了一个示例代码:
void foo(const int constant)
{
int x = 0;
for (int i = 0; i < 1000000; i++)
{
x++;
if (constant < 10)
{
x--;
}
}
cout << x << endl;
}
int main()
{
int k;
cin >> k;
foo(k);
}
Let's see the generate assembly code:
让我们看看生成的汇编代码:
004015E1 MOV EAX,0F4240 // i = 1000000
004015E6 MOV EBP,ESP
004015E8 XOR EDX,EDX
004015EA PUSH ESI
004015EB PUSH EBX
004015EC SUB ESP,10
004015EF MOV EBX,DWORD PTR SS:[EBP+8]
004015F2 XOR ECX,ECX // set ECX to 0
004015F4 CMP EBX,0A // if constant < 10
^^^^^^^^^^
004015F7 SETGE CL // then set ECX to 1
004015FA ADD EDX,ECX // add ECX to i
004015FC SUB EAX,1 // i--
004015FF JNE SHORT 004015F2 // loop if i is not zero
As you can see the inner if
exists in the code. See CMP EBX,0A
.
正如您在代码中看到的内部if。看到CMP EBX,0。
I repeat again it strongly depends on the lines with loops.
我再重复一遍,它强烈地依赖于有循环的线。
#3
2
Others have covered the relevant compiler optimizations: loop unswitching which moves the test outside the loop and provides two separate loop bodies; and code inlining that will in some cases provide the compiler with the actual value of constant
so that it can remove the test, and either execute 'inner loop stuff' unconditionally or remove it entirely.
另一些则介绍了相关的编译器优化:循环反交换,将测试移到循环之外,并提供两个单独的循环体;在某些情况下,代码内联将向编译器提供常量的实际值,以便它可以删除测试,或者无条件地执行“内部循环”,或者完全删除它。
Also be aware that quite aside from anything the compiler does, modern CPU designs actually do something similar to "A human would realize after the first few passes that constant never changes". It's called dynamic branch prediction.
还要注意的是,除了编译器所做的一切之外,现代CPU设计实际上做了一些类似于“人类在经过前几次测试后会意识到常量不会改变”的事情。这叫做动态分支预测。
The key point is that checking an integer is incredibly cheap, and even taking a branch can be very cheap. What's potentially expensive is mis-predicted branches. Modern CPUs use various strategies to guess which way a branch will go, but all of those strategies will quickly start correctly predicting a branch that goes the same way a million times in a row.
关键是检查一个整数是非常便宜的,即使取一个分支也是非常便宜的。潜在的代价是预测错误的分支。现代cpu使用各种策略来猜测分支将走向何方,但是所有这些策略都将很快开始正确地预测一个分支,该分支将连续运行100万次。
What I don't know, is whether modern CPUs are smart enough to spot that constant
is a loop invariant and do the full loop unswitching in microcode. But assuming correct branch prediction, the loop unswitch is probably a minor optimization anyway. The more specific the processor family targeted by the compiler, the more it knows about the quality of its branch predictor, and the more likely it is that the compiler can determine whether the additional benefit of loop unswitching is worth the code bloat.
我不知道的是,现代cpu是否足够聪明,能够发现常量是一个循环不变量,并在微代码中执行全循环不切换。但假设分支预测正确,环解耦可能是一个小的优化。编译器针对的处理器家族越具体,它就越了解它的分支预测器的质量,更有可能的是,编译器可以确定循环不切换的额外好处是否值得代码膨胀。
Of course there are still minimal CPUs, where the compiler has to provide all the cleverness. The CPU in your PC is not one of them.
当然,仍然有很少的cpu,编译器必须提供所有的聪明之处。你电脑里的CPU不是其中之一。
#4
1
you could optimise it by hand:
您可以手工优化:
void foo(const int constant)
{
if (constant < 10) {
for(int i = 0; i < 1000000; i++) {
// do stuff
// inner loop stuff here
}
} else {
for(int i = 0; i < 1000000; i++) {
// do stuff
// NO inner loop stuff here
}
}
}
I don't know whether most compilers would do something like this, but it doesn't seem like too much of a stretch.
我不知道大多数编译器是否会做这样的事情,但这似乎不太过分。
#5
1
A good compiler might optimize it.
一个好的编译器可能会对它进行优化。
Compilers optimize based on cost analysis. A good compiler should thus estimate the cost of each alternative (with and without hoisting) and pick whichever is cheaper.
编译器根据成本分析进行优化。因此,一个好的编译器应该估计每一种选择的成本(没有提升),选择更便宜的。
It means that if the code in the inner part is big, it might not be worth optimizing because this could lead to instruction cache trashing. On the other hand, if it is cheap, then it can be hoisted.
这意味着,如果内部部分的代码很大,那么它可能不值得进行优化,因为这可能导致指令缓存垃圾化。另一方面,如果价格便宜,就可以吊起来。
If it shows up in the profiler because it has not been optimized, the compiler messed up.
如果它因为没有优化而出现在分析器中,编译器就会出错。
#6
0
A good compiler will optimize that (when optimizations are enabled).
一个好的编译器会对其进行优化(当启用优化时)。
If using GCC you could
如果你可以使用GCC的话
-
compile with optimization and assembly code generation with
使用优化和汇编代码生成进行编译
gcc -Wall -O2 -fverbose-asm -S source.c
then look (with some editor, or a pager like
less
) into the generated assembly codesource.s
然后在生成的汇编代码源代码中查找(使用一些编辑器或类似于less的寻呼机)
-
ask GCC to dump a lot (hundreds!) of intermediate files and look inside the intermediate gimple representation in it
请GCC转储大量(数百!)的中间文件,并查看其中的中间gimple表示
gcc -Wall -O2 -fdump-tree-all -c source.c
-
use MELT and its probe to look interactively inside the gimple.
使用熔体和它的探头来交互地观察框架内部。
Take the habit of always asking all warnings with -Wall
from gcc
(or g++
if compiling C++ code.
养成经常使用gcc的-Wall请求所有警告的习惯(如果编译c++代码,则使用g++)。
BTW, in practice, such an optimization ("loop invariant code hoisting" as the other answer explains) is essential, because such kind of intermediate code happens very often, e.g. after function inlining.... (imagine several calls to foo
been inlined...)
顺便说一句,在实践中,这样的优化(“循环不变式代码提升”为其他答案解释)是至关重要的,因为这样的中间代码经常发生,如,在函数内联....(假设有几个对foo的调用是内联的…)
#7
0
Actually all the modern compiler does the optimization, and to abide this optimization if you think think that compiler shouldn't do this optimization you should make the variable as "volatile".
实际上,所有的现代编译器都会进行优化,如果你认为编译器不应该进行这种优化,那么你应该将变量设置为“volatile”。
#1
6
The optimization you describe is also called loop unswitching. It has been a standard part of optimizing compilers for many years - but if you want to make sure your compiler performs it, compile your sample code with some optimization level (e.g. -O2 in gcc) and check the generated code.
您所描述的优化也称为循环不切换。多年来,它一直是优化编译器的标准部分——但如果您希望确保编译器执行它,那么使用一些优化级别(例如,gcc中的- o2)编译示例代码,并检查生成的代码。
However, in cases where the compiler cannot prove that a piece of code is invariant throughout the loop - e.g. a call to an external function which is not available at compile time - then indeed, manually hoisting the code to be outside the loop can net a very big performance boost.
然而,在这种情况下,编译器无法证明一段代码是整个循环不变——例如调用外部函数在编译时不可用,那么实际上,手动提升循环外的代码可以净很大的性能提升。
#2
3
Compiler can optimize the code but you couldn't expect it does a magic tricks on your code.
编译器可以优化代码,但是你不能指望它会对你的代码起作用。
The optimization strongly depends on your code and the usage of your code. For example if you use foo
like this:
优化很大程度上取决于代码和代码的使用。例如,如果你像这样使用foo:
foo(12345);
Compiler can optimize the code very much. Even it can compute the result in compile time.
编译器可以很好地优化代码。甚至可以在编译时计算结果。
But if you use it like this:
但是如果你像这样使用它:
int k;
cin >> k;
foo(k);
In this case it can not get rid of inner if
(the value is provided in run-time).
在这种情况下,它不能删除内部if(值在运行时提供)。
I wrote a sample code with MinGW/GCC-4.8.0:
我用MinGW/GCC-4.8.0编写了一个示例代码:
void foo(const int constant)
{
int x = 0;
for (int i = 0; i < 1000000; i++)
{
x++;
if (constant < 10)
{
x--;
}
}
cout << x << endl;
}
int main()
{
int k;
cin >> k;
foo(k);
}
Let's see the generate assembly code:
让我们看看生成的汇编代码:
004015E1 MOV EAX,0F4240 // i = 1000000
004015E6 MOV EBP,ESP
004015E8 XOR EDX,EDX
004015EA PUSH ESI
004015EB PUSH EBX
004015EC SUB ESP,10
004015EF MOV EBX,DWORD PTR SS:[EBP+8]
004015F2 XOR ECX,ECX // set ECX to 0
004015F4 CMP EBX,0A // if constant < 10
^^^^^^^^^^
004015F7 SETGE CL // then set ECX to 1
004015FA ADD EDX,ECX // add ECX to i
004015FC SUB EAX,1 // i--
004015FF JNE SHORT 004015F2 // loop if i is not zero
As you can see the inner if
exists in the code. See CMP EBX,0A
.
正如您在代码中看到的内部if。看到CMP EBX,0。
I repeat again it strongly depends on the lines with loops.
我再重复一遍,它强烈地依赖于有循环的线。
#3
2
Others have covered the relevant compiler optimizations: loop unswitching which moves the test outside the loop and provides two separate loop bodies; and code inlining that will in some cases provide the compiler with the actual value of constant
so that it can remove the test, and either execute 'inner loop stuff' unconditionally or remove it entirely.
另一些则介绍了相关的编译器优化:循环反交换,将测试移到循环之外,并提供两个单独的循环体;在某些情况下,代码内联将向编译器提供常量的实际值,以便它可以删除测试,或者无条件地执行“内部循环”,或者完全删除它。
Also be aware that quite aside from anything the compiler does, modern CPU designs actually do something similar to "A human would realize after the first few passes that constant never changes". It's called dynamic branch prediction.
还要注意的是,除了编译器所做的一切之外,现代CPU设计实际上做了一些类似于“人类在经过前几次测试后会意识到常量不会改变”的事情。这叫做动态分支预测。
The key point is that checking an integer is incredibly cheap, and even taking a branch can be very cheap. What's potentially expensive is mis-predicted branches. Modern CPUs use various strategies to guess which way a branch will go, but all of those strategies will quickly start correctly predicting a branch that goes the same way a million times in a row.
关键是检查一个整数是非常便宜的,即使取一个分支也是非常便宜的。潜在的代价是预测错误的分支。现代cpu使用各种策略来猜测分支将走向何方,但是所有这些策略都将很快开始正确地预测一个分支,该分支将连续运行100万次。
What I don't know, is whether modern CPUs are smart enough to spot that constant
is a loop invariant and do the full loop unswitching in microcode. But assuming correct branch prediction, the loop unswitch is probably a minor optimization anyway. The more specific the processor family targeted by the compiler, the more it knows about the quality of its branch predictor, and the more likely it is that the compiler can determine whether the additional benefit of loop unswitching is worth the code bloat.
我不知道的是,现代cpu是否足够聪明,能够发现常量是一个循环不变量,并在微代码中执行全循环不切换。但假设分支预测正确,环解耦可能是一个小的优化。编译器针对的处理器家族越具体,它就越了解它的分支预测器的质量,更有可能的是,编译器可以确定循环不切换的额外好处是否值得代码膨胀。
Of course there are still minimal CPUs, where the compiler has to provide all the cleverness. The CPU in your PC is not one of them.
当然,仍然有很少的cpu,编译器必须提供所有的聪明之处。你电脑里的CPU不是其中之一。
#4
1
you could optimise it by hand:
您可以手工优化:
void foo(const int constant)
{
if (constant < 10) {
for(int i = 0; i < 1000000; i++) {
// do stuff
// inner loop stuff here
}
} else {
for(int i = 0; i < 1000000; i++) {
// do stuff
// NO inner loop stuff here
}
}
}
I don't know whether most compilers would do something like this, but it doesn't seem like too much of a stretch.
我不知道大多数编译器是否会做这样的事情,但这似乎不太过分。
#5
1
A good compiler might optimize it.
一个好的编译器可能会对它进行优化。
Compilers optimize based on cost analysis. A good compiler should thus estimate the cost of each alternative (with and without hoisting) and pick whichever is cheaper.
编译器根据成本分析进行优化。因此,一个好的编译器应该估计每一种选择的成本(没有提升),选择更便宜的。
It means that if the code in the inner part is big, it might not be worth optimizing because this could lead to instruction cache trashing. On the other hand, if it is cheap, then it can be hoisted.
这意味着,如果内部部分的代码很大,那么它可能不值得进行优化,因为这可能导致指令缓存垃圾化。另一方面,如果价格便宜,就可以吊起来。
If it shows up in the profiler because it has not been optimized, the compiler messed up.
如果它因为没有优化而出现在分析器中,编译器就会出错。
#6
0
A good compiler will optimize that (when optimizations are enabled).
一个好的编译器会对其进行优化(当启用优化时)。
If using GCC you could
如果你可以使用GCC的话
-
compile with optimization and assembly code generation with
使用优化和汇编代码生成进行编译
gcc -Wall -O2 -fverbose-asm -S source.c
then look (with some editor, or a pager like
less
) into the generated assembly codesource.s
然后在生成的汇编代码源代码中查找(使用一些编辑器或类似于less的寻呼机)
-
ask GCC to dump a lot (hundreds!) of intermediate files and look inside the intermediate gimple representation in it
请GCC转储大量(数百!)的中间文件,并查看其中的中间gimple表示
gcc -Wall -O2 -fdump-tree-all -c source.c
-
use MELT and its probe to look interactively inside the gimple.
使用熔体和它的探头来交互地观察框架内部。
Take the habit of always asking all warnings with -Wall
from gcc
(or g++
if compiling C++ code.
养成经常使用gcc的-Wall请求所有警告的习惯(如果编译c++代码,则使用g++)。
BTW, in practice, such an optimization ("loop invariant code hoisting" as the other answer explains) is essential, because such kind of intermediate code happens very often, e.g. after function inlining.... (imagine several calls to foo
been inlined...)
顺便说一句,在实践中,这样的优化(“循环不变式代码提升”为其他答案解释)是至关重要的,因为这样的中间代码经常发生,如,在函数内联....(假设有几个对foo的调用是内联的…)
#7
0
Actually all the modern compiler does the optimization, and to abide this optimization if you think think that compiler shouldn't do this optimization you should make the variable as "volatile".
实际上,所有的现代编译器都会进行优化,如果你认为编译器不应该进行这种优化,那么你应该将变量设置为“volatile”。