I conducted a simple experiment to compare an if-else against a lone if (with default values preset). Example:
void test0(char c, int *x) {
    *x = 0;
    if (c == 99) {
        *x = 15;
    }
}

void test1(char c, int *x) {
    if (c == 99) {
        *x = 15;
    } else {
        *x = 0;
    }
}
For the functions above, I got the exact same assembly code (using cmovne).
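Both functions are, after all, just two spellings of the same conditional assignment, which is presumably why the compiler can lower either one to a cmp plus cmovne and a single store. A minimal C sketch of that common form (my paraphrase, not the generated code):

/* test0 and test1 both reduce to this single conditional store. */
void test01_equiv(char c, int *x) {
    *x = (c == 99) ? 15 : 0;
}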
However, when adding an extra variable:
void test2(char c, int *x, int *y) {
    *x = 0;
    *y = 0;
    if (c == 99) {
        *x = 15;
        *y = 21;
    }
}

void test3(char c, int *x, int *y) {
    if (c == 99) {
        *x = 15;
        *y = 21;
    } else {
        *x = 0;
        *y = 0;
    }
}
The assembly suddenly becomes different:
test2(char, int*, int*):
        cmp     dil, 99
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        je      .L10
        rep ret
.L10:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret

test3(char, int*, int*):
        cmp     dil, 99
        je      .L14
        mov     DWORD PTR [rsi], 0
        mov     DWORD PTR [rdx], 0
        ret
.L14:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret
It seems that the only difference is whether the top movs are done before or after the je.
Now (sorry, my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to save pipeline flushes? And if so, why wouldn't the optimizer (gcc 6.2 -O3) use the better method?
1 Answer
For the functions above, I got the exact same assembly code (using cmovne).
Sure, some compilers may make that optimization, but it is not guaranteed. It is very possible that you will get different object code for those two ways of writing the function.
In fact, no optimization is guaranteed (although modern optimizing compilers do an impressive job most of the time), so you should either write the code to capture the semantic meaning you intend for it to have, or you should verify the generated object code and write the code to ensure that you are getting the expected output.
Here is what older versions of MSVC will generate when targeting x86-32 (primarily because they don't know to use the CMOV instruction):
test0 PROC
        cmp     BYTE PTR [c], 99
        mov     eax, DWORD PTR [x]
        mov     DWORD PTR [eax], 0
        jne     SHORT LN2
        mov     DWORD PTR [eax], 15
LN2:
        ret     0
test0 ENDP

test1 PROC
        mov     eax, DWORD PTR [x]
        xor     ecx, ecx
        cmp     BYTE PTR [c], 99
        setne   cl
        dec     ecx
        and     ecx, 15
        mov     DWORD PTR [eax], ecx
        ret     0
test1 ENDP
Note that test1 gives you branchless code that utilizes the SETNE instruction (a conditional set, which will set its operand to 0 or 1 based on the condition code—in this case, NE) in conjunction with some bit manipulation to produce the correct value. test0 uses a conditional branch to skip over the assignment of 15 to *x.
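In C terms, the SETNE/DEC/AND sequence is computing something like the following (just a sketch of the same arithmetic, not anything the compiler actually sees):

/* A C rendering of MSVC's branchless sequence for test1:
     setne cl       ->  (c != 99) yields 1 or 0
     dec   ecx      ->  subtracting 1 gives 0 or -1 (all bits set)
     and   ecx, 15  ->  masking with 15 leaves 0 or 15            */
void test1_equiv(char c, int *x) {
    *x = ((c != 99) - 1) & 15;
}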
The reason this is interesting is that it is almost exactly the opposite of what you might expect. Naïvely, one might expect that test0 would be the way you'd hold the optimizer's hand and get it to generate branchless code. At least, that's the first thought that went through my head. But in fact, that is not the case! The optimizer is able to recognize the if/else idiom and optimize accordingly! It is not able to make that same optimization in the case of test0, where you tried to outsmart it.
However when adding an extra variable ... The assembly suddenly becomes different
Well, no surprise there. A small change in the code can often have a significant effect on the emitted code. Optimizers are not magic; they are just really complex pattern matchers. You changed the pattern!
Granted, an optimizing compiler could have used two conditional moves here to generate branchless code. In fact, that is precisely what Clang 3.9 does for test3 (but not for test2, consistent with our above analysis showing that optimizers may be better able to recognize standard patterns than unusual ones). But GCC doesn't do this. Again, there is no guarantee of a particular optimization being performed.
It seems that the only difference is if the top "mov"s are done before or after the "je".
Now (sorry my assembly is a bit crude), isn't it always better to have the movs after the jump, in order to save pipeline flushes?
No, not really. That would not improve the code in this case. If the branch is mispredicted, you're going to have a pipeline flush no matter what. It doesn't much matter whether the speculatively mispredicted code is a ret instruction or a mov instruction.
The only reason it would matter that a ret instruction immediately followed a conditional branch is if you were writing the assembly code by hand and didn't know to use a rep ret instruction. This is a trick, necessary on certain AMD processors, that avoids a branch-prediction penalty. Unless you were an assembly guru, you probably wouldn't have known this trick. But the compiler does, and it also knows it is not necessary when you're specifically targeting an Intel processor or a generation of AMD processors that doesn't have this quirk.
However, you might be right about it being better to have the movs after the branch, but not for the reason you suggested. Modern processors (I believe this is Nehalem and later, but I'd look it up in Agner Fog's excellent optimization guides if I needed to verify) are capable of macro-op fusion under certain circumstances. Basically, macro-op fusion means that the CPU's decoder will combine two eligible instructions into one micro-op, saving bandwidth at all stages of the pipeline. A cmp or test instruction followed by a conditional branch instruction, as you see in test3, is eligible for macro-op fusion (actually, there are other conditions that must be met, but this code does meet those requirements). Scheduling other instructions in between the cmp and je, as you see in test2, makes macro-op fusion impossible, potentially making the code execute more slowly.
Arguably, though, this is an optimization defect in the compiler. It could have reordered the mov instructions to place the je immediately after the cmp, preserving the ability for macro-op fusion:
test2a(char, int*, int*):
        mov     DWORD PTR [rsi], 0   ; do the default initialization *first*
        mov     DWORD PTR [rdx], 0
        cmp     dil, 99              ; this is now followed immediately by the conditional
        je      .L10                 ; branch, making macro-op fusion possible
        rep ret
.L10:
        mov     DWORD PTR [rsi], 15
        mov     DWORD PTR [rdx], 21
        ret
Another difference between the object code for test2 and test3 is code size. Thanks to padding that is emitted by the optimizer to align the branch target, the code for test3 is 4 bytes larger than test2. It is very unlikely that this is enough of a difference to matter, though, especially if this code is not being executed within a tight loop where it is guaranteed to be hot in the cache.
So, does that mean you should always write the code as you did in test2? Well, no, for several reasons:
- As we have seen, it might be a pessimization, since the optimizer may not recognize the pattern.
- You should write code for readability and semantic correctness first, only going back to optimize it when your profiler indicates that it is actually a bottleneck. And then, you should only optimize after inspecting and verifying the object code emitted by your compiler, otherwise you could end up with a pessimization. (The standard "trust your compiler until proven otherwise" advice.)
- Even though it may be optimal in certain very simple cases, the "preset" idiom is not generalizable. If your initialization is time-consuming, it may be faster to skip over it when possible (see the sketch after this list). (There is one example discussed here, in the context of VB 6, where string manipulation is so slow that eliding it when possible actually results in faster execution time than fancy branchless code. More generally, the same rationale would apply if you were able to branch around a function call.)
Even here, where it appears to result in very simple and possibly more optimal code, it may actually be slower, because you are writing to memory twice in the case where c is equal to 99, and saving nothing in the case where c is not equal to 99.
You might save this cost by rewriting the code such that it accumulates the final value in a temporary register, only storing it to memory at the end, e.g.:
test2b(char, int*, int*):
        xor     eax, eax             ; pre-zero the EAX register
        xor     ecx, ecx             ; pre-zero the ECX register
        cmp     dil, 99
        je      Done
        mov     eax, 15              ; change the value in EAX if necessary
        mov     ecx, 21              ; change the value in ECX if necessary
Done:
        mov     DWORD PTR [rsi], eax ; store our final temp values to memory
        mov     DWORD PTR [rdx], ecx
        ret
but this clobbers two additional registers (eax and ecx) and may not actually be faster. You'd have to benchmark it. Or trust the compiler to emit this code when it is actually optimal, such as when it has inlined a function like test2 within a tight loop.
- Even if you could guarantee that writing the code in a certain way would cause the compiler to emit branchless code, this would not necessarily be faster! While branches are slow when they are mispredicted, mispredictions are actually quite rare. Modern processors have extremely good branch prediction engines, achieving prediction accuracies of greater than 99% in most cases.
Conditional moves are great for avoiding branch mispredictions, but they have the important disadvantage of increasing the length of a dependency chain. By contrast, a correctly predicted branch breaks the dependency chain. (This is probably why GCC doesn't emit two CMOV instructions when you add the extra variable.) A conditional move is only a performance win if you expect branch prediction to fail. If you can count on a prediction success rate of ~75% or better, a conditional branch is probably faster, because it breaks the dependency chain and has a lower latency. And I would suspect that would be the case here, unless c alternates rapidly back and forth between 99 and not-99 each time the function is called. (See Agner Fog's "Optimizing subroutines in assembly language", pp. 70–71.)
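To illustrate the third bullet, here is the kind of shape where branching around the default can pay off. This is a made-up sketch (expensive_default_init is a hypothetical stand-in for something genuinely slow, like the VB 6 string manipulation mentioned above), just to show the structure:

#include <stddef.h>

/* Hypothetical, deliberately slow default initializer. */
static void expensive_default_init(int *buf, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        buf[i] = 0;
    }
}

/* With if/else, the c == 99 case branches around the expensive default
   entirely; a "preset" version would pay for it on every call. */
void fill(char c, int *buf, size_t n) {
    if (c == 99) {
        buf[0] = 15;    /* cheap special case */
    } else {
        expensive_default_init(buf, n);
    }
}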