Why does dereferencing make my program faster?

Consider the following test programs:

Loop value on the stack

int main( void ) {
    int iterations = 1000000000;

    while ( iterations > 0 )
        -- iterations;
}

Loop value on the stack (dereferenced)


int main( void ) {
    int iterations = 1000000000;
    int * p = & iterations;

    while ( * p > 0 )
        -- * p;
}

Loop value on the heap


#include <stdlib.h>

int main( void ) {
    int * p = malloc( sizeof( int ) );
    * p = 1000000000;

    while ( *p > 0 )
        -- * p;
}

Compiling them with -O0, I get the following execution times:

case1.c
real    0m2.698s
user    0m2.690s
sys     0m0.003s

case2.c
real    0m2.574s
user    0m2.567s
sys     0m0.000s

case3.c
real    0m2.566s
user    0m2.560s
sys     0m0.000s

[edit] The averages over 10 executions (in seconds) are:

case1.c
2.70364

case2.c
2.57091

case3.c
2.57000
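
For anyone who wants to repeat the comparison inside a single process, here is a minimal sketch that times the first two loops and averages over several runs. It is not the method used for the numbers above (those come from `time`). The function names are hypothetical, `clock()` measures CPU time (roughly the `user` column), and the `volatile` qualifier is an addition so the loops are not removed if optimization is turned on by accident.

```c
#include <assert.h>
#include <time.h>

/* Case 1: count down a stack variable directly.
   volatile is an assumption added here; the original program has none. */
static void countdown_direct(int iterations) {
    volatile int n = iterations;
    while (n > 0)
        --n;
}

/* Case 2: count down the same kind of variable through a pointer. */
static void countdown_deref(int iterations) {
    volatile int n = iterations;
    volatile int *p = &n;
    while (*p > 0)
        --*p;
}

/* Run `loop` `runs` times and return the mean CPU time in seconds. */
static double average_cpu_seconds(void (*loop)(int), int iterations, int runs) {
    assert(runs > 0);
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        clock_t start = clock();
        loop(iterations);
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / runs;
}
```

Calling `average_cpu_seconds(countdown_direct, 1000000000, 10)` and the same with `countdown_deref` from `main`, built with -O0, should give figures comparable to the ones listed here, though the exact values depend on the machine.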

Why is the execution time longer for the first test case, which seems to be the simplest?

My current architecture is an x86 virtual machine (Arch Linux). I get these results with both gcc (4.8.0) and clang (3.3).

[edit 1] The generated assembly is almost identical, except that the second and third programs have more instructions than the first.

[edit 2] These results are reproducible (on my system). Each execution time has the same order of magnitude.

[edit 3] I don't really care about the performance of a non-optimized program, but I don't understand why it would be slower, and I'm curious.

1 solution

#1

It's hard to say whether this is the reason, since I'm guessing somewhat and you haven't given some specifics (such as which target you're using). But when I compile without optimizations for an x86 target, I see the following sequences for decrementing the iterations variable:

Case 1:


L3:
    sub DWORD PTR [esp+12], 1
L2:
    cmp DWORD PTR [esp+12], 0
    jg  L3

Case 2:


L3:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    lea edx, [eax-1]
    mov eax, DWORD PTR [esp+12]
    mov DWORD PTR [eax], edx
L2:
    mov eax, DWORD PTR [esp+12]
    mov eax, DWORD PTR [eax]
    test    eax, eax
    jg  L3

One big difference you can see in case 1 is that the instruction at L3 both reads and writes the memory location. It is followed immediately by an instruction that reads the same memory location that was just written. This kind of sequence (a memory location written, then immediately read by the next instruction) often causes some sort of pipeline stall in modern CPUs.

You'll note that the write followed immediately by a read of the same location is not present in case 2.
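
To make the two instruction sequences easier to compare, here is a rough C paraphrase of them, one statement per memory access. This is only a sketch: `volatile` is used to force each load and store to happen as written, the names are hypothetical, and the `steps` counter is an addition so the functions return something checkable. It shows the order of memory accesses, not the actual timing behaviour of any CPU.

```c
#include <assert.h>

/* Case 1: the counter itself lives in the stack slot [esp+12]. */
static long run_case1(int start) {
    volatile int slot = start;
    long steps = 0;              /* bookkeeping only, not in the assembly */
    while (slot > 0) {           /* cmp DWORD PTR [esp+12], 0 ; jg L3 */
        slot = slot - 1;         /* sub DWORD PTR [esp+12], 1 (read + write) */
        ++steps;
    }
    return steps;
}

/* Case 2: the stack slot [esp+12] holds a pointer to the counter. */
static long run_case2(int start) {
    volatile int value = start;
    volatile int *volatile slot = &value;
    long steps = 0;              /* bookkeeping only, not in the assembly */
    for (;;) {
        /* L2: reload the pointer, then load the counter and test it */
        volatile int *addr = slot;   /* mov eax, DWORD PTR [esp+12] */
        int cur = *addr;             /* mov eax, DWORD PTR [eax]    */
        if (cur <= 0)                /* test eax, eax ; jg L3       */
            break;
        /* L3: reload pointer, load counter, decrement, reload pointer, store */
        addr = slot;                 /* mov eax, DWORD PTR [esp+12] */
        cur = *addr;                 /* mov eax, DWORD PTR [eax]    */
        int next = cur - 1;          /* lea edx, [eax-1]            */
        addr = slot;                 /* mov eax, DWORD PTR [esp+12] */
        *addr = next;                /* mov DWORD PTR [eax], edx    */
        ++steps;
    }
    return steps;
}
```

Both functions perform the same number of decrements for the same input. In this paraphrase, the store in case 1 is followed directly by a load of the same slot, while in case 2 a reload of the pointer sits between the store and the next load of the counter.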


Again - this answer is a bit of informed speculation.

