为什么在同一个程序中，相同C循环的相同副本执行的时间会有显著的不同，但始终是不同的?

I hope that I have reduced my question to a simple and reproducible test case. The source (which is here) contains 10 copies of an identical simple loop. Each loop is of the form:

我希望我已经将我的问题简化为一个简单且可重复的测试用例。源(在这里)包含10个相同的简单循环的副本。每个循环的形式是:

#define COUNT (1000 * 1000 * 1000)
volatile uint64_t counter = 0;

void loopN(void) {
  for (int j = COUNT; j != 0; j--) {
    uint64_t val = counter;
    val = val + 1;
    counter = val;
  }
  return;
}

The 'volatile' of the variable is important, as it forces the value to read and written from memory on each iteration. Each loop is aligned to 64 bytes using '-falign-loops=64' and produces identical assembly except for the relative offset to the global:

变量的“volatile”非常重要，因为它强制在每次迭代中从内存中读取和写入值。每个循环使用'-falign-loop =64'对齐到64字节，并且产生相同的程序集，除了相对于全局的偏移量:

   400880:       48 8b 15 c1 07 20 00    mov    0x2007c1(%rip),%rdx  # 601048 <counter>
   400887:       48 83 c2 01             add    $0x1,%rdx
   40088b:       83 e8 01                sub    $0x1,%eax
   40088e:       48 89 15 b3 07 20 00    mov    %rdx,0x2007b3(%rip)  # 601048 <counter>
   400895:       75 e9                   jne    400880 <loop8+0x20>

I'm running Linux 3.11 on an Intel Haswell i7-4470. I'm compiling the program with GCC 4.8.1 and the command line:

我在Intel的Haswell i7-4470上运行Linux 3.11。我正在用GCC 4.8.1和命令行编译程序:

 gcc -std=gnu99 -O3 -falign-loops=64 -Wall -Wextra same-function.c -o same-function

I'm also using attribute((noinline)) within the source to make the assembly clearer, but this is not necessary to observe the issue. I find the fastest and slowest functions with a shell loop:

我还在源代码中使用属性(noinline)来使程序集更清晰，但这并不是观察问题的必要条件。我发现最快和最慢的函数有一个壳环:

for n in 0 1 2 3 4 5 6 7 8 9; 
do echo same-function ${n}:; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
/usr/bin/time -f "%e seconds" same-function ${n}; 
done

It produces results which are consistent to about 1% from run to run, with the exact numbers of the fastest and slowest functions varying depending on the exact binary layout:

它产生的结果从运行到运行的一致性约为1%，最快和最慢的函数的确切数量根据精确的二进制布局而变化:

same-function 0:
2.08 seconds
2.04 seconds
2.06 seconds
same-function 1:
2.12 seconds
2.12 seconds
2.12 seconds
same-function 2:
2.10 seconds
2.14 seconds
2.11 seconds
same-function 3:
2.04 seconds
2.04 seconds
2.05 seconds
same-function 4:
2.05 seconds
2.00 seconds
2.03 seconds
same-function 5:
2.07 seconds
2.07 seconds
1.98 seconds
same-function 6:
1.83 seconds
1.83 seconds
1.83 seconds
same-function 7:
1.95 seconds
1.98 seconds
1.95 seconds
same-function 8:
1.86 seconds
1.88 seconds
1.86 seconds
same-function 9:
2.04 seconds
2.04 seconds
2.02 seconds

In this case, we see that that loop2() is one of the slowest to execute and loop6() is one of the fastest, with a difference just over 10%. We reconfirm this by testing just these two cases repeatedly with a different method:

在本例中，我们看到loop2()是执行最慢的一个，而loop6()是最快的一个，差异略高于10%。我们用不同的方法反复测试这两种情况，再次确认这一点:

nate@haswell$ N=2; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     7,180,104,866 cycles                    #    3.391 GHz
     7,169,930,711 cycles                    #    3.391 GHz
     7,150,190,394 cycles                    #    3.391 GHz
     7,188,959,096 cycles                    #    3.391 GHz
     7,177,272,608 cycles                    #    3.391 GHz
     7,093,246,955 cycles                    #    3.391 GHz
     7,210,636,865 cycles                    #    3.391 GHz
     7,239,838,211 cycles                    #    3.391 GHz
     7,172,716,779 cycles                    #    3.391 GHz
     7,223,252,964 cycles                    #    3.391 GHz

nate@haswell$ N=6; for i in {1..10}; do perf stat same-function $N 2>&1 | grep GHz; done
     6,234,770,361 cycles                    #    3.391 GHz
     6,199,096,296 cycles                    #    3.391 GHz
     6,213,348,126 cycles                    #    3.391 GHz
     6,217,971,263 cycles                    #    3.391 GHz
     6,224,779,686 cycles                    #    3.391 GHz
     6,194,117,897 cycles                    #    3.391 GHz
     6,225,259,274 cycles                    #    3.391 GHz
     6,244,391,509 cycles                    #    3.391 GHz
     6,189,972,381 cycles                    #    3.391 GHz
     6,205,556,306 cycles                    #    3.391 GHz

Considering this confirmed, we re-read every word in every Intel architectural manual ever written, sift through every page on the entire web that mentions the words 'computer' or 'programming', and meditate in isolation on a mountaintop for 6 years. Failing to achieve any sort of enlightenment, we come down to civilization, shave our beard, take a bath, and ask the experts of *:

考虑到这一点，我们重新阅读了每一本英特尔架构手册中的每一个字，仔细阅读了所有提到“计算机”或“编程”的网页，并在山顶上独自冥想了6年。由于没有得到任何启示，我们来到文明社会，剃掉胡须，洗个澡，并询问*的专家:

What can possibly be happening here?

这里可能会发生什么?

Edit: With Benjamin's help (see his answer below) I've come up with an even more succinct test case. It's a standalone 20 lines of assembly. Changing from using SUB to SBB causes a 15% difference in performance even though the result remains the same and the same number of instructions are executed. Explanations? I think I'm getting closer to one.

编辑:在本杰明的帮助下(见下面他的答案)，我提出了一个更简洁的测试用例。它是一个独立的20条装配线。从使用SUB到SBB的更改会导致15%的性能差异，即使结果保持不变并且执行相同数量的指令。解释吗?我想我离1越来越近了。

; Minimal example, see also http://*.com/q/26266953/3766665
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; On Haswell and Sandy Bridge, observed runtime varies 
; ~15% depending on whether sub or sbb is used in the loop
section .text
global _start
_start:
  push qword 0h       ; put counter variable on stack
  jmp loop            ; jump to function
align 64              ; function alignment.
loop:
  mov rcx, 1000000000
align 64              ; loop alignment.
l:
  mov rax, [rsp]
  add rax, 1h
  mov [rsp], rax
; sbb rcx, 1h         ; which is faster: sbb or sub?
  sub rcx, 1h         ; switch, time it, and find out
  jne l               ; (rot13 spoiler: foo vf snfgre ol 15%)
fin:                  ; If that was too easy, explain why.
  mov eax, 60
  xor edi, edi        ; End of program. Exit with code 0
  syscall

2 个解决方案

#1

Looking at the full perf stat output, you will see that it is not the number of instructions that varies but the number of stalled cycles.

查看完整的perf stat输出，您将看到变化的不是指令的数量，而是停滞的周期的数量。

Looking at the disassembly, I have found two things:

看到这个拆卸过程，我发现了两件事:

The offset to the counter variable varies between the functions. Making counter local to each function did not make the behaviour go away, though.
计数器变量的偏移量在函数之间是不同的。但是，使每个函数的计数器都是本地的并不能使行为消失。
The functions are not placed on 64 byte boundaries, so they may cover a different amount of cache lines. Compiling with -falign-functions=64 did make the differences go away almost completely.
这些函数没有设置在64字节边界上，因此它们可能包含不同数量的缓存行。使用-falign-functions=64进行编译时，差异几乎完全消失了。

Doing the test with the aforementioned changes on my machine then yields:

在我的机器上做上述改动的测试后，结果如下:

for f in $( seq 7); do perf stat -e cycles -r3 ./same-function $f 2>&1; done|grep cycles 6,070,933,420 cycles ( +- 0.11% ) 6,052,771,142 cycles ( +- 0.06% ) 6,099,676,333 cycles ( +- 0.07% ) 6,092,962,697 cycles ( +- 0.16% ) 6,151,861,993 cycles ( +- 0.69% ) 6,074,323,033 cycles ( +- 0.36% ) 6,174,434,653 cycles ( +- 0.65% )

$ f (seq 7);执行统计-e循环-r3 ./同功能$ f2 >和1;|grep循环6 070,933420循环(+- 0.11%)6,052,771,142周期(+- 0.06%)6,099,676,333循环(+- 0.07%)6,092,962,697环(+- 0.16%)6,151,861,993个周期(+- 0.69%)6,074,323,033个周期(+- 0.36%)6,174,434,653个周期(+- 0.65%)

I don't know more about the nature of the stalls you found, though.

我不太了解你发现的摊位的性质。

EDIT: I have made counter a volatile member in each function, tested different compilations on my I7-3537U and found `-falign-loops=64' actually to be the slowest:

编辑:我在每个函数中都设置了一个易变的计数器，在我的I7-3537U上测试了不同的编译，发现' -falign-loop =64'实际上是最慢的:

$ gcc  -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function
$ gcc -falign-loops=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-l64
$ gcc -falign-functions=64 -std=gnu99 -O3 -Wall -Wextra same-function.c -o same-function-f64
$ for prog in same-function{,-l64,-f64}; do echo $prog; for f in $(seq 7); do perf stat -e cycles -r10 ./$prog $f 2>&1; done|grep cycl; done
same-function
     6,079,966,292      cycles                     ( +-  0.19% )
     7,419,053,569      cycles                     ( +-  0.07% )
     6,136,061,105      cycles                     ( +-  0.27% )
     7,282,434,896      cycles                     ( +-  0.74% )
     6,104,866,406      cycles                     ( +-  0.16% )
     7,342,985,942      cycles                     ( +-  0.52% )
     6,208,373,040      cycles                     ( +-  0.50% )
same-function-l64
     7,336,838,175      cycles                     ( +-  0.46% )
     7,358,913,923      cycles                     ( +-  0.52% )
     7,412,570,515      cycles                     ( +-  0.38% )
     7,435,048,756      cycles                     ( +-  0.10% )
     7,404,834,458      cycles                     ( +-  0.34% )
     7,291,095,582      cycles                     ( +-  0.99% )
     7,312,052,598      cycles                     ( +-  0.95% )
same-function-f64
     6,103,059,996      cycles                     ( +-  0.12% )
     6,116,601,533      cycles                     ( +-  0.29% )
     6,120,841,824      cycles                     ( +-  0.18% )
     6,114,278,098      cycles                     ( +-  0.09% )
     6,105,938,586      cycles                     ( +-  0.14% )
     6,101,672,717      cycles                     ( +-  0.19% )
     6,121,339,944      cycles                     ( +-  0.11% )

More details on align-loops vs align-functions:

更多关于align-loop vs align-function的细节:

$ for prog in same-function{-l64,-f64}; do sudo perf stat -d -r10 ./$prog 0; done

 Performance counter stats for './same-function-l64 0' (10 runs):

       2396.608194      task-clock:HG (msec)      #    1.001 CPUs utilized            ( +-  0.64% )
                56      context-switches:HG       #    0.024 K/sec                    ( +-  5.51% )
                 1      cpu-migrations:HG         #    0.000 K/sec                    ( +- 74.78% )
                46      page-faults:HG            #    0.019 K/sec                    ( +-  0.63% )
     7,331,450,530      cycles:HG                 #    3.059 GHz                      ( +-  0.51% ) [85.68%]
     5,332,248,218      stalled-cycles-frontend:HG #   72.73% frontend cycles idle     ( +-  0.71% ) [71.42%]
   <not supported>      stalled-cycles-backend:HG
     5,000,800,933      instructions:HG           #    0.68  insns per cycle
                                                  #    1.07  stalled cycles per insn  ( +-  0.04% ) [85.73%]
     1,000,446,303      branches:HG               #  417.443 M/sec                    ( +-  0.04% ) [85.75%]
             8,461      branch-misses:HG          #    0.00% of all branches          ( +-  6.05% ) [85.76%]
   <not supported>      L1-dcache-loads:HG
            45,593      L1-dcache-load-misses:HG  #    0.00% of all L1-dcache hits    ( +-  3.61% ) [85.77%]
             6,148      LLC-loads:HG              #    0.003 M/sec                    ( +-  8.80% ) [71.36%]
   <not supported>      LLC-load-misses:HG

       2.394456699 seconds time elapsed                                          ( +-  0.64% )


 Performance counter stats for './same-function-f64 0' (10 runs):

       1998.936383      task-clock:HG (msec)      #    1.001 CPUs utilized            ( +-  0.61% )
                60      context-switches:HG       #    0.030 K/sec                    ( +- 17.77% )
                 1      cpu-migrations:HG         #    0.001 K/sec                    ( +- 47.86% )
                46      page-faults:HG            #    0.023 K/sec                    ( +-  0.68% )
     6,107,877,836      cycles:HG                 #    3.056 GHz                      ( +-  0.34% ) [85.63%]
     4,112,602,649      stalled-cycles-frontend:HG #   67.33% frontend cycles idle     ( +-  0.52% ) [71.41%]
   <not supported>      stalled-cycles-backend:HG
     5,000,910,172      instructions:HG           #    0.82  insns per cycle
                                                  #    0.82  stalled cycles per insn  ( +-  0.01% ) [85.72%]
     1,000,423,026      branches:HG               #  500.478 M/sec                    ( +-  0.02% ) [85.77%]
            10,660      branch-misses:HG          #    0.00% of all branches          ( +- 13.23% ) [85.80%]
   <not supported>      L1-dcache-loads:HG
            47,492      L1-dcache-load-misses:HG  #    0.00% of all L1-dcache hits    ( +- 14.82% ) [85.80%]
            11,719      LLC-loads:HG              #    0.006 M/sec                    ( +- 42.44% ) [71.28%]
   <not supported>      LLC-load-misses:HG

       1.997319759 seconds time elapsed                                          ( +-  0.62% )

Both executables have a very low instructions/cycle count, probably due to the minimalistic nature of the loop and the memory pressure that it is inflicting, but I have no idea on why one is worse than the other.

两个可执行程序的指令/循环计数都很低，这可能是由于循环的极简性质和它所施加的内存压力，但是我不知道为什么一个比另一个更差。

I have also tried something along the lines of

我也尝试过类似的方法

$ for prog in same-function{-l64,-f64}; do sudo perf stat -eL1-{d,i}cache-load-misses,L1-dcache-store-misses,cs,cycles,instructions -r10 ./$prog 0; done

and

和

$ sudo perf record -F25000 -e'{cycles:pp,stalled-cycles-frontend}' ./same-function-l64 0
[ perf record: Woken up 28 times to write data ]
[ perf record: Captured and wrote 6.771 MB perf.data (~295841 samples) ]
$ sudo perf report --group -Sloop0 -n --show-total-period --stdio
$ sudo perf annotate --group -sloop0  --stdio

But without any success on finding the culprit. Still, I felt it might be helpful to jot it down here for you anyway...

但是没有找到罪犯。不过，我还是觉得把它写在这里对你有帮助……

Edit 2: Here is my patch to same-function.c:

编辑2:这是我的补丁到相同功能。

$ git diff -u -U0
diff --git a/same-function.c b/same-function.c
index f78449e..78a5772 100644
--- a/same-function.c
+++ b/same-function.c
@@ -20 +20 @@ done
-volatile uint64_t counter = 0;
+//volatile uint64_t counter = 0;
@@ -22,0 +23 @@ COMPILER_NO_INLINE void loop0(void) {
+volatile uint64_t counter = 0;
@@ -31,0 +33 @@ COMPILER_NO_INLINE void loop1(void) {
+volatile uint64_t counter = 0;
@@ -40,0 +43 @@ COMPILER_NO_INLINE void loop2(void) {
+volatile uint64_t counter = 0;
@@ -49,0 +53 @@ COMPILER_NO_INLINE void loop3(void) {
+volatile uint64_t counter = 0;
@@ -58,0 +63 @@ COMPILER_NO_INLINE void loop4(void) {
+volatile uint64_t counter = 0;
@@ -67,0 +73 @@ COMPILER_NO_INLINE void loop5(void) {
+volatile uint64_t counter = 0;
@@ -76,0 +83 @@ COMPILER_NO_INLINE void loop6(void) {
+volatile uint64_t counter = 0;
@@ -85,0 +93 @@ COMPILER_NO_INLINE void loop7(void) {
+volatile uint64_t counter = 0;
@@ -94,0 +103 @@ COMPILER_NO_INLINE void loop8(void) {
+volatile uint64_t counter = 0;
@@ -103,0 +113 @@ COMPILER_NO_INLINE void loop9(void) {
+volatile uint64_t counter = 0;
@@ -135 +145 @@ int main(int argc, char** argv) {
-}
\ No newline at end of file
+}

Edit 3: Same thing with even more minimal example:

编辑3:同样的事情还有更小的例子:

; Minimal example, see also http://*.com/q/26266953/3766665 
; To build (Linux):
;   nasm -felf64 func.asm
;   ld func.o
; Then run:
;   perf stat -r10 ./a.out
; Runtime varies ~10% depending on whether 
section .text
global _start
_start:
  push qword 0h       ; put counter variable on stack
  jmp loop            ; jump to function
;align 64             ; function alignment. Try commenting this
loop:
  mov rcx, 1000000000
;align 64             ; loop alignment. Try commenting this
l:
  mov rax, [rsp]
  add rax, 1h
  mov [rsp], rax
  sub rcx, 1h
  jne l
fin:                  ; End of program. Exit with code 0
  mov eax, 60
  xor edi, edi
  syscall

Same effect here. Interesting.

同样的效果。有趣。

Cheers, Benjamin

欢呼,本杰明

#2

A few years ago, I would have told you to check for any difference in the internal state of the CPU when it arrives at any of those loops; as this was known to have a profound effect on the capacity of the Out-of-Order predictions process (or something like that). For example, the performance of the very same loop could change by up to something like 15-20% depending on what the CPU was doing just before entering the loop and just the fact of jumping from two different points could be sufficient to change the execution speed.

几年前，我会告诉你，当CPU到达任何一个循环时，检查它的内部状态是否有差异;众所周知，这对无序预测过程(或类似的东西)的能力有着深远的影响。例如，同样的循环的性能可能会变化高达15-20%，这取决于CPU在进入循环之前所做的事情，仅仅从两个不同的点上跳转就足以改变执行速度。

In your case, it's quite to test. All you have to do is to change the order of the instructions in the IF block; for example replacing the following:

在你的情况下，这是很容易测试的。你所要做的就是改变IF块中指令的顺序;例如替换以下内容:

  switch (firstLetter) {
  case '0': loop0(); break;
  case '1': loop1(); break;
  case '2': loop2(); break;
  case '3': loop3(); break;
  case '4': loop4(); break;
  case '5': loop5(); break;
  case '6': loop6(); break;
  case '7': loop7(); break;
  case '8': loop8(); break;
  case '9': loop9(); break;
  default: goto die_usage;
  }

with:

  switch (firstLetter) {
  case '9': loop9(); break;
  case '8': loop8(); break;
  case '7': loop7(); break;
  case '6': loop6(); break;
  case '5': loop5(); break;
  case '4': loop4(); break;
  case '3': loop3(); break;
  case '2': loop2(); break;
  case '1': loop1(); break;
  case '0': loop0(); break;
  default: goto die_usage;
  }

or with any random order. Of course, you should check the generated assembly code to make sure that the order of the instructions has not been reordered by the compilator.

或者任意顺序。当然，您应该检查生成的汇编代码，以确保编译器没有重新排序指令的顺序。

Also, as your loops are inside individual functions; you should also make sure that these functions are themselves aligned on 64 bytes boundaries.

同样，当你的循环在单个函数内部时;您还应该确保这些函数本身在64字节边界上对齐。

#1