Performance of for loops with the same index

Time: 2022-12-04 21:19:27

While coding I came across a question:

Suppose I have to use a lot of for loops, each iterating over a different range. Is the performance (i.e. runtime) better if I declare just one variable and reuse it as the index (Example I), or does it not matter at all if each loop declares its own (Example II)?

Example I:

int ind;
for(ind=0; ind < a; ind++) { /*do something*/ }
for(ind=0; ind < b; ind++) { /*do something*/ }
...
for(ind=0; ind < z; ind++) { /*do something*/ }

Example II:

for(int ind=0; ind < a; ind++) { /*do something*/ }
...
for(int ind=0; ind < z; ind++) { /*do something*/ }

Thank you for your help

5 Solutions

#1


6  

If you're enabling optimisations (and if you don't, any discussion about performance is moot) then it's not possible to reason about what the compiler will do in the two scenarios.

The answer will depend upon:

  1. The toolchain
  2. The version of the toolchain
  3. What options the toolchain was built with
  4. What's happening inside the loop
  5. (related) whether the loop can be unrolled
  6. (related) whether the loop actually needs the index (if you're just indexing into arrays, all mention of i will usually be optimised away).
  7. ...etc

Here's how to write fast code:

  1. Write elegant code that succinctly expresses your intent.
  2. Check that your code is elegant and that it succinctly expresses your intent.
  3. Remove the bugs and go back to 2.
  4. Enable the optimiser.
  5. (this bit's important) Wait for users to complain that your code is too slow.
  6. If 5 didn't happen, stop.
  7. Measure where the most time is being spent and fix that. It won't be your loop counters, I can promise you that.

For the record, you should write it this way:

for(int ind=0; ind < a; ++ind)

Because that's more elegant (scope of ind is limited), less likely to be buggy, uses pre-increment for ind (better performance if ind ever happens to become a class type) and expresses intent (ind is used for this loop).

#2


4  

In practice, what matters is the number of iterations and the complexity of the do something part, not the way the index variable is defined.

Also, consider the Rules Of Optimization.

  1. don't optimize
  2. don't optimize yet
  3. profile before optimizing

#3


2  

In ancient times when dinosaurs walked the earth, there might have been something like: "at the point when the compiler encounters a local variable declaration, allocate space for it on the stack".

This could perhaps be the rationale why ancient dinosaur C only allowed variable declarations at the top of a block: ancient dinosaur compilers needed to know all variables in advance, before generating the code.

Then somewhere around the 80s, optimizing compilers started to allocate space for the variable at the point where it was first used, regardless of where that variable was actually declared. Not only would this reduce peak stack usage, it would also mean that the variable didn't need to be allocated at all if the function didn't use it. Some compilers would even go crazy effective and allocate the variable inside a CPU register instead of putting it on the stack!

And since then, that's how every compiler works. So unless you have stolen a compiler from some museum, this shouldn't be something you need to ponder.

Most likely your loop iterator will be allocated in a CPU register in both examples. I would call a compiler that generated slower code for either case broken. At worst, I suppose some compilers might get a bit confused over the different variable names and perhaps use different CPU registers for every loop - which would make the disassembled C code confusing to read but will not have any impact on performance.

As others have already mentioned, the best practice is to reduce the scope of every variable as much as possible, so you should use for(int ind=0; .... This has nothing to do with efficiency, but rather with readability, maintainability, avoiding unnecessary namespace pollution and so on. The only case where you need to declare the loop iterator before the loop is when you need to keep the value after the loop has ended.
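To illustrate that last case, here is a minimal sketch of a loop whose index is still needed after the loop ends. The function name first_negative and its parameters are made up for this example and do not come from the question.

```c
#include <stddef.h>

/* Returns the index of the first negative element,
 * or n if there is none.
 * The index is declared before the loop because its final
 * value is the result we want once the loop has ended. */
size_t first_negative(const int *values, size_t n) {
    size_t i;
    for (i = 0; i < n; i++) {
        if (values[i] < 0) {
            break;          /* i now holds the position we found */
        }
    }
    return i;               /* still in scope here, by design */
}
```

With for(size_t i = 0; ...) the return statement would not compile, since i would no longer be in scope after the loop.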

#4


1  

The only way to tell if something makes a difference is to measure (and be aware that the answer may differ from compiler to compiler and platform to platform).

My instinct is that compilers will generate identical code for the two samples though.

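A rough way to measure this yourself might look like the sketch below. It assumes clock() from <time.h> is precise enough for your purposes; the function names (sum_shared_index, sum_scoped_index, seconds_for) are hypothetical and only serve the illustration. An optimizer may simplify or hoist the repeated calls, so treat the timings as a starting point and check the generated assembly as well.

```c
#include <stddef.h>
#include <time.h>

/* Variant in the style of Example I: one index reused by both loops. */
long sum_shared_index(const int *data, size_t n) {
    size_t i;
    long total = 0;
    for (i = 0; i < n; i++) { total += data[i]; }
    for (i = 0; i < n; i++) { total += 2L * data[i]; }
    return total;
}

/* Variant in the style of Example II: each loop declares its own index. */
long sum_scoped_index(const int *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) { total += data[i]; }
    for (size_t i = 0; i < n; i++) { total += 2L * data[i]; }
    return total;
}

/* Calls fn `repeats` times and returns the CPU time used, in seconds.
 * The last result is stored in *result so the work is observable. */
double seconds_for(long (*fn)(const int *, size_t),
                   const int *data, size_t n, int repeats, long *result) {
    long last = 0;
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r) {
        last = fn(data, n);
    }
    clock_t end = clock();
    *result = last;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Run both variants on the same data with the same repeat count, several times, and compare. Both must return the same sum; if the timings differ by more than noise, that is worth a look at the compiler output.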

#5


1  

First of all, ind is so close to int that you typoed it in the question, so it's a bad choice of variable name. Using i as a loop index is a near-universal convention.


Any decent compiler will do lifetime analysis on the int i that's in scope for the whole function and see that the i=0 at the start disconnects it from its previous value. Uses of i after that are unrelated to uses before that, because of the unconditional assignment that didn't depend on anything computed from the previous value.

So from an optimizing compiler's perspective, there shouldn't be a difference. Any difference in the actual asm output should be considered a missed-optimization bug in whichever one is worse.

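As a source-level picture of that lifetime analysis (only a sketch, not what any particular compiler does internally), the two functions below should be equivalent from the optimizer's point of view. The names total_shared, total_renamed, i1 and i2 are invented for this illustration.

```c
/* As written: one function-scope index reused by both loops. */
int total_shared(const int *a, int na, const int *b, int nb) {
    int i, total = 0;
    for (i = 0; i < na; i++) { total += a[i]; }
    /* The unconditional i = 0 below overwrites the old value,
     * so nothing after this point depends on the first loop's i. */
    for (i = 0; i < nb; i++) { total += b[i]; }
    return total;
}

/* Roughly how the analysis can treat it: two unrelated variables
 * whose lifetimes do not overlap, so they can share one register. */
int total_renamed(const int *a, int na, const int *b, int nb) {
    int total = 0;
    for (int i1 = 0; i1 < na; i1++) { total += a[i1]; }
    for (int i2 = 0; i2 < nb; i2++) { total += b[i2]; }
    return total;
}
```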


In practice, gcc 5.3 -O3 -march=haswell targeting x86-64 makes identical loops for narrow scope vs. function scope in a simple test I did. I had to use three arrays inside the loop to get gcc to use indexed addressing modes instead of incrementing pointers, which is good because one-register addressing modes are more efficient on Intel SnB-family CPUs.

It reuses the same register for i in both loops, instead of saving/restoring another call-preserved register (e.g. r15). Thus, we can see this potential worry about more variables in a function leading to worse register allocation is not in fact a problem. gcc does a pretty good job most of the time.

These are the two functions I tested on godbolt (see the link above). They both compile to identical asm with gcc 5.3 -O3.

#include <unistd.h>
// int dup(int) is a function that the compiler won't have a built-in for
// it's convenient for looking at code with function calls.

void single_var_call(int *restrict dst, const int *restrict srcA,
                     const int *restrict srcB, int a) {
    int i;
    for(i=0; i < a; i++) { dst[i] = dup(srcA[i] + srcB[i]); }
    for(i=0; i < a; i++) { dst[i] = dup(srcA[i]) + srcB[i]+2; }
}

// Even with restrict, gcc doesn't fuse these loops together and skip the first store
// I guess it can't because the called function could have a reference to dst and look at it
void smaller_scopes_call(int *restrict dst, const int *restrict srcA,
                         const int *restrict srcB, int a) {
    for(int i=0; i < a; i++) { dst[i] = dup(srcA[i] + srcB[i]); }
    for(int i=0; i < a; i++) { dst[i] = dup(srcA[i]) + srcB[i]+2; }
}

For correctness / readability reasons: prefer for (int i=...)

The C++ / C99 style of limiting the scope of loop variables has advantages for humans working on the code. You can see right away that the loop counter isn't used outside the loop. (So can the compiler).

It's a good way to prevent errors like initializing the wrong variable.
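One hypothetical way such a mistake can show up is with nested loops that accidentally share a function-scope index. The sketch below is made up for illustration (count_cells_shared and count_cells_scoped are not from the answer); it contrasts the buggy form with the narrowly scoped one.

```c
/* Buggy: the inner loop re-initializes the outer loop's index.
 * For rows = 3, cols = 2 this returns 2 instead of 6, and for
 * some other inputs it never terminates. */
int count_cells_shared(int rows, int cols) {
    int i, count = 0;
    for (i = 0; i < rows; i++) {
        for (i = 0; i < cols; i++) {   /* meant to be a second index */
            count++;
        }
    }
    return count;
}

/* Scoped: each loop declares its own index, so there is no shared
 * variable to clobber, and the intent is visible at a glance. */
int count_cells_scoped(int rows, int cols) {
    int count = 0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            count++;
        }
    }
    return count;
}
```

When every loop has to declare its index, reaching for the wrong variable tends to surface as a compile error (an undeclared name) rather than a silent logic bug.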
