C & C++ compilers are allowed to reorder operations as long as the as-if rule holds. What is an example of such a reordering performed by a compiler, and what is the potential performance gain to be had by doing it?
Examples involving any (C/C++) compiler on any platform are welcome.
3 Answers
#1
10
Suppose you have the following operations being performed:
int i=0,j=0;
i++;
i++;
i++;
j++;
j++;
j++;
Ignoring for the moment that the compiler would likely collapse the three increments into a single += 3, you will end up with higher processor-pipeline throughput if the operations are reordered as
i++;
j++;
i++;
j++;
i++;
j++;
since j++ doesn't have to wait for the result of i++, whereas in the original order most of the instructions had a data dependency on the instruction immediately before them. In more complicated computations, where there isn't an easy way to reduce the number of instructions to be performed, the compiler can still look at data dependencies and reorder instructions so that an instruction depending on the result of an earlier instruction is as far away from it as possible.
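To make the dependency-chain idea a bit more concrete, here is a minimal sketch in C. The function names sum_single_chain and sum_two_chains are made up for illustration, and unsigned arithmetic is used so that the regrouping cannot change the result. The first version has one long chain of dependent additions; the second keeps two independent chains that a pipelined CPU can overlap. An optimizing compiler may well perform this transformation (or a vectorized variant of it) on its own, so treat the second function as a picture of what the optimizer is allowed to do rather than something you normally have to write by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One accumulator: every addition depends on the one before it. */
uint32_t sum_single_chain(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t s = 0;
    for (size_t k = 0; k < n; k++)
        s += a[k];
    return s;
}

/* Two accumulators: the s0 chain and the s1 chain are independent,
 * so an addition into s1 does not have to wait for the previous
 * addition into s0. Unsigned addition wraps modulo 2^32, so the
 * final result is the same as in the single-chain version. */
uint32_t sum_two_chains(const uint32_t *a, size_t n)
{
    assert(a != NULL || n == 0);
    uint32_t s0 = 0, s1 = 0;
    size_t k = 0;
    for (; k + 1 < n; k += 2) {
        s0 += a[k];
        s1 += a[k + 1];
    }
    if (k < n)          /* odd element count: one element left over */
        s0 += a[k];
    return s0 + s1;
}
```

Both functions return the same value for any input; only the shape of the dependency graph differs.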
Another example of such an optimization arises with pure functions. Looking at a simple example again, assume you have a pure function f(int x) whose result you are summing in a loop.
int tot = 0;
int x;//something known only at runtime
for(int i = 0; i < 100; i++)
tot += f(x);
Since f is a pure function, the compiler can reorder calls to it as it pleases. In particular, it can transform this loop to
int tot = 0;
int x;//something known only at runtime
int fval = f(x);
for(int i = 0; i < 100; i++)
tot += fval;
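The compiler can only do this if it knows f is pure, for example because it can see the body of f, or because you told it so. As a rough sketch, assuming GCC or Clang, the const function attribute is one way to state that promise. The macro CONST_FN, the body of f, and the wrapper names sum_naive and sum_hoisted are hypothetical and only there to make the example compile.

```c
#include <assert.h>

/* GCC and Clang understand __attribute__((const)); on other compilers
 * this illustrative macro expands to nothing. */
#if defined(__GNUC__)
#define CONST_FN __attribute__((const))
#else
#define CONST_FN
#endif

/* Hypothetical pure function: the result depends only on x and the
 * function has no side effects. */
CONST_FN int f(int x)
{
    return x * x + 1;
}

/* The loop as written: f(x) appears in every iteration. */
int sum_naive(int x, int reps)
{
    assert(reps >= 0);
    int tot = 0;
    for (int i = 0; i < reps; i++)
        tot += f(x);
    return tot;
}

/* The loop after hoisting: f(x) is evaluated once, before the loop. */
int sum_hoisted(int x, int reps)
{
    assert(reps >= 0);
    int tot = 0;
    int fval = f(x);
    for (int i = 0; i < reps; i++)
        tot += fval;
    return tot;
}
```

Note that the hoisted version calls f once even when reps is 0, where the original would not call it at all. That is harmless only because f has no observable side effects, which is exactly what the as-if rule requires.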
#2
4
I'm sure there are quite a few examples where reordering operations will yield faster performance. An obvious one is moving loads as early as possible, since loads from memory are typically much slower than other CPU operations. By doing other, unrelated work while the memory is being fetched, the CPU can save time overall.
That is, given something like this:
expensive_calculation();
x = load();
do_something(x);
We can reorder it like this:
x = load();
expensive_calculation();
do_something(x);
So while we're waiting for the load to complete, we can essentially do expensive_calculation() for free.
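One caveat worth spelling out: the compiler may only move the load if it can show that expensive_calculation() cannot change the value being loaded. Here is a small sketch of a case where that is easy to see. The signatures, the slot parameter, and the function bodies are assumptions made up for illustration; the original snippet does not say what load() or expensive_calculation() actually do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for expensive_calculation(): it only reads
 * its input and writes nothing, so it cannot alter *slot below. */
unsigned expensive_calculation(const unsigned *data, size_t n)
{
    unsigned acc = 0;
    for (size_t k = 0; k < n; k++)
        acc = acc * 31u + data[k];
    return acc;
}

/* Source order: compute first, then load. */
unsigned compute_then_load(const unsigned *data, size_t n,
                           const unsigned *slot)
{
    assert(slot != NULL);
    unsigned r = expensive_calculation(data, n);
    unsigned x = *slot;   /* plays the role of x = load() */
    return r + x;
}

/* Reordered: the load is issued first, so its latency can overlap
 * with the calculation. The observable result is unchanged. */
unsigned load_then_compute(const unsigned *data, size_t n,
                           const unsigned *slot)
{
    assert(slot != NULL);
    unsigned x = *slot;
    unsigned r = expensive_calculation(data, n);
    return r + x;
}
```

If expensive_calculation() were an opaque call that might write to the loaded location, the two orders could give different results and the compiler would have to keep the original one.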
#3
4
Suppose you have a loop like:
for (i=0; i<n; i++) dest[i] = src[i];
Think memcpy. You might want the compiler to be able to vectorize this, i.e. load 8 or 16 bytes at a time and then store 8 or 16 bytes at a time. Making that transformation is a reordering, since it would cause src[1] to be read before dest[0] is stored. Moreover, unless the compiler knows that src and dest don't overlap, it's an invalid transformation, i.e. one the compiler is not allowed to make. Use of the restrict keyword (C99 and later) allows you to tell the compiler that they don't overlap, so that this kind of (extremely valuable) optimization is possible.
The same sort of thing arises all the time in operations on arrays that aren't just copying - things like vector/matrix operations, transformations of sound/image sample data, etc.
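Here is a small sketch of the two variants of the copy loop in C99. The function names copy_may_alias and copy_no_alias are made up for illustration. Note that restrict is a C keyword; standard C++ does not have it, although GCC, Clang and MSVC accept __restrict as an extension. In practice a compiler may still vectorize the first version by guarding the fast path with a runtime overlap check, so the second version mainly saves that check and makes the promise explicit.

```c
#include <assert.h>
#include <stddef.h>

/* No promise about aliasing: dest and src may overlap, so the
 * byte-by-byte order of loads and stores is observable and must be
 * preserved (or proven irrelevant) by the compiler. */
void copy_may_alias(unsigned char *dest, const unsigned char *src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}

/* restrict promises that dest and src do not overlap, so reading
 * src[1] before storing dest[0] cannot change the outcome and wide
 * loads and stores are allowed. Calling this with overlapping
 * buffers is undefined behavior. */
void copy_no_alias(unsigned char *restrict dest,
                   const unsigned char *restrict src, size_t n)
{
    assert(n == 0 || (dest != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dest[i] = src[i];
}
```

To see why overlap matters, take a 4-byte buffer holding 'a','b','c','d' and call copy_may_alias(buf + 1, buf, 3). Each store feeds the next load, so the buffer ends up as 'a','a','a','a'. A version that loaded all three source bytes first would produce 'a','a','b','c' instead, which is why the reordering is not allowed without the no-overlap guarantee.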