Is the -O3 optimization level dangerous in g++?

Date: 2021-05-17 02:12:45

I have heard from various sources (though mostly from a colleague of mine), that compiling with an optimisation level of -O3 in g++ is somehow 'dangerous', and should be avoided in general unless proven to be necessary.

Is this true, and if so, why? Should I just be sticking to -O2?

5 Answers

#1 (174 votes)

In the early days of gcc (2.8 etc.), in the times of egcs, and with Red Hat's gcc 2.96, -O3 was sometimes quite buggy. But that is over a decade ago now, and -O3 is not much different from the other optimization levels (in bugginess).

It does, however, tend to reveal cases where people rely on undefined behavior, because the optimizer leans more strictly on the rules, and especially the corner cases, of the language(s).

As a personal note, I have been running production software in the financial sector for many years now with -O3, and have not yet encountered a bug that would not also have been there had I used -O2.

By popular demand, here is an addition:

-O3, and especially additional flags like -funroll-loops (which is not enabled by -O3), can sometimes lead to more machine code being generated. Under certain circumstances (e.g. on a CPU with an exceptionally small L1 instruction cache) this can cause a slowdown, because the code of, say, some inner loop no longer fits into L1I. Generally gcc tries quite hard not to generate that much code, but since it usually optimizes for the generic case, this can happen. Options especially prone to this (like loop unrolling) are normally not included in -O3 and are marked accordingly in the manpage. As such it is generally a good idea to use -O3 for generating fast code, and only fall back to -O2 or -Os (which tries to optimize for code size) when appropriate (e.g. when a profiler indicates L1I misses).

If you want to take optimization to the extreme, you can tweak the costs associated with certain optimizations in gcc via --param. Additionally, note that gcc now has the ability to attach attributes to functions that control optimization settings just for those functions, so when you find you have a problem with -O3 in one function (or want to try out special flags for just that function), you don't need to compile the whole file, or even the whole project, with -O2.

On the other hand, it seems that care must be taken when using -Ofast, whose documentation states:

-Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard compliant programs.

which makes me conclude that -O3 is intended to be fully standards-compliant.

#2 (31 votes)

This is already said in Neel's answer, but not plainly or strongly enough:

In my somewhat checkered experience, applying -O3 to an entire program almost always makes it slower (relative to -O2), because it turns on aggressive loop unrolling and inlining that make the program no longer fit in the instruction cache. For larger programs, this can also be true for -O2 relative to -Os!

The intended use pattern for -O3 is, after profiling your program, you manually apply it to a small handful of files containing critical inner loops that actually benefit from these aggressive space-for-speed tradeoffs. With very recent GCC, I think the shiny new link-time profile-guided optimization mode can selectively apply the -O3 optimizations to hot functions -- effectively automating this process.

#3 (7 votes)

The -O3 option turns on more expensive optimizations, such as function inlining, in addition to all the optimizations of the lower levels -O2 and -O1. The -O3 optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower.

#4 (3 votes)

Some time ago, I ran into a problem with optimization. There was a PCI card that exposed its registers (for commands and data) as memory cells. My driver simply mapped the physical address of that memory to an application-level pointer and handed it to the calling process, which used it like this:

unsigned int * pciMemory;
askDriverForMapping( & pciMemory );
...
pciMemory[ 0 ] = someCommandIdx;
pciMemory[ 0 ] = someCommandLength;
for ( int i = 0; i < sizeof( someCommand ); i++ )
    pciMemory[ 0 ] = someCommand[ i ];

I was amazed that the card didn't act as expected. Only when I looked at the assembler output did I understand that the compiler wrote only someCommand[ the last ] into pciMemory, omitting all the preceding writes.

In conclusion: be accurate and attentive with optimization )))

#5 (3 votes)

Yes, -O3 is buggier. I'm a compiler developer, and I've identified clear and obvious gcc bugs caused by -O3 generating buggy SIMD assembly instructions when building my own software. From what I've seen, most production software ships with -O2, which means -O3 gets less attention with respect to testing and bug fixes.

Think of it this way: O3 adds more transformations on top of O2, which adds more transformations on top of O1. Statistically speaking, more transformations means more bugs. That's true for any compiler.
