C++ code optimization tips for ARM devices

Time: 2021-09-20 03:14:38

I have been developing C++ code for augmented reality on ARM devices, and optimizing the code is very important for keeping a good frame rate. To raise efficiency to the maximum level, I think it is important to gather general tips that make life easier for the compiler and reduce the number of cycles the program takes. Any suggestions are welcome.

1- Avoid high-cost instructions: division, square root, sin, cos

  • Use shifts to divide or multiply by powers of 2.

  • Multiply by the inverse when possible.

2- Optimize inner "for" loops: they are a bottleneck, so we should avoid doing many calculations inside them, especially divisions and square roots.

3- Use look-up tables for some mathematical functions (sin, cos, ...)

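The three tips above can be sketched roughly as follows. This is only an illustration, not a tuned implementation: the function names (`times8`, `divBy4`, `scaleAll`, `fastSin`) and the table size of 1024 are my own assumptions, and the lookup uses the nearest table entry, so it trades accuracy for speed. Also note that multiplying by `1/d` can differ from dividing by `d` in the last bit of floating-point precision.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Tip 1: for unsigned integers, shifts replace multiply/divide by powers of two.
// (Optimizing compilers usually do this on their own; shown for illustration.)
inline std::uint32_t times8(std::uint32_t x) { return x << 3; }
inline std::uint32_t divBy4(std::uint32_t x) { return x >> 2; }

// Tips 1 and 2: do the division once, outside the loop, and multiply inside it.
inline void scaleAll(std::vector<float>& values, float divisor) {
    const float inv = 1.0f / divisor;      // the only division
    for (float& v : values) v *= inv;      // multiplications only
}

// Tip 3: sine lookup table (hypothetical size; a power of two so that
// wrapping around the table is a cheap bit mask instead of a modulo).
constexpr std::size_t kTableSize = 1024;
constexpr float kTwoPi = 6.28318530717958647692f;

inline const std::array<float, kTableSize>& sineTable() {
    // Built once, on first use.
    static const std::array<float, kTableSize> table = [] {
        std::array<float, kTableSize> t{};
        for (std::size_t i = 0; i < kTableSize; ++i)
            t[i] = std::sin(kTwoPi * static_cast<float>(i) / kTableSize);
        return t;
    }();
    return table;
}

// Nearest-entry lookup. The angle is in radians and must be non-negative.
inline float fastSin(float radians) {
    assert(radians >= 0.0f);
    const float scaled = radians * (kTableSize / kTwoPi);
    const std::size_t idx =
        static_cast<std::size_t>(scaled + 0.5f) & (kTableSize - 1);
    return sineTable()[idx];
}
```

Whether the table actually beats `std::sin` depends on cache behaviour on the target, so it is worth measuring both.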

USEFUL TOOLS

  • objdump: produces the assembly code of a compiled program. This lets you compare two versions of a function and check whether one is really better optimized.
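As a small sketch of that workflow, here are two versions of the same loop that could be compiled and compared in the disassembly. The file name `scale.cpp` and the function names are made up for the example, and the cross-toolchain prefix in the comment is only one common possibility.

```cpp
// scale.cpp (hypothetical file name)
//
// Compile without linking, then disassemble with demangled C++ names:
//   g++ -O2 -c scale.cpp -o scale.o
//   objdump -d -C scale.o
// With a cross toolchain the tools usually carry a target prefix,
// for example arm-linux-gnueabihf-g++ and arm-linux-gnueabihf-objdump.
#include <cstddef>

// Version A: one division per element.
void scaleDiv(float* data, std::size_t n, float d) {
    for (std::size_t i = 0; i < n; ++i) data[i] = data[i] / d;
}

// Version B: one division in total, then a multiplication per element.
void scaleMul(float* data, std::size_t n, float d) {
    const float inv = 1.0f / d;
    for (std::size_t i = 0; i < n; ++i) data[i] *= inv;
}
```

Under default floating-point settings the compiler generally will not turn version A into version B by itself, because the two can round differently, so the inner loops should look different in the objdump output.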

2 Answers

#1


17  

To answer your question about general rules when optimizing C++ code for ARM, here are a few suggestions:

1) As you mentioned, division is expensive: many older ARM cores (for example Cortex-A8 and Cortex-A9) have no integer divide instruction at all. Use shifts or multiply by the inverse when possible.
2) Memory is much slower than CPU execution; use logical operations to avoid small lookup tables.
3) Try to write 32 bits at a time to make the best use of the write buffer. Writing shorts or chars will slow the code down considerably. In other words, it's faster to logical-OR the smaller values together and write them as DWORDs.
4) Be aware of your L1/L2 cache size. As a general rule, ARM chips have much smaller caches than Intel.
5) Use SIMD (NEON) when possible. NEON instructions are quite powerful and, for "vectorizable" code, can be quite fast. NEON intrinsics are available in most C++ environments and can be nearly as fast as hand-tuned ASM code.
6) Use the cache prefetch hint (PLD) to speed up looping reads. ARM doesn't have smart precache logic the way that modern Intel chips do.
7) Don't trust the compiler to generate good code. Look at the ASM output and rewrite hotspots in ASM. For bit/byte manipulation, the C language can't specify things as efficiently as they can be accomplished in ASM. ARM has powerful 3-operand instructions, multi-load/store and "free" shifts that can outperform what the compiler is capable of generating.
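
To illustrate point 3, here is a rough sketch that ORs four 8-bit colour channels into one 32-bit value and stores whole words instead of single bytes. The names `packRGBA` and `fillPixels` are invented for the example, and the channel order in memory (R, G, B, A) assumes a little-endian target.

```cpp
#include <cstddef>
#include <cstdint>

// Combine four 8-bit channels into one 32-bit word with shifts and ORs.
// On a little-endian target the bytes land in memory as R, G, B, A.
inline std::uint32_t packRGBA(std::uint8_t r, std::uint8_t g,
                              std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint32_t>(r)
         | (static_cast<std::uint32_t>(g) << 8)
         | (static_cast<std::uint32_t>(b) << 16)
         | (static_cast<std::uint32_t>(a) << 24);
}

// Fill a pixel buffer with one 32-bit store per pixel
// instead of four separate byte stores.
inline void fillPixels(std::uint32_t* dst, std::size_t pixelCount,
                       std::uint8_t r, std::uint8_t g,
                       std::uint8_t b, std::uint8_t a) {
    const std::uint32_t pixel = packRGBA(r, g, b, a);  // packed once
    for (std::size_t i = 0; i < pixelCount; ++i) dst[i] = pixel;
}
```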

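Points 5 and 6 could look something like the sketch below, which computes `out[i] = a[i] * gain + b[i]` four floats at a time with NEON intrinsics and issues a prefetch hint ahead of the reads. The function name `mulAdd` and the prefetch distance of 64 floats are assumptions picked for the example, not tuned values. `__builtin_prefetch` is a GCC/Clang builtin, which those compilers typically lower to PLD on ARM. When NEON is not available, the code falls back to the plain scalar loop.

```cpp
#include <cstddef>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical prefetch distance, in floats. Needs tuning per device.
constexpr std::size_t kPrefetchAhead = 64;

// out[i] = a[i] * gain + b[i]
inline void mulAdd(const float* a, const float* b, float* out,
                   std::size_t n, float gain) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const float32x4_t vgain = vdupq_n_f32(gain);   // gain in all four lanes
    for (; i + 4 <= n; i += 4) {
#if defined(__GNUC__)
        // Hint only: ask for data we will read a few iterations from now.
        if (i + kPrefetchAhead < n) {
            __builtin_prefetch(a + i + kPrefetchAhead);
            __builtin_prefetch(b + i + kPrefetchAhead);
        }
#endif
        const float32x4_t va = vld1q_f32(a + i);   // load 4 floats
        const float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vmlaq_f32(vb, va, vgain));  // vb + va * vgain
    }
#endif
    // Scalar tail, and the whole loop when NEON is not available.
    for (; i < n; ++i) out[i] = a[i] * gain + b[i];
}
```

It is worth checking the disassembly and timing this against the plain loop, since compilers can often auto-vectorize something this simple.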

#2


15  

The best way to optimize an application is to use a good profiler. It's always a good idea to write code with efficiency in mind, but you also want to avoid making changes where you merely "think" the code may be slow; if you're not 100% sure, this could make things worse.

Find out where the bottlenecks are and focus on those.

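When no profiler is at hand, a crude way to confirm a suspected bottleneck is to time the candidate code before and after a change. The helper below is only a sketch of that idea and no substitute for a real profiler; the name `averageMicros` is invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Runs `fn` `iterations` times and returns the average wall-clock
// time per call in microseconds.
template <typename Fn>
double averageMicros(Fn&& fn, std::size_t iterations) {
    assert(iterations > 0);
    using Clock = std::chrono::steady_clock;   // monotonic clock
    const auto start = Clock::now();
    for (std::size_t i = 0; i < iterations; ++i) fn();
    const std::chrono::duration<double, std::micro> elapsed =
        Clock::now() - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

A call would look like `averageMicros([&] { processFrame(frame); }, 100)`, where `processFrame` stands for whatever function is under suspicion. Make sure the timed code has an observable effect, otherwise the optimizer may remove it.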

For me profiling is an iterative process, because usually when you fix one bottleneck, other less important ones manifest themselves.

In addition to profiling the SW, check what sort of HW profiling is available. Check whether you can get different HW metrics, like cache misses, memory bus accesses, etc. This is also very helpful for finding out whether your memory bus or cache is a bottleneck.

I recently asked this similar question and got some good answers: Looking for a low impact c++ profiler
