Real-time programming C performance dilemma

I am working on an embedded architecture where ASM is predominant. I would like to refactor most of our legacy ASM code in C in order to increase readability and modularity.

I am still puzzling over minor details that cause my hopes to vanish. The real problem is far more complex than the following example, but I would like to share it as an entry point to the discussion.

My goal is to find an optimal workaround.

Here is the original example (do not worry about what the code does. I wrote this randomly just to show the issue I would like to talk about).

int foo;
int bar;
int tmp;
int sum;

void do_something() {
    tmp = bar;
    bar = foo + bar;
    foo = foo + tmp;
}

void compute_sum() {
    for(tmp = 1; tmp < 3; tmp++)
        sum += foo * sum + bar * sum;
}

void a_function() {
    compute_sum();
    do_something();
}

With this dummy code, anyone would immediately remove all the global variables and replace them with local ones:

void do_something(int *a, int *b) {
    int tmp = *b;
    *b = *a + *b;
    *a = *a + tmp;
}

void compute_sum(int *sum, int foo, int bar) {
    int tmp;
    for(tmp = 1; tmp < 3; tmp++)
        *sum += foo * *sum + bar * *sum;
}

void a_function(int *sum, int *foo, int *bar) {
    compute_sum(sum, *foo, *bar);
    do_something(foo, bar);
}

Unfortunately this rework is worse than the original code because all the parameters are pushed onto the stack, which leads to latencies and larger code size.

The everything-global solution is both the best and the ugliest solution, especially when the source code is about 300k lines long with almost 3000 global variables.

Here we are not facing a compiler problem, but a structural issue. Writing beautiful, portable, readable, modular and robust code will never pass the ultimate performance test because compilers are dumb, even in 2015.

An alternative solution is to prefer inline functions. Unfortunately these functions have to be located in a header file, which is also ugly.
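
For readers unfamiliar with the pattern, here is a minimal sketch of what such a header might look like. The file name `swap_add.h` and the function `swap_add` are invented for illustration, and whether the call is actually inlined remains the compiler's decision.

```c
/* swap_add.h: hypothetical header, names invented for illustration */
#ifndef SWAP_ADD_H
#define SWAP_ADD_H

/* "static inline" gives every translation unit that includes this header
   its own copy of the function, so the compiler is free to inline it at
   each call site and no duplicate-symbol error occurs at link time. */
static inline void swap_add(int *a, int *b)
{
    int tmp = *b;      /* remember the old value of *b */
    *b = *a + *b;      /* b becomes a + b */
    *a = *a + tmp;     /* a becomes a + old b */
}

#endif /* SWAP_ADD_H */
```

Any source file that includes the header can call `swap_add(&x, &y)` as if it were an ordinary function.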

A compiler cannot see beyond the file it is working on. When a function is marked as extern it will irrevocably lead to performance issues. The reason is that the compiler cannot make any assumptions regarding the external declarations.

On the other hand, the linker could do the job and ask the compiler to rebuild object files by giving it additional information. Unfortunately not many compilers offer such features, and when they do, they considerably slow down the build process.

I eventually came across this dilemma:

Keep the code ugly to preserve performance
- Everything's global
- Functions without parameters (same as procedures)
- Keeping everything in the same file

Follow standards and write clean code
- Think of modules
- Write small but numerous functions with well-defined parameters
- Write small but numerous source files

What should I do when the target architecture has limited resources? Going back to assembly is my last option.

Additional Information

I am working on a SHARC architecture, which is a quite powerful Harvard CISC architecture. Unfortunately one code instruction takes 48 bits and a long only takes 32 bits. Given this, it is better to keep two versions of a variable rather than evaluating the second value on the fly:

The optimized example:

int foo;
int bar;
int half_foo;

void example_a() {
   write(foo); 
   write(half_foo + bar);
}

The bad one:

void example_a(int foo, int bar) {
   write(foo); 
   write(bar + (foo >> 1));
}
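
One way to keep the two versions of such a variable consistent is to route every write through a single setter. The sketch below is an assumption layered on the example above, not part of the original code: `set_foo`, `get_foo` and `get_half_foo` are hypothetical names, and the values are assumed to be non-negative so that the right shift is well defined.

```c
/* File-local state: foo and a cached copy of foo >> 1 (hypothetical layout) */
static int foo;
static int half_foo;

/* Single write path: updating foo always refreshes the cached half value,
   so readers never have to recompute the shift. */
static inline void set_foo(int value)
{
    foo = value;
    half_foo = value >> 1;   /* assumes value >= 0 */
}

/* Read accessors: each is a plain load, no arithmetic at the point of use */
static inline int get_foo(void)      { return foo; }
static inline int get_half_foo(void) { return half_foo; }
```

The cost of the shift moves to the (presumably rarer) write, while each read stays a single load.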

3 solutions

#1

I'm used to working in performance-critical core/kernel-type areas with very tight needs, where it is often beneficial to take optimizer and standard library performance with a grain of salt (e.g., not getting too excited about the speed of malloc or auto-generated vectorization).

However, I've never had needs so tight that the number of instructions or the cost of pushing more arguments onto the stack became a considerable concern. If that is indeed a major concern for the target system and performance tests are failing, one thing to note is that performance tests modeled at a micro level of granularity often leave you obsessed with the smallest of micro-efficiencies.

Micro-Efficiency Performance Tests

At a former workplace we made the mistake of writing all kinds of superficial micro-level tests that simply timed something as basic as reading one 32-bit float from a file. Meanwhile, we made optimizations that significantly sped up the broad, real-world test cases associated with reading and parsing the contents of entire files while, at the same time, some of those uber-micro tests actually got slower for no obvious reason (they weren't even directly modified, but changes to the code around them may have had some indirect impact relating to dynamic factors like caches, paging, etc., or merely how the optimizer treated such code).

So the micro-level world can get a bit more chaotic when you work with a higher-level language than assembly. The performance of the teeny things can shift under your feet a bit, but you have to ask yourself what's more important: a slight decrease in the performance of reading one 32-bit float from a file, or having real-world operations that read from entire files go significantly faster. Modeling your performance tests and profiling sessions at a higher level will give you room to selectively and productively optimize the parts that really matter. There you have many ways to skin a cat.

Run a profiler on an ultra-granular operation executed a million times over and you will already have backed yourself into an assembly-type micro-corner, just by the nature of how you are profiling the code. So you really want to zoom out a bit there and test things at a coarser level, so that you can act like a disciplined sniper and home in on the micro-efficiency of very select parts, dispatching the leaders behind inefficiencies rather than trying to be a hero taking out every little insignificant foot soldier that might be a performance obstacle.
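
As a rough illustration of timing at a coarser level, the sketch below measures one whole operation rather than a single tiny step inside it. `process_buffer` is a hypothetical stand-in for a real-world workload, and the sketch assumes the standard `clock()` from `<time.h>` is available with useful resolution, which may not hold on an embedded target where a hardware cycle counter would be the natural substitute.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical workload standing in for a complete real-world operation */
static long process_buffer(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[i];
    return total;
}

/* Times one complete call and returns the CPU time it took, in seconds.
   The workload's result is stored through *result. */
static double time_whole_operation(const int *data, size_t n, long *result)
{
    clock_t start = clock();
    *result = process_buffer(data, n);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Only if this coarse figure is too high does it become worth profiling inside `process_buffer` itself.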

Optimizing Linker

One of your misconceptions is that only the compiler can act as an optimizer. Linkers can perform a variety of optimizations when linking object files together, including inlining code. So there should rarely, if ever, be a need to jam everything into a single object file as an optimization. I'd try looking more into the settings of your linker if you find otherwise.

Interface Design

With these things aside, the key to a maintainable, large-scale codebase lies more in interface (i.e., header files) than implementation (source files). If you have a car with an engine that goes a thousand miles per hour, you might peer under the hood and find that there are little fire-breathing demons dancing around to allow that to happen. Perhaps there was a pact involved with demons to get such speed. But you don't have to expose that fact to the people driving the car. You can still give them a nice set of intuitive, safe controls to drive that beast.

So you might have a system that makes uninlined function calls 'expensive', but expensive relative to what? If you are calling a function that sorts a million elements, the relative cost of pushing a few small arguments to the stack like pointers and integers should be absolutely trivial no matter what kind of hardware you're dealing with. Inside the function, you might do all sorts of profiler-assisted things to boost performance like macros to forcefully inline code no matter what, perhaps even some inlined assembly, but the key to keeping that code from cascading its complexity throughout your system is to keep all that demon code hidden away from the people who are using your sort function and to make sure it's well-tested so that people don't have to constantly pop the hood trying to figure out the source of a malfunction.

Ignoring that 'relative to what?' question and only focusing on absolutes is also what leads to the micro-profiling which can be more counter-productive than helpful.

So I'd suggest looking at this more from a public interface design level, because behind an interface, if you look behind the curtains/under the hood, you might find all kinds of evil things going on to get that needed edge in performance in hotspot areas shown in a profiler. But you shouldn't need to pop the hood very often if your interfaces are well-designed and well-tested.
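
To make the idea concrete, here is a small sketch of a clean public function whose body hides a macro used to force inlining. The names `sort_ints` and `SWAP_INT` are invented for illustration, and a plain insertion sort stands in for whatever tuned code a real hotspot would contain.

```c
#include <stddef.h>

/* Implementation detail: a macro that forces the swap to be expanded in
   place. Callers of sort_ints never see or depend on it. */
#define SWAP_INT(x, y) \
    do { int swap_tmp_ = (x); (x) = (y); (y) = swap_tmp_; } while (0)

/* Public interface: sorts count integers in ascending order, in place.
   This prototype is the only thing a header would need to expose. */
void sort_ints(int *data, size_t count)
{
    /* Insertion sort: grow a sorted prefix one element at a time */
    for (size_t i = 1; i < count; i++) {
        for (size_t j = i; j > 0 && data[j - 1] > data[j]; j--) {
            SWAP_INT(data[j - 1], data[j]);
        }
    }
}
```

The body could later be swapped for inline assembly or a different algorithm without touching any caller.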

Globals become a bigger problem the wider their scope. If you have globals defined statically with internal linkage inside a source file that no one else can access, then those are actually rather 'local' globals. If thread-safety isn't a concern (if it is, then you should avoid mutable globals as much as possible), then you might have a number of performance-critical areas in your codebase where, if you peer under the hood, you find a lot of file-scope static variables used to mitigate the overhead of function calls. That's still a whole lot easier to maintain than assembly, especially when the visibility of such globals is reduced with smaller and smaller source files dedicated to performing more singular, clear responsibilities.
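
A minimal sketch of such 'local' globals, loosely modeled on the question's `compute_sum`, might look like the following. The module name `accumulator.c` and every identifier in it are hypothetical, and the code assumes a single thread of execution.

```c
/* accumulator.c: hypothetical module; names are invented for illustration */

/* Internal linkage: these variables are invisible outside this file */
static int acc_foo;
static int acc_bar;
static int acc_sum;

/* Loads the file-local state once, up front */
void accumulator_reset(int foo, int bar, int sum)
{
    acc_foo = foo;
    acc_bar = bar;
    acc_sum = sum;
}

/* Hot path: takes no parameters and works directly on the static state */
void accumulator_step(void)
{
    acc_sum += acc_foo * acc_sum + acc_bar * acc_sum;
}

/* Read-only access for the rest of the program */
int accumulator_sum(void)
{
    return acc_sum;
}
```

Callers only ever see the three functions, so the parameterless hot path stays an implementation detail of one small file.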

#2

Ugly C code is still a lot more readable than assembler. In addition, it's likely that you'll net some unexpected free optimizations.

A compiler cannot see beyond the file it is working on. When a function is marked as extern it will irrevocably lead to performance issues. The reason is that the compiler cannot make any assumptions regarding the external declarations.

False and false. Have you tried "Whole Program Optimization" yet? The benefits of inline functions, without having to organize into headers. Not that putting things in headers is necessarily ugly, if you organize the headers.

In your VisualDSP++ compiler, this is enabled by the -ipa switch.

The ccts compiler has a capability called interprocedural analysis (IPA), a mechanism that allows the compiler to optimize across translation units instead of within just one translation unit. This capability effectively allows the compiler to see all of the source files that are used in a final link at compilation time and make use of that information when optimizing.

All of the -ipa optimizations are invoked after the initial link, whereupon a special program called the prelinker reinvokes the compiler to perform the new optimizations.

#3

I have designed/written/tested/documented many, many real-time embedded systems.

Both 'soft' real time and 'hard' real time.

I can tell you from hard-earned experience that the algorithm used to implement the application is the place to make the biggest gains in speed.

Little stuff like a function call compared to inlined code is trivial unless performed thousands (or even hundreds of thousands) of times.
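
As a small illustration of how much the choice of algorithm can matter, the sketch below counts loop iterations for a linear and a binary search over the same sorted array. Both functions and their `steps` counters are invented for this example and do not come from any particular system.

```c
#include <stddef.h>

/* Linear search: examines elements one by one, up to n iterations.
   Returns the index of key, or -1 if absent; *steps counts iterations. */
static long linear_search(const int *sorted, size_t n, int key, unsigned *steps)
{
    *steps = 0;
    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        if (sorted[i] == key)
            return (long)i;
    }
    return -1;
}

/* Binary search: halves the half-open range [lo, hi) on every iteration,
   so the iteration count grows with log2(n) instead of n. */
static long binary_search(const int *sorted, size_t n, int key, unsigned *steps)
{
    size_t lo = 0, hi = n;
    *steps = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        (*steps)++;
        if (sorted[mid] == key)
            return (long)mid;
        if (sorted[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

For large inputs the gap between the two iteration counts dwarfs the cost of the function call used to reach either routine.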
