在使用c++实现长方程时，如何通过高级方法提高性能

I am developing some engineering simulations. This involves implementing some long equations such as this equation to calculate stress in a rubber like material:

我正在开发一些工程模拟。这涉及到实现一些长方程，如这个方程来计算橡胶类材料的应力:

T = (
    mu * (
            pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
            * (
                pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
                - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
            ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l1
            - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
            - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
        ) / a
    + K * (l1 * l2 * l3 - 0.1e1) * l2 * l3
) * N1 / l2 / l3

+ (
    mu * (
        - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
        + pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
        * (
            pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
            - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
        ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l2
        - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
    ) / a
    + K * (l1 * l2 * l3 - 0.1e1) * l1 * l3
) * N2 / l1 / l3

+ (
    mu * (
        - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
        - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
        + pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
        * (
            pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
            - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
        ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l3
    ) / a
+ K * (l1 * l2 * l3 - 0.1e1) * l1 * l2
) * N3 / l1 / l2;

I use Maple to generate the C++ code to avoid mistakes (and save time with tedious algebra). As this code is executed thousands (if not millions) of times, the performance is a concern. Unfortunately the math only simplifies so far; the long equations are unavoidable.

我使用Maple生成c++代码，以避免错误(并使用冗长的代数来节省时间)。由于该代码执行数千次(如果不是数百万次)，性能是一个问题。不幸的是，到目前为止数学化简了;冗长的方程式是不可避免的。

What approach can I take to optimize this implementation? I'm looking for high-level strategies that I should be applying when implementing such equations, not necessarily specific optimizations for the example shown above.

我可以采用什么方法来优化这个实现?我正在寻找在实现此类方程时应该应用的高级策略，而不是对上面所示的示例进行特定的优化。

I'm compiling using g++ with --enable-optimize=-O3.

我用g++进行编译，使-enabl -最优化=-O3。

Update:

更新:

I know there are a lot of repeated expressions, I am using the assumption that the compiler would handle these; my tests so far suggest it does.

我知道有很多重复的表达式，我假设编译器会处理这些;到目前为止，我的测试表明它是正确的。

l1, l2, l3, mu, a, K are all positive real numbers (not zero).

l1 l2 l3 a K都是正数(不是0)

I have replaced l1*l2*l3 with an equivalent variable: J. This did help improve performance.

我将l1*l2*l3替换为等效变量j，这确实有助于提高性能。

Replacing pow(x, 0.1e1/0.3e1) with cbrt(x) was a good suggestion.

用cbrt(x)替换pow(x, 0.1e1/0.3e1)是一个很好的建议。

This will be run on CPUs, In the near future this would likely run better on GPUs, but for now that option is not available.

这将在cpu上运行，在不久的将来，这可能会在gpu上运行得更好，但是目前还没有这个选项。

10 个解决方案

#1

Edit summary

My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example, pow(x, 0.1e1/0.3e1) is the same as cbrt(x).
我最初的回答只是指出，代码包含了大量的复制计算，而且很多幂都包含了1/3的因数。例如，pow(x, 0.1e1/0.3e1)与cbrt(x)相同。
My second edit was just wrong, and and my third extrapolated on this wrongness. This is what makes people afraid to change the oracle-like results from symbolic math programs that start with the letter 'M'. I've stricken out (i.e., ~~strike~~ ) those edits and pushed them to the bottom of the current revision of this answer. However, I did not delete them. I'm human. It's easy for us to make a mistake.
我的第二次编辑是错误的，第三次推断是错误的。这就是为什么人们害怕改变从字母“M”开始的符号数学程序的结果。(也就是我的。这些编辑并将它们推到当前对这个答案的修订的底部。但是，我没有删除它们。我是人类。我们很容易犯错误。
My fourth edit developed a very compact expression that correctly represents the convoluted expression in the question IF the parameters l1, l2, and l3 are positive real numbers and if a is a non-zero real number. (We have yet to hear from the OP regarding the specific nature of these coefficients. Given the nature of the problem, these are reasonable assumptions.)
我的第四次编辑开发了一个非常紧凑的表达式，如果参数l1、l2和l3是正数，如果a是非零实数，那么它就可以正确地表示问题中的卷积表达式。(我们还没有从OP中得知这些系数的具体性质。鉴于问题的本质，这些都是合理的假设。
This edit attempts to answer the generic problem of how to simplify these expressions.
此编辑尝试回答如何简化这些表达式的一般问题。

First things first

I use Maple to generate the C++ code to avoid mistakes.

我使用Maple生成c++代码以避免错误。

Maple and Mathematica sometimes miss the obvious. Even more importantly, the users of Maple and Mathematica sometimes make mistakes. Substituting "oftentimes", or maybe even "almost always", in lieu of "sometimes is probably closer to the mark.

Maple和Mathematica有时会漏掉那些显而易见的东西。更重要的是，枫树和Mathematica的用户有时会出错。用“often”，甚至“almost always”来代替“sometimes is probably closer to the mark”。

You could have helped Maple simplify that expression by telling it about the parameters in question. In the example at hand, I suspect that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions that are not valid in the complex numbers.

你可以通过告诉它有关的参数来帮助Maple简化这个表达式。在手边的示例中，我怀疑l1、l2和l3是正实数，而a是非零实数。如果是这样，那就说出来。这些符号数学程序通常假设手边的数量是复杂的。限制域允许程序做出在复数中无效的假设。

How to simplify those big messes from symbolic math programs (this edit)

Symbolic math programs typically provide the ability to provide information about the various parameters. Use that ability, particularly if your problem involves division or exponentiation. In the example at hand, you could have helped Maple simplify that expression by telling it that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions such as a^xb^x=(ab)^x. This is only if a and b are positive real numbers and if x is real. It is not valid in the complex numbers.

符号数学程序通常提供关于各种参数的信息。使用这种能力，特别是当你的问题涉及到除法或求幂时。在这个例子中，你可以通过告诉它l1 l2 l3是正实数a是非零实数来简化这个表达式。如果是这样，那就说出来。这些符号数学程序通常假设手边的数量是复杂的。限制域允许程序进行假设，比如axbx=(ab)x。这只适用于a和b是正数，x是实数。它在复数中是无效的。

Ultimately, those symbolic math programs follow algorithms. Help it along. Try playing with expanding, collecting, and simplifying before you generate code. In this case, you could have collected those terms involving a factor of mu and those involving a factor of K. Reducing an expression to its "simplest form" remains a bit of an art.

最终，这些符号数学程序遵循算法。帮助它。在生成代码之前，尝试扩展、收集和简化。在这种情况下，你可能已经收集了涉及到的系数mu和涉及到k的系数k的这些术语，将表达式简化为“最简单的形式”仍然是一种艺术。

When you get an ugly mess of generated code, don't accept it as a truth that you must not touch. Try to simplify it yourself. Look at what the symbolic math program had before it generated code. Look at how I reduced your expression to something much simpler and much faster, and how Walter's answer took mine several steps further. There is no magic recipe. If there was a magical recipe, Maple would have applied it and given the answer that Walter gave.

当你得到一堆乱七八糟的生成的代码时，不要认为这是你不能碰的事实。试着把它简化一下。看看它生成代码之前的符号数学程序。看看我是如何把你的表情简化成更简单、更快的表情的，以及沃尔特的回答是如何让我的表情更进一步的。没有神奇的配方。如果有一个神奇的配方，Maple就会应用它并给出Walter给出的答案。

About the specific question

You are doing a lot of addition and subtraction in that calculation. You can get in deep trouble if you have terms that nearly cancel one another. You are wasting a lot of CPU if you have one term that dominates over the others.

你在计算中做了大量的加减运算。如果你有一些几乎互相抵消的项，你就会有大麻烦。如果你有一个术语支配着其他术语，你就浪费了大量的CPU资源。

Next, you are wasting a lot of CPU by performing repeated calculations. Unless you have enabled -ffast-math, which lets the compiler break some of the rules of IEEE floating point, the compiler will not (in fact, must not) simplify that expression for you. It will instead do exactly what you told it to do. At a minimum, you should calculate l1 * l2 * l3 prior to computing that mess.

接下来，通过执行重复计算，您将浪费大量CPU。除非您启用了-ffast-math，它允许编译器破坏IEEE浮点数的一些规则，否则编译器不会(实际上，一定不会)为您简化表达式。相反，它会按照你说的去做。至少，在计算这种混乱之前，应该先计算l1 * l2 * l3。

Finally, you are making a lot of calls to pow, which is extremely slow. Note that several of those calls are of the form (l1*l2*l3)^(1/3). Many of those calls to pow could be performed with a single call to std::cbrt:

最后，您正在对pow进行大量的调用，这是非常缓慢的。注意，其中有几个调用是形式(l1*l2*l3)(1/3)。许多对pow的调用可以通过一个对std::cbrt的调用来执行:

l123 = l1 * l2 * l3;
l123_pow_1_3 = std::cbrt(l123);
l123_pow_4_3 = l123 * l123_pow_1_3;

With this,

用这个,

X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) becomes X * l123_pow_1_3.
X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1)变成X * l123_pow_1_3。
X * pow(l1 * l2 * l3, -0.1e1 / 0.3e1) becomes X / l123_pow_1_3.
X * pow(l1 * l2 * l3， -0.1e1 / 0.3e1)变为X / l123_pow_1_3。
X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1) becomes X * l123_pow_4_3.
X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1)变为X * l123_pow_4_3。
X * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) becomes X / l123_pow_4_3.
X * pow(l1 * l2 * l3， -0.4e1 / 0.3e1)变为X / l123_pow_4_3。

Maple did miss the obvious.
For example, there's a much easier way to write

枫树确实没注意到。例如，有一种更简单的书写方式

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principle complex root) are to be extracted, the above reduces to

假设l1、l2和l3都是实数而不是复数，并且要提取真正的立方体根(而不是主复数根)，那么上面的式子可以简化为

2.0/(3.0 * pow(l1 * l2 * l3, 1.0/3.0))

或

2.0/(3.0 * l123_pow_1_3)

Using cbrt_l123 instead of l123_pow_1_3, the nasty expression in the question reduces to

使用cbrt_l123而不是l123_pow_1_3，这个问题中令人讨厌的表达式变为

l123 = l1 * l2 * l3; 
cbrt_l123 = cbrt(l123);
T = 
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Always double check, but always simplify as well.

总是重复检查，但也要简化。

Here are some of my steps in arriving at the above:

以下是我实现上述目标的一些步骤:

// Step 0: Trim all whitespace.
T=(mu*(pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1+pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l2-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1+pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l3)/a+K*(l1*l2*l3-0.1e1)*l1*l2)*N3/l1/l2;

// Step 1:
//   l1*l2*l3 -> l123
//   0.1e1 -> 1.0
//   0.4e1 -> 4.0
//   0.3e1 -> 3
l123 = l1 * l2 * l3;
T=(mu*(pow(l1*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l1-pow(l2*pow(l123,-1.0/3),a)*a/l1/3-pow(l3*pow(l123,-1.0/3),a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l2/3+pow(l2*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l2-pow(l3*pow(l123,-1.0/3),a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l3/3-pow(l2*pow(l123,-1.0/3),a)*a/l3/3+pow(l3*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 2:
//   pow(l123,1.0/3) -> cbrt_l123
//   l123*pow(l123,-4.0/3) -> pow(l123,-1.0/3)
//   (pow(l123,-1.0/3)-pow(l123,-1.0/3)/3) -> 2.0/(3.0*cbrt_l123)
//   *pow(l123,-1.0/3) -> /cbrt_l123
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T=(mu*(pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1-pow(l2/cbrt_l123,a)*a/l1/3-pow(l3/cbrt_l123,a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l2/3+pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2-pow(l3/cbrt_l123,a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l3/3-pow(l2/cbrt_l123,a)*a/l3/3+pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 3:
//   Whitespace is nice.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)*a/l1/3
       -pow(l3/cbrt_l123,a)*a/l1/3)/a
   +K*(l123-1.0)*l2*l3)*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l2/3
       +pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)*a/l2/3)/a
   +K*(l123-1.0)*l1*l3)*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l3/3
       -pow(l2/cbrt_l123,a)*a/l3/3
       +pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a
   +K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 4:
//   Eliminate the 'a' in (term1*a + term2*a + term3*a)/a
//   Expand (mu_term + K_term)*something to mu_term*something + K_term*something
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +K*(l123-1.0)*l2*l3*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +K*(l123-1.0)*l1*l3*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l3))*N3/l1/l2
 +K*(l123-1.0)*l1*l2*N3/l1/l2;

// Step 5:
//   Rearrange
//   Reduce l2*l3*N1/l2/l3 to N1 (and similar)
//   Reduce 2.0/(3.0*cbrt_l123)*cbrt_l123/l1 to 2.0/3.0/l1 (and similar)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/3.0/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/3.0/l3))*N3/l1/l2
 +K*(l123-1.0)*N1
 +K*(l123-1.0)*N2
 +K*(l123-1.0)*N3;

// Step 6:
//   Factor out mu and K*(l123-1.0)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*(  ( pow(l1/cbrt_l123,a)*2.0/3.0/l1
         -pow(l2/cbrt_l123,a)/l1/3
         -pow(l3/cbrt_l123,a)/l1/3)*N1/l2/l3
      + (-pow(l1/cbrt_l123,a)/l2/3
         +pow(l2/cbrt_l123,a)*2.0/3.0/l2
         -pow(l3/cbrt_l123,a)/l2/3)*N2/l1/l3
      + (-pow(l1/cbrt_l123,a)/l3/3
         -pow(l2/cbrt_l123,a)/l3/3
         +pow(l3/cbrt_l123,a)*2.0/3.0/l3)*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 7:
//   Expand
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1*N1/l2/l3
      -pow(l2/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l3/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l1/cbrt_l123,a)/l2/3*N2/l1/l3
      +pow(l2/cbrt_l123,a)*2.0/3.0/l2*N2/l1/l3
      -pow(l3/cbrt_l123,a)/l2/3*N2/l1/l3
      -pow(l1/cbrt_l123,a)/l3/3*N3/l1/l2
      -pow(l2/cbrt_l123,a)/l3/3*N3/l1/l2
      +pow(l3/cbrt_l123,a)*2.0/3.0/l3*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 8:
//   Simplify.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Wrong answer, intentionally kept for humility

Note that this is stricken. It's wrong.

注意，这是受到了打击。这是错误的。

~~Update~~

更新

Maple did miss the obvious. For example, there's a much easier way to write

枫树确实没注意到。例如，有一种更简单的书写方式

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

(pow(l1 * l2 * l3， -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3， -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principle complex root) are to be extracted, the above reduces to zero. This calculation of zero is repeated many times over.

假设l1、l2和l3是实数，而不是复数，而实际的立方根(而不是主复根)将被提取，上述值将减少为零。这个零的计算重复了很多次。

Second update

第二次更新

If I've done the math right (there is no guarantee that I've done the math right), the nasty expression in the question reduces to

如果我做对了数学(不能保证我做对了数学)，那么问题中令人讨厌的表达就变成了

l123 = l1 * l2 * l3; 
cbrt_l123_inv = 1.0 / cbrt(l123);
nasty_expression =
    K * (l123 - 1.0) * (N1 + N2 + N3) 
    - (  pow(l1 * cbrt_l123_inv, a) * (N2 + N3) 
       + pow(l2 * cbrt_l123_inv, a) * (N1 + N3) 
       + pow(l3 * cbrt_l123_inv, a) * (N1 + N2)) * mu / (3.0*l123);

~~The above assumes that l1, l2, and l3 are positive real numbers.~~

上面假设l1、l2和l3是正实数。

#2

First thing to note is that pow is really expensive, so you should get rid of this as much as possible. Scanning through the expression I see many repetitions of pow(l1 * l2 * l3, -0.1e1 / 0.3e1) and pow(l1 * l2 * l3, -0.4e1 / 0.3e1). So I would expect a big gain from pre-computing those:

首先要注意的是pow非常昂贵，所以你应该尽可能地摆脱它。透过这个表达，我看到许多重复的pow(l1 * l2 * l3， -0.1e1 / 0.3e1)和pow(l1 * l2 * l3， -0.4e1 / 0.3e1)。因此，我预计，这些技术将会带来巨大的收益:

 const double c1 = pow(l1 * l2 * l3, -0.1e1 / 0.3e1);
const double c2 = boost::math::pow<4>(c1);

where I am using the boost pow function.

我使用的是boost pow函数。

Furthermore, you have some more pow with exponent a. If a is Integer and known at compiler time, you can also replace those with boost::math::pow<a>(...) to gain further performance. I would also suggest to replace terms like a / l1 / 0.3e1 with a / (l1 * 0.3e1) as multiplication is faster then division.

此外，你还有一些指数a的pow。如果a是整数，并且在编译时已知，你也可以用boost::math::::pow(…)来替换它们，以获得进一步的性能。我还建议将a / l1 / 0.3e1等项替换为a / (l1 * 0.3e1)，因为乘法比除法快。

Finally, if you use g++ you can use the -ffast-math flag that allows the optimizer to be more aggressive in transforming equations. Read about what this flag actually does, as it has side effects though.

最后，如果您使用g++，您可以使用-ffast-math标记，它允许优化器在转换方程时更加积极。请阅读这面旗帜的实际作用，因为它有副作用。

#3

Woah, what a hell of an expression. Creating the expression with Maple actually was a suboptimal choice here. The result is simply unreadable.

哇，真是个该死的表达。在这里，用Maple创建表达式实际上是一个不太理想的选择。结果是无法读懂的。

chose speaking variable names (not l1, l2, l3, but e.g. height, width, depth, if that is what they mean). Then it is easier for you to understand your own code.
选择说变量名(不是l1, l2, l3，而是高度，宽度，深度，如果这是它们的意思的话)。这样您就更容易理解自己的代码。
calculate subterms, that you use multiple times, upfront and store the results in variables with speaking names.
先计算多次使用的子项，并将结果存储在带有发音名称的变量中。
You mention, that the expression is evaluated very many times. I guess, only few parameters vary in the inner most loop. Calculate all invariant subterms before that loop. Repeat for the second inner loop and so on until all invariants are outside the loop.
您提到，表达式被计算了很多次。我猜，在最内部的循环中只有很少的参数变化。在循环之前计算所有不变子项。重复第二个内部循环，直到所有不变量都在循环之外。

Theoretically the compiler should be able to do all of that for you, but sometimes it can't - e.g. when the loop nesting spreads over multiple functions in different compilation units. Anyway, that will give you much better readable, understandable, and maintainable code.

理论上，编译器应该能够为您完成所有这些工作，但有时不能——例如，当循环嵌套在不同编译单元中的多个函数上时。无论如何，这将给您提供更好的可读、可理解和可维护的代码。

#4

David Hammen's answer is good, but still far from optimal. Let's continue with his last expression (at the time of writing this)

大卫·哈曼的答案是好的，但仍远不是最佳答案。让我们继续他的最后一个表达(在写这篇文章的时候)

auto l123 = l1 * l2 * l3;
auto cbrt_l123 = cbrt(l123);
T = mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                   + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                   + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
  + K*(l123-1.0)*(N1+N2+N3);

which can be optimised further. In particular, we can avoid the call to cbrt() and one of the calls to pow() if exploiting some mathematical identities. Let's do this again step by step.

可以进一步优化。特别是，如果利用一些数学恒等式，我们可以避免调用cbrt()和调用pow()。让我们一步一步地再做一次。

// step 1 eliminate cbrt() by taking the exponent into pow()
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a; // avoid division
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)*pow(l1*l1/(l2*l3),athird)
                   + (N2+N2-N3-N1)*pow(l2*l2/(l1*l3),athird)
                   + (N3+N3-N1-N2)*pow(l3*l3/(l1*l2),athird))
  + K*(l123-1.0)*(N1+N2+N3);

Note that I have also optimised 2.0*N1 to N1+N1 etc. Next, we can do with only two calls to pow().

注意，我还优化了2.0*N1到N1+N1等。接下来，我们只需要对pow()进行两次调用。

// step 2  eliminate one call to pow
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a;
auto pow_l1l2_athird = pow(l1/l2,athird);
auto pow_l1l3_athird = pow(l1/l3,athird);
auto pow_l2l3_athird = pow_l1l3_athird/pow_l1l2_athird;
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)* pow_l1l2_athird*pow_l1l3_athird
                   + (N2+N2-N3-N1)* pow_l2l3_athird/pow_l1l2_athird
                   + (N3+N3-N1-N2)/(pow_l1l3_athird*pow_l2l3_athird))
  + K*(l123-1.0)*(N1+N2+N3);

Since the calls to pow() are by far the most costly operation here, it is worth to reduce them as far as possible (the next costly operation was the call to cbrt(), which we eliminated).

由于对pow()的调用是到目前为止最昂贵的操作，所以尽可能地减少它们是值得的(下一个代价高昂的操作是对cbrt()的调用，我们删除了它)。

If by any chance a is integer, the calls to pow could be optimized to calls to cbrt (plus integer powers), or if athird is half-integer, we can use sqrt (plus integer powers). Furthermore, if by any chance l1==l2 or l1==l3 or l2==l3 one or both calls to pow can be eliminated. So, it's worth to consider these as special cases if such chances realistically exist.

如果碰巧a是整数，对pow的调用可以优化为对cbrt的调用(加上整数幂)，或者如果三分之一是半整数，我们可以使用sqrt(加上整数幂)。此外，如果l1==l2或l1= l3或l2==l3，则可以消除对pow的一次或两次调用。因此，如果这些机会真的存在，就值得考虑这些特殊情况。

#5

How many is "many many"?
多少是“许多”?
How long does it take?
要花多长时间?
Do ALL parameters change between recalculation of this formula? Or can you cache some pre-calculated values?
在重新计算这个公式时，所有的参数都改变了吗?或者可以缓存一些预先计算的值?

I've attempted a manual simplification of that formula, would like to know if it saves anything?

我已经尝试过手动化简这个公式，想知道它能节省什么吗?

C1 = -0.1e1 / 0.3e1;
C2 =  0.1e1 / 0.3e1;
C3 = -0.4e1 / 0.3e1;

X0 = l1 * l2 * l3;
X1 = pow(X0, C1);
X2 = pow(X0, C2);
X3 = pow(X0, C3);
X4 = pow(l1 * X1, a);
X5 = pow(l2 * X1, a);
X6 = pow(l3 * X1, a);
X7 = a / 0.3e1;
X8 = X3 / 0.3e1;
X9 = mu / a;
XA = X0 - 0.1e1;
XB = K * XA;
XC = X1 - X0 * X8;
XD = a * XC * X2;

XE = X4 * X7;
XF = X5 * X7;
XG = X6 * X7;

T = (X9 * ( X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (-XE + X5 * XD - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (-XE - XF + X6 * XD) / l3 + XB * l1 * l2) * N3 / l1 / l2;

[ADDED] I've worked some more on the last three-lines formula and got it down to this beauty:

[补充]我在最后的三行公式上做了更多的工作，并将其归结为这样的美丽:

T = X9 / X0 * (
      (X4 * XD - XF - XG) * N1 + 
      (X5 * XD - XE - XG) * N2 + 
      (X5 * XD - XE - XF) * N3)
  + XB * (N1 + N2 + N3)

Let me show my work, step by step:

让我一步一步地展示我的工作:

T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (X5 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / l1 / l2;


T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / (l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / (l1 * l3) 
  + (X9 * (X5 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / (l1 * l2);

T = (X9 * (X4 * XD - XF - XG) + XB * l1 * l2 * l3) * N1 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) + XB * l1 * l2 * l3) * N2 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XF) + XB * l1 * l2 * l3) * N3 / (l1 * l2 * l3);

T = (X9 * (X4 * XD - XF - XG) + XB * X0) * N1 / X0 
  + (X9 * (X5 * XD - XE - XG) + XB * X0) * N2 / X0 
  + (X9 * (X5 * XD - XE - XF) + XB * X0) * N3 / X0;

T = X9 * (X4 * XD - XF - XG) * N1 / X0 + XB * N1 
  + X9 * (X5 * XD - XE - XG) * N2 / X0 + XB * N2
  + X9 * (X5 * XD - XE - XF) * N3 / X0 + XB * N3;


T = X9 * (X4 * XD - XF - XG) * N1 / X0 
  + X9 * (X5 * XD - XE - XG) * N2 / X0
  + X9 * (X5 * XD - XE - XF) * N3 / X0
  + XB * (N1 + N2 + N3)

#6

This may be a little terse, but I've actually found good speedup for polynomials (interpolation of energy functions) by using Horner Form, which basically rewrites ax^3 + bx^2 + cx + d as d + x(c + x(b + x(a))). This will avoid a lot of repeated calls to pow() and stops you from doing silly things like separately calling pow(x,6) and pow(x,7) instead of just doing x*pow(x,6).

这可能有点简短,但我已经发现好加速多项式(能量函数的插值)用霍纳形式,基本上重写ax ^ 3 + bx ^ 2 +残雪+ d d + x(c + x(b + x(a)))。这将避免对pow()的多次调用，并阻止您做一些愚蠢的事情，比如分别调用pow(x,6)和pow(x,7)，而不是只做x*pow(x,6)。

This is not directly applicable to your current problem, but if you have high order polynomials with integer powers it can help. You might have to watch out for numerical stability and overflow issues since the order of operations is important for that (although in general I actually think Horner Form helps for this, since x^20 and x are usually many orders of magnitude apart).

这并不直接适用于当前的问题，但如果您有具有整数幂的高阶多项式，这将有所帮助。你可能不得不提防数值稳定性和溢出问题由于操作的顺序是重要的(虽然一般而言我认为霍纳形式帮助,因为x ^ 20和x通常是许多数量级除外)。

Also as a practical tip, if you haven't done so already, try to simplify the expression in maple first. You can probably get it to do most of the common subexpression elimination for you. I don't know how much it affects the code generator in that program in particular, but I know in Mathematica doing a FullSimplify before generating the code can result in a huge difference.

还有一个实用的提示，如果您还没有这样做，请先简化maple中的表达式。您可能可以让它为您完成大多数常见的子表达式消除。我不知道它对程序中的代码生成器有多大的影响，但我知道在Mathematica中，在生成代码之前做一个完整的简化会产生巨大的差异。

#7

It looks like you have a lot of repeated operations going on.

看起来你有很多重复的操作。

pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
pow(l1 * l2 * l3, -0.4e1 / 0.3e1)

You could pre-calculate those so you are not repeatedly calling the pow function which can be expensive.

你可以预先计算它们，这样你就不会反复调用pow函数，这会很昂贵。

You could also pre-calutate

你也可以pre-calutate

l1 * l2 * l3

as you use that term repeatedly.

当你反复使用这个词的时候。

#8

If you have a Nvidia CUDA graphics card, you could consider offloading the calculations to the graphics card - which itself is more suitable for computationally complicated calculations.

如果你有一个Nvidia CUDA显卡，你可以考虑将计算转移到显卡上——显卡本身更适合计算复杂的计算。

https://developer.nvidia.com/how-to-cuda-c-cpp

If not, you may want to consider multiple threads for calculations.

如果不是，您可能需要考虑使用多个线程进行计算。

#9

By any chance, could you supply the calculation symbolically. If there are vector operations, you might really want to investigate using blas or lapack which in some cases can run operations in parallel.

无论如何，你能象征性地给出计算结果吗?如果有矢量操作，您可能真的想使用blas或lapack进行研究，在某些情况下，它们可以并行运行操作。

It is conceivable (at the risk of being off-topic?) that you might be able to use python with numpy and/or scipy. To the extent that it was possible, your calculations might be more readable.

可以想象(冒着偏离主题的风险?)您可能可以使用带有numpy和/或scipy的python。在可能的范围内，您的计算可能更容易读懂。

#10

As you explicitly asked about high level optimizations, it might be worth trying different C++ compilers. Nowadays, compilers are very complex optimization beasts and CPU vendors might implement very powerful and specific optimizations. But please note, some of them are not free (but there might be a free academic program).

当您明确地询问高级优化时，您可能需要尝试不同的c++编译器。现在，编译器是非常复杂的优化工具，CPU供应商可能会实现非常强大和特定的优化。但请注意，有些课程不是免费的(但可能有免费的学术课程)。

GNU compiler collection is free, flexible, and available on many architectures
GNU编译器集合是免费的、灵活的，并且可以在许多架构上使用。
Intel compilers are very fast, very expensive, and may also produce good results for AMD architectures (I believe there is an academic program)
英特尔编译器非常快速，非常昂贵，并且可能为AMD架构产生良好的结果(我相信有一个学术程序)
Clang compilers are fast, free, and might produce similar results to GCC (some people say they are faster, better, but this might differ for each application case, I suggest to make your own experiences)
Clang编译器是快速、免费的，并且可能会产生与GCC类似的结果(有些人说它们更快、更好，但是对于每个应用程序，这可能有所不同，我建议您自己体验一下)
PGI (Portland Group) is not free as the Intel compilers.
PGI(波特兰集团)不是免费的英特尔编译器。
PathScale compilers might perform good results on AMD architectures
PathScale编译器可能在AMD架构上执行良好的结果

I've seen code snippets differ in execution speed by the factor of 2, only by changing the compiler (with full optimizations of course). But be aware of checking the identity of the output. To aggressive optimization might lead to different output, which is something you definitely want to avoid.

我看到代码片段执行速度的差异是2倍，只是通过修改编译器(当然是完全优化)。但是要注意检查输出的标识。积极的优化可能导致不同的输出，这是您一定要避免的。

Good Luck!

好运！

#1

Edit summary

My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example, pow(x, 0.1e1/0.3e1) is the same as cbrt(x).
我最初的回答只是指出，代码包含了大量的复制计算，而且很多幂都包含了1/3的因数。例如，pow(x, 0.1e1/0.3e1)与cbrt(x)相同。
My second edit was just wrong, and and my third extrapolated on this wrongness. This is what makes people afraid to change the oracle-like results from symbolic math programs that start with the letter 'M'. I've stricken out (i.e., ~~strike~~ ) those edits and pushed them to the bottom of the current revision of this answer. However, I did not delete them. I'm human. It's easy for us to make a mistake.
我的第二次编辑是错误的，第三次推断是错误的。这就是为什么人们害怕改变从字母“M”开始的符号数学程序的结果。(也就是我的。这些编辑并将它们推到当前对这个答案的修订的底部。但是，我没有删除它们。我是人类。我们很容易犯错误。
My fourth edit developed a very compact expression that correctly represents the convoluted expression in the question IF the parameters l1, l2, and l3 are positive real numbers and if a is a non-zero real number. (We have yet to hear from the OP regarding the specific nature of these coefficients. Given the nature of the problem, these are reasonable assumptions.)
我的第四次编辑开发了一个非常紧凑的表达式，如果参数l1、l2和l3是正数，如果a是非零实数，那么它就可以正确地表示问题中的卷积表达式。(我们还没有从OP中得知这些系数的具体性质。鉴于问题的本质，这些都是合理的假设。
This edit attempts to answer the generic problem of how to simplify these expressions.
此编辑尝试回答如何简化这些表达式的一般问题。

First things first

I use Maple to generate the C++ code to avoid mistakes.

我使用Maple生成c++代码以避免错误。

How to simplify those big messes from symbolic math programs (this edit)

About the specific question

你在计算中做了大量的加减运算。如果你有一些几乎互相抵消的项，你就会有大麻烦。如果你有一个术语支配着其他术语，你就浪费了大量的CPU资源。

最后，您正在对pow进行大量的调用，这是非常缓慢的。注意，其中有几个调用是形式(l1*l2*l3)(1/3)。许多对pow的调用可以通过一个对std::cbrt的调用来执行:

l123 = l1 * l2 * l3;
l123_pow_1_3 = std::cbrt(l123);
l123_pow_4_3 = l123 * l123_pow_1_3;

With this,

用这个,

X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) becomes X * l123_pow_1_3.
X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1)变成X * l123_pow_1_3。
X * pow(l1 * l2 * l3, -0.1e1 / 0.3e1) becomes X / l123_pow_1_3.
X * pow(l1 * l2 * l3， -0.1e1 / 0.3e1)变为X / l123_pow_1_3。
X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1) becomes X * l123_pow_4_3.
X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1)变为X * l123_pow_4_3。
X * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) becomes X / l123_pow_4_3.
X * pow(l1 * l2 * l3， -0.4e1 / 0.3e1)变为X / l123_pow_4_3。

Maple did miss the obvious.
For example, there's a much easier way to write

枫树确实没注意到。例如，有一种更简单的书写方式

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principle complex root) are to be extracted, the above reduces to

假设l1、l2和l3都是实数而不是复数，并且要提取真正的立方体根(而不是主复数根)，那么上面的式子可以简化为

2.0/(3.0 * pow(l1 * l2 * l3, 1.0/3.0))

或

2.0/(3.0 * l123_pow_1_3)

Using cbrt_l123 instead of l123_pow_1_3, the nasty expression in the question reduces to

使用cbrt_l123而不是l123_pow_1_3，这个问题中令人讨厌的表达式变为

l123 = l1 * l2 * l3; 
cbrt_l123 = cbrt(l123);
T = 
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Always double check, but always simplify as well.

总是重复检查，但也要简化。

Here are some of my steps in arriving at the above:

以下是我实现上述目标的一些步骤:

// Step 0: Trim all whitespace.
T=(mu*(pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1+pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l2-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1+pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l3)/a+K*(l1*l2*l3-0.1e1)*l1*l2)*N3/l1/l2;

// Step 1:
//   l1*l2*l3 -> l123
//   0.1e1 -> 1.0
//   0.4e1 -> 4.0
//   0.3e1 -> 3
l123 = l1 * l2 * l3;
T=(mu*(pow(l1*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l1-pow(l2*pow(l123,-1.0/3),a)*a/l1/3-pow(l3*pow(l123,-1.0/3),a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l2/3+pow(l2*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l2-pow(l3*pow(l123,-1.0/3),a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l3/3-pow(l2*pow(l123,-1.0/3),a)*a/l3/3+pow(l3*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 2:
//   pow(l123,1.0/3) -> cbrt_l123
//   l123*pow(l123,-4.0/3) -> pow(l123,-1.0/3)
//   (pow(l123,-1.0/3)-pow(l123,-1.0/3)/3) -> 2.0/(3.0*cbrt_l123)
//   *pow(l123,-1.0/3) -> /cbrt_l123
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T=(mu*(pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1-pow(l2/cbrt_l123,a)*a/l1/3-pow(l3/cbrt_l123,a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l2/3+pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2-pow(l3/cbrt_l123,a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l3/3-pow(l2/cbrt_l123,a)*a/l3/3+pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 3:
//   Whitespace is nice.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)*a/l1/3
       -pow(l3/cbrt_l123,a)*a/l1/3)/a
   +K*(l123-1.0)*l2*l3)*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l2/3
       +pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)*a/l2/3)/a
   +K*(l123-1.0)*l1*l3)*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l3/3
       -pow(l2/cbrt_l123,a)*a/l3/3
       +pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a
   +K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 4:
//   Eliminate the 'a' in (term1*a + term2*a + term3*a)/a
//   Expand (mu_term + K_term)*something to mu_term*something + K_term*something
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +K*(l123-1.0)*l2*l3*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +K*(l123-1.0)*l1*l3*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l3))*N3/l1/l2
 +K*(l123-1.0)*l1*l2*N3/l1/l2;

// Step 5:
//   Rearrange
//   Reduce l2*l3*N1/l2/l3 to N1 (and similar)
//   Reduce 2.0/(3.0*cbrt_l123)*cbrt_l123/l1 to 2.0/3.0/l1 (and similar)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/3.0/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/3.0/l3))*N3/l1/l2
 +K*(l123-1.0)*N1
 +K*(l123-1.0)*N2
 +K*(l123-1.0)*N3;

// Step 6:
//   Factor out mu and K*(l123-1.0)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*(  ( pow(l1/cbrt_l123,a)*2.0/3.0/l1
         -pow(l2/cbrt_l123,a)/l1/3
         -pow(l3/cbrt_l123,a)/l1/3)*N1/l2/l3
      + (-pow(l1/cbrt_l123,a)/l2/3
         +pow(l2/cbrt_l123,a)*2.0/3.0/l2
         -pow(l3/cbrt_l123,a)/l2/3)*N2/l1/l3
      + (-pow(l1/cbrt_l123,a)/l3/3
         -pow(l2/cbrt_l123,a)/l3/3
         +pow(l3/cbrt_l123,a)*2.0/3.0/l3)*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 7:
//   Expand
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1*N1/l2/l3
      -pow(l2/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l3/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l1/cbrt_l123,a)/l2/3*N2/l1/l3
      +pow(l2/cbrt_l123,a)*2.0/3.0/l2*N2/l1/l3
      -pow(l3/cbrt_l123,a)/l2/3*N2/l1/l3
      -pow(l1/cbrt_l123,a)/l3/3*N3/l1/l2
      -pow(l2/cbrt_l123,a)/l3/3*N3/l1/l2
      +pow(l3/cbrt_l123,a)*2.0/3.0/l3*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 8:
//   Simplify.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Wrong answer, intentionally kept for humility

Note that this is stricken. It's wrong.

注意，这是受到了打击。这是错误的。

~~Update~~

更新

Maple did miss the obvious. For example, there's a much easier way to write

枫树确实没注意到。例如，有一种更简单的书写方式

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

(pow(l1 * l2 * l3， -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3， -0.4e1 / 0.3e1) / 0.3e1)

假设l1、l2和l3是实数，而不是复数，而实际的立方根(而不是主复根)将被提取，上述值将减少为零。这个零的计算重复了很多次。

Second update

第二次更新

If I've done the math right (there is no guarantee that I've done the math right), the nasty expression in the question reduces to

如果我做对了数学(不能保证我做对了数学)，那么问题中令人讨厌的表达就变成了

l123 = l1 * l2 * l3; 
cbrt_l123_inv = 1.0 / cbrt(l123);
nasty_expression =
    K * (l123 - 1.0) * (N1 + N2 + N3) 
    - (  pow(l1 * cbrt_l123_inv, a) * (N2 + N3) 
       + pow(l2 * cbrt_l123_inv, a) * (N1 + N3) 
       + pow(l3 * cbrt_l123_inv, a) * (N1 + N2)) * mu / (3.0*l123);

~~The above assumes that l1, l2, and l3 are positive real numbers.~~

上面假设l1、l2和l3是正实数。

#2

 const double c1 = pow(l1 * l2 * l3, -0.1e1 / 0.3e1);
const double c2 = boost::math::pow<4>(c1);

where I am using the boost pow function.

我使用的是boost pow函数。

最后，如果您使用g++，您可以使用-ffast-math标记，它允许优化器在转换方程时更加积极。请阅读这面旗帜的实际作用，因为它有副作用。

#3

Woah, what a hell of an expression. Creating the expression with Maple actually was a suboptimal choice here. The result is simply unreadable.

哇，真是个该死的表达。在这里，用Maple创建表达式实际上是一个不太理想的选择。结果是无法读懂的。

chose speaking variable names (not l1, l2, l3, but e.g. height, width, depth, if that is what they mean). Then it is easier for you to understand your own code.
选择说变量名(不是l1, l2, l3，而是高度，宽度，深度，如果这是它们的意思的话)。这样您就更容易理解自己的代码。
calculate subterms, that you use multiple times, upfront and store the results in variables with speaking names.
先计算多次使用的子项，并将结果存储在带有发音名称的变量中。
You mention, that the expression is evaluated very many times. I guess, only few parameters vary in the inner most loop. Calculate all invariant subterms before that loop. Repeat for the second inner loop and so on until all invariants are outside the loop.
您提到，表达式被计算了很多次。我猜，在最内部的循环中只有很少的参数变化。在循环之前计算所有不变子项。重复第二个内部循环，直到所有不变量都在循环之外。

#4

David Hammen's answer is good, but still far from optimal. Let's continue with his last expression (at the time of writing this)

大卫·哈曼的答案是好的，但仍远不是最佳答案。让我们继续他的最后一个表达(在写这篇文章的时候)

auto l123 = l1 * l2 * l3;
auto cbrt_l123 = cbrt(l123);
T = mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                   + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                   + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
  + K*(l123-1.0)*(N1+N2+N3);

which can be optimised further. In particular, we can avoid the call to cbrt() and one of the calls to pow() if exploiting some mathematical identities. Let's do this again step by step.

可以进一步优化。特别是，如果利用一些数学恒等式，我们可以避免调用cbrt()和调用pow()。让我们一步一步地再做一次。

// step 1 eliminate cbrt() by taking the exponent into pow()
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a; // avoid division
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)*pow(l1*l1/(l2*l3),athird)
                   + (N2+N2-N3-N1)*pow(l2*l2/(l1*l3),athird)
                   + (N3+N3-N1-N2)*pow(l3*l3/(l1*l2),athird))
  + K*(l123-1.0)*(N1+N2+N3);

Note that I have also optimised 2.0*N1 to N1+N1 etc. Next, we can do with only two calls to pow().

注意，我还优化了2.0*N1到N1+N1等。接下来，我们只需要对pow()进行两次调用。

// step 2  eliminate one call to pow
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a;
auto pow_l1l2_athird = pow(l1/l2,athird);
auto pow_l1l3_athird = pow(l1/l3,athird);
auto pow_l2l3_athird = pow_l1l3_athird/pow_l1l2_athird;
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)* pow_l1l2_athird*pow_l1l3_athird
                   + (N2+N2-N3-N1)* pow_l2l3_athird/pow_l1l2_athird
                   + (N3+N3-N1-N2)/(pow_l1l3_athird*pow_l2l3_athird))
  + K*(l123-1.0)*(N1+N2+N3);

Since the calls to pow() are by far the most costly operation here, it is worth to reduce them as far as possible (the next costly operation was the call to cbrt(), which we eliminated).

由于对pow()的调用是到目前为止最昂贵的操作，所以尽可能地减少它们是值得的(下一个代价高昂的操作是对cbrt()的调用，我们删除了它)。

#5

How many is "many many"?
多少是“许多”?
How long does it take?
要花多长时间?
Do ALL parameters change between recalculation of this formula? Or can you cache some pre-calculated values?
在重新计算这个公式时，所有的参数都改变了吗?或者可以缓存一些预先计算的值?

I've attempted a manual simplification of that formula, would like to know if it saves anything?

我已经尝试过手动化简这个公式，想知道它能节省什么吗?

C1 = -0.1e1 / 0.3e1;
C2 =  0.1e1 / 0.3e1;
C3 = -0.4e1 / 0.3e1;

X0 = l1 * l2 * l3;
X1 = pow(X0, C1);
X2 = pow(X0, C2);
X3 = pow(X0, C3);
X4 = pow(l1 * X1, a);
X5 = pow(l2 * X1, a);
X6 = pow(l3 * X1, a);
X7 = a / 0.3e1;
X8 = X3 / 0.3e1;
X9 = mu / a;
XA = X0 - 0.1e1;
XB = K * XA;
XC = X1 - X0 * X8;
XD = a * XC * X2;

XE = X4 * X7;
XF = X5 * X7;
XG = X6 * X7;

T = (X9 * ( X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (-XE + X5 * XD - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (-XE - XF + X6 * XD) / l3 + XB * l1 * l2) * N3 / l1 / l2;

[ADDED] I've worked some more on the last three-lines formula and got it down to this beauty:

[补充]我在最后的三行公式上做了更多的工作，并将其归结为这样的美丽:

T = X9 / X0 * (
      (X4 * XD - XF - XG) * N1 + 
      (X5 * XD - XE - XG) * N2 + 
      (X5 * XD - XE - XF) * N3)
  + XB * (N1 + N2 + N3)

Let me show my work, step by step:

让我一步一步地展示我的工作:

T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (X5 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / l1 / l2;


T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / (l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / (l1 * l3) 
  + (X9 * (X5 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / (l1 * l2);

T = (X9 * (X4 * XD - XF - XG) + XB * l1 * l2 * l3) * N1 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) + XB * l1 * l2 * l3) * N2 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XF) + XB * l1 * l2 * l3) * N3 / (l1 * l2 * l3);

T = (X9 * (X4 * XD - XF - XG) + XB * X0) * N1 / X0 
  + (X9 * (X5 * XD - XE - XG) + XB * X0) * N2 / X0 
  + (X9 * (X5 * XD - XE - XF) + XB * X0) * N3 / X0;

T = X9 * (X4 * XD - XF - XG) * N1 / X0 + XB * N1 
  + X9 * (X5 * XD - XE - XG) * N2 / X0 + XB * N2
  + X9 * (X5 * XD - XE - XF) * N3 / X0 + XB * N3;


T = X9 * (X4 * XD - XF - XG) * N1 / X0 
  + X9 * (X5 * XD - XE - XG) * N2 / X0
  + X9 * (X5 * XD - XE - XF) * N3 / X0
  + XB * (N1 + N2 + N3)

#6

#7

It looks like you have a lot of repeated operations going on.

看起来你有很多重复的操作。

pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
pow(l1 * l2 * l3, -0.4e1 / 0.3e1)

You could pre-calculate those so you are not repeatedly calling the pow function which can be expensive.

你可以预先计算它们，这样你就不会反复调用pow函数，这会很昂贵。

You could also pre-calutate

你也可以pre-calutate

l1 * l2 * l3

as you use that term repeatedly.

当你反复使用这个词的时候。

#8

If you have a Nvidia CUDA graphics card, you could consider offloading the calculations to the graphics card - which itself is more suitable for computationally complicated calculations.

如果你有一个Nvidia CUDA显卡，你可以考虑将计算转移到显卡上——显卡本身更适合计算复杂的计算。

https://developer.nvidia.com/how-to-cuda-c-cpp

If not, you may want to consider multiple threads for calculations.

如果不是，您可能需要考虑使用多个线程进行计算。

#9

无论如何，你能象征性地给出计算结果吗?如果有矢量操作，您可能真的想使用blas或lapack进行研究，在某些情况下，它们可以并行运行操作。

It is conceivable (at the risk of being off-topic?) that you might be able to use python with numpy and/or scipy. To the extent that it was possible, your calculations might be more readable.

可以想象(冒着偏离主题的风险?)您可能可以使用带有numpy和/或scipy的python。在可能的范围内，您的计算可能更容易读懂。

#10

GNU compiler collection is free, flexible, and available on many architectures
GNU编译器集合是免费的、灵活的，并且可以在许多架构上使用。
Intel compilers are very fast, very expensive, and may also produce good results for AMD architectures (I believe there is an academic program)
英特尔编译器非常快速，非常昂贵，并且可能为AMD架构产生良好的结果(我相信有一个学术程序)
Clang compilers are fast, free, and might produce similar results to GCC (some people say they are faster, better, but this might differ for each application case, I suggest to make your own experiences)
Clang编译器是快速、免费的，并且可能会产生与GCC类似的结果(有些人说它们更快、更好，但是对于每个应用程序，这可能有所不同，我建议您自己体验一下)
PGI (Portland Group) is not free as the Intel compilers.
PGI(波特兰集团)不是免费的英特尔编译器。
PathScale compilers might perform good results on AMD architectures
PathScale编译器可能在AMD架构上执行良好的结果

Good Luck!

好运！

秒客网

在使用c++实现长方程时，如何通过高级方法提高性能

10 个解决方案

#1

Edit summary

First things first

How to simplify those big messes from symbolic math programs (this edit)

About the specific question

Wrong answer, intentionally kept for humility

#2

#3

#4

#5

#6

#7

#8

#9

#10

#1

Edit summary

First things first

How to simplify those big messes from symbolic math programs (this edit)

About the specific question

Wrong answer, intentionally kept for humility

#2

#3

#4

#5

#6

#7

#8

#9

#10

相关文章