With GCC 5.3, the following code compiled with -O3 -mfma
float mul_add(float a, float b, float c) {
return a*b + c;
}
produces the following assembly:
vfmadd132ss %xmm1, %xmm2, %xmm0
ret
I noticed GCC already doing this with -O3 in GCC 4.8.
Clang 3.7 with -O3 -mfma produces
vmulss %xmm1, %xmm0, %xmm0
vaddss %xmm2, %xmm0, %xmm0
retq
but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3.
I am surprised that GCC does this with -O3, because this answer says:
The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.
This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.
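To make the quoted rounding argument concrete, here is a minimal sketch in C. The function names and the input values are my own, chosen purely for illustration, and the sketch assumes IEEE-754 binary32 float, binary64 double, and round-to-nearest-even. The first function forces the product to be rounded to float before the add (the volatile store keeps the compiler from fusing the two operations). The second keeps the product exact by widening to double, which mimics the single rounding of an FMA for these inputs; in general the final double-to-float conversion can round a second time, so it is not a replacement for fmaf.

```c
/* Two roundings: the product is rounded to float, then the sum is rounded. */
float mul_add_two_roundings(float a, float b, float c) {
    volatile float product = a * b;  /* volatile store forces rounding to float */
    return product + c;
}

/* The product of two floats is exact in double (24 + 24 significand bits
   fit in 53), so the product itself is never rounded here. Only the sum
   and the final conversion to float can round. */
float mul_add_wide_product(float a, float b, float c) {
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 0x1.001p0f (that is, 1 + 2^-12) and c = -0x1.002p0f, the exact product is 1 + 2^-11 + 2^-24. Rounding it to float drops the 2^-24 term (a tie, resolved to even), so the first function returns 0 while the second returns 2^-24.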
However, this link says:
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.
So now I am confused and concerned.
- Is GCC justified in using FMA with -O3?
- Does fusing violate strict IEEE floating-point behaviour?
- If fusing does violate IEEE floating-point behaviour, and since GCC defines __STDC_IEC_559__, isn't this a contradiction?
Since FMA can be emulated in software, it seems to me there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.
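As a rough illustration of the "hardware has FMA" half of that idea: GCC and Clang predefine the macro __FMA__ when FMA code generation is enabled (for example by -mfma), and C99's <math.h> may define FP_FAST_FMAF when fmaf is roughly as fast as a separate multiply and add. The helper names below are hypothetical, and this is only a sketch of how code could inspect those two signals at compile time. Neither macro says whether the compiler will actually contract a*b + c.

```c
#include <math.h>

/* Returns 1 if the compiler was told it may emit x86 FMA instructions
   (GCC and Clang predefine __FMA__ in that case), otherwise 0. */
int fma_codegen_enabled(void) {
#ifdef __FMA__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if <math.h> advertises fmaf() as generally being as fast as
   a multiply followed by an add (optional C99 macro), otherwise 0. */
int fmaf_is_fast(void) {
#ifdef FP_FAST_FMAF
    return 1;
#else
    return 0;
#endif
}
```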
Apparently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast; with Clang it is not. The other settings, -ffp-contract=on and -ffp-contract=off, do not produce the FMA instruction.
For example, Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.
I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. In all cases I also used -O3 -mfma.
With GCC the answer is simple: #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.
GCC uses fma
- with -ffp-contract=fast (default).
Clang uses fma
- with -ffp-contract=fast.
- with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).
In other words, with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) sets -ffp-contract=fast.
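For reference, the pragma variant described above could look like the following sketch. The function name is mine; the pragma is placed at the start of the function body, where standard C allows it and where it applies only to that compound statement. Going by the results above, Clang 3.7 with -O3 -mfma would be expected to fuse here, while GCC 5.3 ignores the pragma and follows -ffp-contract only.

```c
/* Contraction is explicitly allowed inside this function body only. */
float mul_add_contract_on(float a, float b, float c) {
#pragma STDC FP_CONTRACT ON
    return a * b + c;  /* the compiler may (but need not) emit an FMA here */
}
```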
I looked into MSVC and ICC.
MSVC uses the fma instruction with /O2 /arch:AVX2 /fp:fast. With MSVC, /fp:precise is the default.
ICC uses fma with -O3 -march=core-avx2 (actually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC, use -fp-model strict or -no-fma.
So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 for ICC), but Clang and MSVC do not.
2 Answers
#1
3
It doesn't violate IEEE-754, because IEEE-754 defers to languages on this point:
A language standard should also define, and require implementations to provide, attributes that allow and disallow value-changing optimizations, separately or collectively, for a block. These optimizations might include, but are not limited to:
...
― Synthesis of a fusedMultiplyAdd operation from a multiplication and an addition.
In standard C, the STDC FP_CONTRACT pragma provides the means to control this value-changing optimization. So GCC is licensed to perform the fusion by default, so long as it allows you to disable the optimization by setting STDC FP_CONTRACT OFF. Not supporting that means not adhering to the C standard.
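A minimal sketch of what disabling the optimization looks like in source, with a function name of my own choosing. The pragma is placed at file scope, so it stays in effect until another FP_CONTRACT pragma or the end of the translation unit. Note that the question reports GCC 5.3 ignoring this pragma, so with that compiler the equivalent request would have to be made with -ffp-contract=off instead.

```c
/* Disallow contraction from here to the end of this translation unit. */
#pragma STDC FP_CONTRACT OFF

float mul_add_strict(float a, float b, float c) {
    /* A compiler honoring the pragma must round a*b before adding c. */
    return a * b + c;
}
```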
#2
4
When you quoted that fused multiply-add is allowed, you left out the important condition "unless pragma FP_CONTRACT is off". This is a newish feature in C (I think introduced in C99) and was made absolutely necessary by PowerPC processors, which all had fused multiply-add from the start. In fact, x*y was equivalent to fma(x, y, 0) and x+y was equivalent to fma(1.0, x, y).
FP_CONTRACT is what controls fused multiply-add, not FLT_EVAL_METHOD. That said, if FLT_EVAL_METHOD allows higher precision, then contracting is always legal: just pretend that the operations were performed with very high precision and then rounded.
The fma function is useful if you want the precision rather than the speed. It will calculate the contracted result slowly but correctly even if FMA isn't available in hardware, and it should be inlined if FMA is available in hardware.