It is nearly impossible(*) to provide strict IEEE 754 semantics at reasonable cost when the only floating-point instructions one is allowed to use are the 387 ones. It is particularly hard when one wishes to keep the FPU working on the full 64-bit significand so that the long double type is available for extended precision. The usual “solution” is to do intermediate computations at the only available precision, and to convert to the lower precision at more or less well-defined occasions.
Recent versions of GCC handle excess precision in intermediate computations according to the interpretation laid out by Joseph S. Myers in a 2008 post to the GCC mailing list. This description makes a program compiled with gcc -std=c99 -mno-sse2 -mfpmath=387 completely predictable, to the last bit, as far as I understand. And if by chance it isn't, it is a bug and it will be fixed: Joseph S. Myers' stated intention in his post is to make it predictable.
Is it documented how Clang handles excess precision (say when the option -mno-sse2 is used), and where?
(*) EDIT: this is an exaggeration. It is slightly annoying but not that difficult to emulate binary64 when one is allowed to configure the x87 FPU to use a 53-bit significand.
Following a comment by R.. below, here is the log of a short interaction of mine with the most recent version of Clang I have:
Hexa:~ $ clang -v
Apple clang version 4.1 (tags/Apple/clang-421.11.66) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.4.0
Thread model: posix
Hexa:~ $ cat fem.c
#include <stdio.h>
#include <math.h>
#include <float.h>
#include <fenv.h>

double x;
double y = 2.0;
double z = 1.0;

int main(){
  x = y + z;
  printf("%d\n", (int) FLT_EVAL_METHOD);
}
Hexa:~ $ clang -std=c99 -mno-sse2 fem.c
Hexa:~ $ ./a.out
0
Hexa:~ $ clang -std=c99 -mno-sse2 -S fem.c
Hexa:~ $ cat fem.s
…
movl $0, %esi
fldl _y(%rip)
fldl _z(%rip)
faddp %st(1)
movq _x@GOTPCREL(%rip), %rax
fstpl (%rax)
…
2 Answers
#1
15
This does not answer the originally posed question, but if you are a programmer working with similar issues, this answer might help you.
I really don't see where the perceived difficulty is. Providing strict IEEE-754 binary64 semantics while being limited to 80387 floating-point math, and retaining 80-bit long double computation, seems to follow well-specified C99 casting rules with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).
Edited to add: Yet, Pascal Cuoq is correct: neither gcc-4.6.3 nor clang-llvm-3.0 actually enforces those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are correctly applied to expressions evaluated at compile time, but not to run-time expressions. There are workarounds, listed after the break below.
I do molecular dynamics simulation code, and am very familiar with the repeatability/predictability requirements and also with the desire to retain maximum precision available when possible, so I do claim I know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of or not using those tools.
(An example I like is the Kahan summation algorithm. With C99 and proper casting (adding casts to e.g. the Wikipedia example code), no tricks or extra temporary variables are needed at all. The implementation works regardless of compiler optimization level, including at -O3 and -Ofast.)
C99 explicitly states (in e.g. 5.2.4.2.2) that casting and assignment both remove all extra range and precision. This means that you can use long double arithmetic by defining your temporary variables used during computation as long double, also casting your input variables to that type; whenever an IEEE-754 binary64 is needed, just cast to a double.
On '387, the cast generates a store and a load on both the above compilers; this does correctly round the 80-bit value to IEEE-754 binary64. This cost is very reasonable in my opinion. The exact time taken depends on the architecture and surrounding code; usually it can be interleaved with other code to bring the cost down to negligible levels. When SSE or AVX are available, their registers are separate from the 80-bit 80387 registers, and the cast usually is done by moving the value to an SSE/AVX register.
(I prefer production code to use a specific floating-point type, say tempdouble or such, for temporary variables, so that it can be defined as either double or long double depending on the architecture and the speed/precision tradeoffs desired.)
In a nutshell:
Don't assume (expression) is of double precision just because all the variables and literal constants are. Write it as (double)(expression) if you want the result at double precision.
This applies to compound expressions, too, and may sometimes lead to unwieldy expressions with many levels of casts.
If you have expr1 and expr2 that you wish to compute at 80-bit precision, but also need the product of each rounded to 64-bit first, use
long double expr1;
long double expr2;
double product = (double)(expr1) * (double)(expr2);
Note that product is computed as the product of two 64-bit values; it is not computed at 80-bit precision and then rounded down. Calculating the product at 80-bit precision, then rounding down, would be
double other = expr1 * expr2;
or, adding descriptive casts that tell you exactly what is happening,
double other = (double)((long double)(expr1) * (long double)(expr2));
It should be obvious that product and other often differ.
The C99 casting rules are just another tool you must learn to wield, if you do work with mixed 32-bit/64-bit/80-bit/128-bit floating point values. Really, you encounter the exact same issues if you mix binary32 and binary64 floats (float and double on most architectures)!
Perhaps rewriting Pascal Cuoq's exploration code, to correctly apply casting rules, makes this clearer?
#include <stdio.h>

#define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")

int main(void)
{
    double d = 1.0 / 10.0;
    long double ld = 1.0L / 10.0L;

    printf("sizeof (double) = %d\n", (int)sizeof (double));
    printf("sizeof (long double) == %d\n", (int)sizeof (long double));

    printf("\nExpect true:\n");
    TEST(d == (double)(0.1));
    TEST(ld == (long double)(0.1L));
    TEST(d == (double)(1.0 / 10.0));
    TEST(ld == (long double)(1.0L / 10.0L));
    TEST(d == (double)(ld));
    TEST((double)(1.0L/10.0L) == (double)(0.1));
    TEST((long double)(1.0L/10.0L) == (long double)(0.1L));

    printf("\nExpect false:\n");
    TEST(d == ld);
    TEST((long double)(d) == ld);
    TEST(d == 0.1L);
    TEST(ld == 0.1);
    TEST(d == (long double)(1.0L / 10.0L));
    TEST(ld == (double)(1.0L / 10.0));

    return 0;
}
The output, with both GCC and clang, is
sizeof (double) = 8
sizeof (long double) == 12
Expect true:
d == (double)(0.1): true
ld == (long double)(0.1L): true
d == (double)(1.0 / 10.0): true
ld == (long double)(1.0L / 10.0L): true
d == (double)(ld): true
(double)(1.0L/10.0L) == (double)(0.1): true
(long double)(1.0L/10.0L) == (long double)(0.1L): true
Expect false:
d == ld: false
(long double)(d) == ld: false
d == 0.1L: false
ld == 0.1: false
d == (long double)(1.0L / 10.0L): false
ld == (double)(1.0L / 10.0): false
except that recent versions of GCC promote the right-hand side of ld == 0.1 to long double first (i.e. to ld == 0.1L), yielding true, and that with SSE/AVX, long double is 128 bits in size.
For the pure '387 tests, I used
gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
with various optimization flag combinations as ..., including -fomit-frame-pointer, -O0, -O1, -O2, -O3, and -Os.
Using any other flags or C99 compilers should lead to the same results, except for the long double size (and ld == 0.1 for current GCC versions). If you encounter any differences, I'd be very grateful to hear about them; I may need to warn my users of such compilers (compiler versions). Note that Microsoft does not support C99, so they are completely uninteresting to me.
Pascal Cuoq does bring up an interesting problem in the comment chain below, which I didn't immediately recognize.
When evaluating an expression, both GCC and clang with -mfpmath=387 specify that all expressions are evaluated using 80-bit precision. This leads to, for example,
7491907632491941888 = 0x1.9fe2693112e14p+62 = 110011111111000100110100100110001000100101110000101000000000000
5698883734965350400 = 0x1.3c5a02407b71cp+62 = 100111100010110100000001001000000011110110111000111000000000000
7491907632491941888 * 5698883734965350400 = 42695510550671093541385598890357555200 = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000
yielding incorrect results, because that string of ones in the middle of the binary result is just at the difference between 53- and 64-bit mantissas (64 and 80-bit floating point numbers, respectively). So, while the expected result is
42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125 = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000
the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387 is
42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125 = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000
In theory, you should be able to tell gcc and clang to enforce the correct C99 rounding rules by using options
-std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard
However, this only affects expressions the compiler optimizes, and does not seem to fix the 387 handling at all. If you use e.g. clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test with test.c being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules -- but only because the compiler optimizes away the expression, not using the 387 at all.
Simply put, instead of computing
(double)d1 * (double)d2
both gcc and clang actually tell the '387 to compute
(double)((long double)d1 * (long double)d2)
I believe this is indeed a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2 means operations on double-precision arguments are always done at extended precision, but I cannot see any sane reason -- aside from having to rewrite parts of libm on '387 -- to do that in C99, considering that IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And, given that the compiler options that should force this behaviour -- -std=c99 -ffloat-store -fexcess-precision=standard -- make no sense if FLT_EVAL_METHOD=2 behaviour is actually desired, there are no backwards compatibility issues, either.) It is important to note that given the proper compiler flags, expressions evaluated at compile time do get evaluated correctly, and that only expressions evaluated at run time get incorrect results.
The simplest workaround, and the portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results towards zero.
In some cases, rounding towards zero may help with predictability and pathological cases. In particular, for intervals like x = [0,1), rounding towards zero means the upper limit is never reached through rounding; this is important if you evaluate e.g. piecewise splines.
For the other rounding modes, you need to control the 387 hardware directly.
You can use either _FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

#define FP387_NEAREST   0x0000
#define FP387_ZERO      0x0C00
#define FP387_UP        0x0800
#define FP387_DOWN      0x0400

#define FP387_SINGLE    0x0000
#define FP387_DOUBLE    0x0200
#define FP387_EXTENDED  0x0300

static inline void fp387(const unsigned short control)
{
    unsigned short cw = (control & 0x0F00) | 0x007f;
    __asm__ volatile ("fldcw %0" : : "m" (*&cw));
}

const char *bits(const double value)
{
    const unsigned char *const data = (const unsigned char *)&value;
    static char buffer[CHAR_BIT * sizeof value + 1];
    char *p = buffer;
    size_t i = CHAR_BIT * sizeof value;
    while (i-- > 0)
        *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
    *p = '\0';
    return (const char *)buffer;
}

int main(int argc, char *argv[])
{
    double d1, d2;
    char dummy;

    if (argc != 3) {
        fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
        return EXIT_FAILURE;
    }

    if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[2]);
        return EXIT_FAILURE;
    }

    printf("%s:\td1 = %.0f\n\t %s in binary\n", argv[1], d1, bits(d1));
    printf("%s:\td2 = %.0f\n\t %s in binary\n", argv[2], d2, bits(d2));

    printf("\nDefaults:\n");
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to nearest integer:\n");
    fp387(FP387_EXTENDED | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to nearest integer:\n");
    fp387(FP387_DOUBLE | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to zero:\n");
    fp387(FP387_EXTENDED | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to zero:\n");
    fp387(FP387_DOUBLE | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    return 0;
}
Using clang-llvm-3.0 to compile and run, I get the correct results,
clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
./precision 7491907632491941888 5698883734965350400
7491907632491941888: d1 = 7491907632491941888
0100001111011001111111100010011010010011000100010010111000010100 in binary
5698883734965350400: d2 = 5698883734965350400
0100001111010011110001011010000000100100000001111011011100011100 in binary
Defaults:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Extended precision, rounding to nearest integer:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Double precision, rounding to nearest integer:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Extended precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Double precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
In other words, you can work around the compiler issues by using fp387() to set the precision and rounding mode.
The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library fpu_control.h on x86_64 has the comment "libm requires extended precision". Fortunately, you can take the '387 implementations from e.g. the GNU C library, and implement them in a header file or write a known-to-work libm, if you need the math.h functionality; in fact, I think I might be able to help there.
#2
5
For the record, below is what I found by experimentation. The following program shows various behaviors when compiled with Clang:
#include <stdio.h>

int r1, r2, r3, r4, r5, r6, r7;

double ten = 10.0;

int main(int c, char **v)
{
  r1 = 0.1 == (1.0 / ten);
  r2 = 0.1 == (1.0 / 10.0);
  r3 = 0.1 == (double) (1.0 / ten);
  r4 = 0.1 == (double) (1.0 / 10.0);
  ten = 10.0;
  r5 = 0.1 == (1.0 / ten);
  r6 = 0.1 == (double) (1.0 / ten);
  r7 = ((double) 0.1) == (1.0 / 10.0);
  printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
}
The results vary with the optimization level:
$ clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
$ clang -mno-sse2 -std=c99 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=0 r7=1
$ clang -mno-sse2 -std=c99 -O2 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=1 r7=1
The cast (double) that differentiates r5 and r6 at -O0 has no effect at -O2 and for variables r3 and r4. The result r1 is different from r5 at all optimization levels, whereas r6 only differs from r3 at -O2.
#1
15
This does not answer the originally posed question, but if you are a programmer working with similar issues, this answer might help you.
这并不能回答最初提出的问题,但是如果您是处理类似问题的程序员,这个答案可能会对您有所帮助。
I really don't see where the perceived difficulty is. Providing strict IEEE-754 binary64 semantics while being limited to 80387 floating-point math, and retaining 80-bit long double computation, seems to follow well-specified C99 casting rules with both GCC-4.6.3 and clang-3.0 (based on LLVM 3.0).
我真的看不出感知到的困难在哪里。提供严格的IEEE-754 binary64语义,同时限制为80387浮点数,并保持80位长双计算,似乎遵循了GCC-4.6.3和clang-3.0(基于LLVM 3.0)指定良好的C99强制转换规则。
Edited to add: Yet, Pascal Cuoq is correct: neither gcc-4.6.3 or clang-llvm-3.0 actually enforce those rules correctly for '387 floating-point math. Given the proper compiler options, the rules are correctly applied to expressions evaluated at compile time, but not for run-time expressions. There are workarounds, listed after the break below.
不过,Pascal Cuoq是对的:gcc-4.6.3或clang-llvm-3.0实际上都没有对“387浮点数”正确执行这些规则。给定正确的编译器选项,规则可以正确地应用于在编译时计算的表达式,而不是用于运行时表达式。有一些变通的办法,列在下面的休息之后。
I do molecular dynamics simulation code, and am very familiar with the repeatability/predictability requirements and also with the desire to retain maximum precision available when possible, so I do claim I know what I am talking about here. This answer should show that the tools exist and are simple to use; the problems arise from not being aware of or not using those tools.
我做分子动力学模拟代码,并且非常熟悉可重复性/可预测性的要求,也希望尽可能保持最大的可用精度,所以我断言我知道我在这里说的是什么。这个答案应该表明这些工具是存在的,并且易于使用;问题产生于没有意识到或者没有使用这些工具。
(A preferred example I like, is the Kahan summation algorithm. With C99 and proper casting (adding casts to e.g. Wikipedia example code), no tricks or extra temporary variables are needed at all. The implementation works regardless of compiler optimization level, including at -O3
and -Ofast
.)
(我比较喜欢的一个例子是Kahan求和算法。使用C99和适当的强制转换(向*示例代码添加强制转换),根本不需要任何技巧或额外的临时变量。不管编译器优化级别如何(包括-O3和-Ofast),实现都可以工作。
C99 explicitly states (in e.g. 5.4.2.2) that casting and assignment both remove all extra range and precision. This means that you can use long double
arithmetic by defining your temporary variables used during computation as long double
, also casting your input variables to that type; whenever a IEEE-754 binary64 is needed, just cast to a double
.
C99明确指出(如5.4.2.2)铸造和赋值都去除所有额外的范围和精度。这意味着您可以通过定义计算期间使用的临时变量long double来使用长双精度运算,还可以将输入变量转换为该类型;每当需要IEEE-754双ary64时,只需转换为double即可。
On '387, the cast generates an assignment and a load on both the above compilers; this does correctly round the 80-bit value to IEEE-754 binary64. This cost is very reasonable in my opinion. The exact time taken depends on the architecture and surrounding code; usually it is and can be interleaved with other code to bring the cost down to neglible levels. When MMX, SSE or AVX are available, their registers are separate from the 80-bit 80387 registers, and the cast usually is done by moving the value to the MMX/SSE/AVX register.
在“387”上,cast生成了上述两个编译器的赋值和负载;这将正确地将80位的值四舍五入到IEEE-754 binary64。我认为这个费用是很合理的。具体时间取决于架构和周围的代码;通常情况下,它是并且可以与其他代码交错使用,从而将成本降低到可以忽略的级别。当MMX、SSE或AVX可用时,它们的寄存器与80位80387寄存器是分开的,转换通常通过将值移动到MMX/SSE/AVX寄存器来完成。
(I prefer production code to use a specific floating-point type, say tempdouble
or such, for temporary variables, so that it can be defined to either double
or long double
depending on architecture and speed/precision tradeoffs desired.)
(我更喜欢使用特定的浮点类型的生产代码,比如tempdouble之类的,用于临时变量,这样它就可以根据架构和速度/精确的权衡来定义为double或long double。)
In a nutshell:
简而言之:
Don't assume
(expression)
is ofdouble
precision just because all the variables and literal constants are. Write it as(double)(expression)
if you want the result atdouble
precision.不要仅仅因为所有的变量和文字常量都是双精度的就认为(表达式)是双精度的。如果您想要得到双精度的结果,可以将其写成(double)(expression)。
This applies to compound expressions, too, and may sometimes lead to unwieldy expressions with many levels of casts.
这也适用于复合表达式,有时可能会导致具有多个级别的强制类型转换的笨拙表达式。
If you have expr1
and expr2
that you wish to compute at 80-bit precision, but also need the product of each rounded to 64-bit first, use
如果您希望以80位精度计算expr1和expr2,但还需要将每个四舍五入到64位的产品,请使用
long double expr1;
long double expr2;
double product = (double)(expr1) * (double)(expr2);
Note, product
is computed as a product of two 64-bit values; not computed at 80-bit precision, then rounded down. Calculating the product at 80-bit precision, then rounding down, would be
注意,产品被计算为两个64位值的产品;不是以80位精度计算,然后四舍五入。对产品进行80位精度的计算,然后取整。
double other = expr1 * expr2;
or, adding descriptive casts that tell you exactly what is happening,
或者,添加描述性强制转换,告诉你发生了什么,
double other = (double)((long double)(expr1) * (long double)(expr2));
It should be obvious that product
and other
often differ.
很明显,产品和其他产品经常不同。
The C99 casting rules are just another tool you must learn to wield, if you do work with mixed 32-bit/64-bit/80-bit/128-bit floating point values. Really, you encounter the exact same issues if you mix binary32 and binary64 floats (float
and double
on most architectures)!
如果您使用32位/64位/80位/128位浮点值的混合值,那么C99强制转换规则只是您必须学会使用的另一个工具。实际上,如果您混合使用binary32和binary64浮动(在大多数架构上是浮动和双浮动),您将会遇到相同的问题!
Perhaps rewriting Pascal Cuoq's exploration code, to correctly apply casting rules, makes this clearer?
也许重写Pascal Cuoq的探索代码,以正确地应用强制转换规则,会使这更清晰?
#include <stdio.h>
#define TEST(eq) printf("%-56s%s\n", "" # eq ":", (eq) ? "true" : "false")
int main(void)
{
double d = 1.0 / 10.0;
long double ld = 1.0L / 10.0L;
printf("sizeof (double) = %d\n", (int)sizeof (double));
printf("sizeof (long double) == %d\n", (int)sizeof (long double));
printf("\nExpect true:\n");
TEST(d == (double)(0.1));
TEST(ld == (long double)(0.1L));
TEST(d == (double)(1.0 / 10.0));
TEST(ld == (long double)(1.0L / 10.0L));
TEST(d == (double)(ld));
TEST((double)(1.0L/10.0L) == (double)(0.1));
TEST((long double)(1.0L/10.0L) == (long double)(0.1L));
printf("\nExpect false:\n");
TEST(d == ld);
TEST((long double)(d) == ld);
TEST(d == 0.1L);
TEST(ld == 0.1);
TEST(d == (long double)(1.0L / 10.0L));
TEST(ld == (double)(1.0L / 10.0));
return 0;
}
The output, with both GCC and clang, is
输出,有GCC和clang,是
sizeof (double) = 8
sizeof (long double) == 12
Expect true:
d == (double)(0.1): true
ld == (long double)(0.1L): true
d == (double)(1.0 / 10.0): true
ld == (long double)(1.0L / 10.0L): true
d == (double)(ld): true
(double)(1.0L/10.0L) == (double)(0.1): true
(long double)(1.0L/10.0L) == (long double)(0.1L): true
Expect false:
d == ld: false
(long double)(d) == ld: false
d == 0.1L: false
ld == 0.1: false
d == (long double)(1.0L / 10.0L): false
ld == (double)(1.0L / 10.0): false
except that recent versions of GCC promote the right hand side of ld == 0.1
to long double first (i.e. to ld == 0.1L
), yielding true
, and that with SSE/AVX, long double
is 128-bit.
除了最近版本的GCC将ld = 0.1的右侧提升为long double first(即到ld = 0.1 l),从而得到true,对于SSE/AVX, long double为128位。
For the pure '387 tests, I used
在387年的测试中,我使用了
gcc -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
clang -W -Wall -m32 -mfpmath=387 -mno-sse ... test.c -o test
with various optimization flag combinations as ...
, including -fomit-frame-pointer
, -O0
, -O1
, -O2
, -O3
, and -Os
.
使用各种优化标志组合作为…,包括-配方-帧指针,-O0, -O1, -O2, -O3,和-Os。
Using any other flags or C99 compilers should lead to the same results, except for long double
size (and ld == 1.0
for current GCC versions). If you encounter any differences, I'd be very grateful to hear about them; I may need to warn my users of such compilers (compiler versions). Note that Microsoft does not support C99, so they are completely uninteresting to me.
使用任何其他标志或C99编译器都应该得到相同的结果,除了长倍大小(当前GCC版本的ld == 1.0)。如果你们有任何分歧,我将非常感谢他们;我可能需要警告我的用户这些编译器(编译器版本)。注意,微软不支持C99,所以我对它们完全不感兴趣。
Pascal Cuoq does bring up an interesting problem in the comment chain below, which I didn't immediately recognize.
Pascal Cuoq在下面的评论链中提出了一个有趣的问题,我没有马上意识到。
When evaluating an expression, both GCC and clang with -mfpmath=387
specify that all expressions are evaluated using 80-bit precision. This leads to for example
在计算表达式时,GCC和clang都使用-mfpmath=387指定使用80位精度计算所有表达式。举个例子
7491907632491941888 = 0x1.9fe2693112e14p+62 = 110011111111000100110100100110001000100101110000101000000000000
5698883734965350400 = 0x1.3c5a02407b71cp+62 = 100111100010110100000001001000000011110110111000111000000000000
7491907632491941888 * 5698883734965350400 = 42695510550671093541385598890357555200 = 100000000111101101101100110001101000010100100001011110111111111111110011000111000001011101010101100011000000000000000000000000
yielding incorrect results, because that string of ones in the middle of the binary result is just at the difference between 53- and 64-bit mantissas (64 and 80-bit floating point numbers, respectively). So, while the expected result is
产生不正确的结果,因为二进制结果中间的字符串正好是53和64位mantissas(分别是64位和80位浮点数)之间的差值。所以,虽然预期的结果是。
42695510550671088819251326462451515392 = 0x1.00f6d98d0a42fp+125 = 100000000111101101101100110001101000010100100001011110000000000000000000000000000000000000000000000000000000000000000000000000
the result obtained with just -std=c99 -m32 -mno-sse -mfpmath=387
is
用-std=c99 -m32 -mno-sse -mfpmath=387得到的结果为
42695510550671098263984292201741942784 = 0x1.00f6d98d0a43p+125 = 100000000111101101101100110001101000010100100001100000000000000000000000000000000000000000000000000000000000000000000000000000
In theory, you should be able to tell gcc and clang to enforce the correct C99 rounding rules by using options
理论上,您应该能够告诉gcc和clang使用选项来执行正确的C99舍入规则
-std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard
However, this only affects expressions the compiler optimizes, and does not seem to fix the 387 handling at all. If you use e.g. clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=standard test.c -o test && ./test
with test.c
being Pascal Cuoq's example program, you will get the correct result per IEEE-754 rules -- but only because the compiler optimizes away the expression, not using the 387 at all.
但是,这只会影响编译器优化的表达式,而且似乎根本不会修复387处理。如clang -O1 -std=c99 -m32 -mno-sse -mfpmath=387 -ffloat-store -fexcess-precision=标准检验。c -o测试& ./测试与测试。c是Pascal Cuoq的示例程序,您将得到每个IEEE-754规则的正确结果——但这只是因为编译器优化了表达式,而不是使用了387。
Simply put, instead of computing
简单地说,而不是计算
(double)d1 * (double)d2
both gcc and clang actually tell the '387 to compute
实际上,gcc和clang都告诉387要计算
(double)((long double)d1 * (long double)d2)
This is indeed I believe this is a compiler bug affecting both gcc-4.6.3 and clang-llvm-3.0, and an easily reproduced one. (Pascal Cuoq points out that FLT_EVAL_METHOD=2
means operations on double-precision arguments is always done at extended precision, but I cannot see any sane reason -- aside from having to rewrite parts of libm
on '387 -- to do that in C99 and considering IEEE-754 rules are achievable by the hardware! After all, the correct operation is easily achievable by the compiler, by modifying the '387 control word to match the precision of the expression. And, given the compiler options that should force this behaviour -- -std=c99 -ffloat-store -fexcess-precision=standard
-- make no sense if FLT_EVAL_METHOD=2
behaviour is actually desired, there is no backwards compatibility issues, either.) It is important to note that given the proper compiler flags, expressions evaluated at compile time do get evaluated correctly, and that only expressions evaluated at run time get incorrect results.
我认为这确实是一个影响gcc-4.6.3和clang-llvm-3.0的编译错误,并且是一个容易复制的错误。(Pascal Cuoq指出,FLT_EVAL_METHOD=2意味着对双精度参数的操作总是在扩展精度上完成,但我看不出有什么合理的理由——除了不得不重写libm在387上的部分——在C99中实现这一点,并且考虑到IEEE-754规则是由硬件实现的!毕竟,编译器通过修改'387控制字以匹配表达式的精度,很容易实现正确的操作。而且,考虑到应该强制执行这种行为的编译器选项——-std=c99 -ffloat-store -fexcess-precision=standard——如果实际需要FLT_EVAL_METHOD=2行为,那么也没有向后兼容性问题。需要注意的是,给定适当的编译器标志,在编译时求值的表达式将得到正确的求值,并且只有在运行时求值的表达式才会得到不正确的结果。
The simplest workaround, and the portable one, is to use fesetround(FE_TOWARDZERO) (from fenv.h) to round all results towards zero.
In some cases, rounding towards zero may help with predictability and pathological cases. In particular, for intervals like x = [0,1), rounding towards zero means the upper limit is never reached through rounding; this is important if you evaluate e.g. piecewise splines.
For the other rounding modes, you need to control the 387 hardware directly.
You can use either _FPU_SETCW() from #include <fpu_control.h>, or open-code it. For example, precision.c:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#define FP387_NEAREST 0x0000
#define FP387_ZERO 0x0C00
#define FP387_UP 0x0800
#define FP387_DOWN 0x0400
#define FP387_SINGLE 0x0000
#define FP387_DOUBLE 0x0200
#define FP387_EXTENDED 0x0300
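/* Load a new '387 control word: keep only the precision (bits 8-9) and
   rounding (bits 10-11) fields of 'control', and mask all exceptions. */
static inline void fp387(const unsigned short control)
{
    unsigned short cw = (control & 0x0F00) | 0x007f;
    __asm__ volatile ("fldcw %0" : : "m" (*&cw));
}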
/* Return the bits of 'value' as a string, most significant bit first.
   Assumes little-endian byte order (true on x86); uses a static buffer. */
const char *bits(const double value)
{
    const unsigned char *const data = (const unsigned char *)&value;
    static char buffer[CHAR_BIT * sizeof value + 1];
    char *p = buffer;
    size_t i = CHAR_BIT * sizeof value;
    while (i-->0)
        *(p++) = '0' + !!(data[i / CHAR_BIT] & (1U << (i % CHAR_BIT)));
    *p = '\0';
    return (const char *)buffer;
}
int main(int argc, char *argv[])
{
    double d1, d2;
    char dummy;

    if (argc != 3) {
        fprintf(stderr, "\nUsage: %s 7491907632491941888 5698883734965350400\n\n", argv[0]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[1], " %lf %c", &d1, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[1]);
        return EXIT_FAILURE;
    }
    if (sscanf(argv[2], " %lf %c", &d2, &dummy) != 1) {
        fprintf(stderr, "%s: Not a number.\n", argv[2]);
        return EXIT_FAILURE;
    }

    printf("%s:\td1 = %.0f\n\t %s in binary\n", argv[1], d1, bits(d1));
    printf("%s:\td2 = %.0f\n\t %s in binary\n", argv[2], d2, bits(d2));

    printf("\nDefaults:\n");
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to nearest integer:\n");
    fp387(FP387_EXTENDED | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to nearest integer:\n");
    fp387(FP387_DOUBLE | FP387_NEAREST);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nExtended precision, rounding to zero:\n");
    fp387(FP387_EXTENDED | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    printf("\nDouble precision, rounding to zero:\n");
    fp387(FP387_DOUBLE | FP387_ZERO);
    printf("Product = %.0f\n\t %s in binary\n", d1 * d2, bits(d1 * d2));

    return 0;
}
Using clang-llvm-3.0 to compile and run, I get the correct results,
clang -std=c99 -m32 -mno-sse -mfpmath=387 -O3 -W -Wall precision.c -o precision
./precision 7491907632491941888 5698883734965350400
7491907632491941888: d1 = 7491907632491941888
0100001111011001111111100010011010010011000100010010111000010100 in binary
5698883734965350400: d2 = 5698883734965350400
0100001111010011110001011010000000100100000001111011011100011100 in binary
Defaults:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Extended precision, rounding to nearest integer:
Product = 42695510550671098263984292201741942784
0100011111000000000011110110110110011000110100001010010000110000 in binary
Double precision, rounding to nearest integer:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Extended precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
Double precision, rounding to zero:
Product = 42695510550671088819251326462451515392
0100011111000000000011110110110110011000110100001010010000101111 in binary
In other words, you can work around the compiler issues by using fp387() to set the precision and rounding mode.
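For comparison, here is a sketch of the same idea using the glibc macros instead of the open-coded fp387(). It is specific to glibc on x86, and product_with_precision is a name I made up. It assumes long double arithmetic is done on the x87 unit (as it is on x86 and x86-64 with GCC or Clang), so the precision-control field decides how the product is rounded. Unlike fp387(), it preserves the other control-word bits and restores the original control word afterwards.

```c
#include <assert.h>
#include <fpu_control.h>   /* glibc, x86 only */

/* Multiply a and b on the x87 with the precision-control field set to
   'precision' (_FPU_SINGLE, _FPU_DOUBLE or _FPU_EXTENDED), then restore
   the caller's control word. */
static double product_with_precision(double a, double b,
                                     fpu_control_t precision)
{
    fpu_control_t saved, cw;
    volatile long double x = a, y = b;  /* long double forces x87 code */
    volatile long double wide;

    assert(precision == _FPU_SINGLE ||
           precision == _FPU_DOUBLE ||
           precision == _FPU_EXTENDED);

    _FPU_GETCW(saved);
    /* _FPU_EXTENDED (0x300) also serves as the precision-field mask. */
    cw = (fpu_control_t)((saved & ~_FPU_EXTENDED) | precision);
    _FPU_SETCW(cw);

    wide = x * y;           /* rounded to the selected significand width */

    _FPU_SETCW(saved);
    return (double)wide;    /* exact if the product was rounded to 53 bits */
}
```

With _FPU_DOUBLE the product should match the "Double precision" lines in the log above, and with _FPU_EXTENDED the "Extended precision" ones, since the final conversion to double then rounds a second time.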
The downside is that some math libraries (libm.a, libm.so) may be written with the assumption that intermediate results are always computed at 80-bit precision. At least the GNU C library fpu_control.h on x86_64 has the comment "libm requires extended precision". Fortunately, if you need the math.h functionality, you can take the '387 implementations from e.g. the GNU C library and implement them in a header file, or write a known-to-work libm; in fact, I think I might be able to help there.
#2
For the record, below is what I found by experimentation. The following program shows various behaviors when compiled with Clang:
#include <stdio.h>

int r1, r2, r3, r4, r5, r6, r7;
double ten = 10.0;

int main(int c, char **v)
{
    r1 = 0.1 == (1.0 / ten);
    r2 = 0.1 == (1.0 / 10.0);
    r3 = 0.1 == (double) (1.0 / ten);
    r4 = 0.1 == (double) (1.0 / 10.0);
    ten = 10.0;
    r5 = 0.1 == (1.0 / ten);
    r6 = 0.1 == (double) (1.0 / ten);
    r7 = ((double) 0.1) == (1.0 / 10.0);
    printf("r1=%d r2=%d r3=%d r4=%d r5=%d r6=%d r7=%d\n", r1, r2, r3, r4, r5, r6, r7);
}
The results vary with the optimization level:
$ clang -v
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)
$ clang -mno-sse2 -std=c99 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=0 r7=1
$ clang -mno-sse2 -std=c99 -O2 t.c && ./a.out
r1=0 r2=1 r3=0 r4=1 r5=1 r6=1 r7=1
The cast (double) that differentiates r5 and r6 at -O2 has no effect at -O0 and for variables r3 and r4. The result r1 is different from r5 at all optimization levels, whereas r6 only differs from r3 at -O2.