How do I find the closest unequal value to a floating-point number? (duplicate)

Time: 2021-12-17 21:31:34

A float (a.k.a. single) is a 4-byte value that is supposed to represent any real-valued number. Because of the way it is formatted and the finite number of bytes it is made of, there is a minimum and a maximum value it can represent, and it has a finite precision that depends on its own value.

I would like to know if there is a way to get the closest possible value above or below some reference value, given the finite precision of a float. With integers, this is trivial: one simply adds or subtracts 1. But with a float, you can't simply add or subtract the minimum float value and expect the result to differ from your original value. For example:

float FindNearestSmaller (const float a)
{
    return a - FLT_MIN; /* This doesn't necessarily work */
}

In fact, the above will almost never work. The return value will generally still equal a, because FLT_MIN is far smaller than the spacing between a and its neighbors. You can easily try this out for yourself: it works for 0.0f, or for very small numbers of the order of FLT_MIN, but not for ordinary values such as those between 1 and 100.

So how would you get the value that is closest but smaller or larger than a, given floating point precision?

Note: Though I am mainly interested in a C/C++ answer, I assume the answer will be applicable to most programming languages.

3 solutions

#1


13  

The standard way to find a floating-point value's neighbors is the function nextafter for double and nextafterf for float. The second argument gives the direction. Remember that infinities are legal values in IEEE 754 floating-point, so you can very well call nextafter(x, +1.0/0.0) to get the value immediately above x, and this will work even for DBL_MAX (whereas if you wrote nextafter(x, DBL_MAX), it would return DBL_MAX when applied for x == DBL_MAX).

Two non-standard ways that are sometimes useful are:

  1. access the representation of the float/double as an unsigned integer of the same size, and increment or decrement this integer. The floating-point format was carefully designed so that for positive floats, and respectively for negative floats, the bits of the representation, seen as an integer, vary monotonically with the represented float.

  2. change the rounding mode to upward, and add the smallest positive floating-point number. The smallest positive floating-point number is also the smallest increment that there can be between two floats, so this will never skip any float. The smallest positive floating-point number is FLT_MIN * FLT_EPSILON.

For the sake of completeness, I will add that even without changing the rounding mode from its “to nearest” default, multiplying a float by (1.0f + FLT_EPSILON) produces a number that is either the immediate neighbor away from zero, or the neighbor after that. It is probably the cheapest if you already know the sign of the float you wish to increase/decrease and you don't mind that it sometimes does not produce the immediate neighbor. Functions nextafter and nextafterf are specified in such a way that a correct implementation on the x86 must test for a number of special values and FPU states, and is thus rather costly for what it does.

To go towards zero, multiply by 1.0f - FLT_EPSILON.

This doesn't work for 0.0f, obviously, nor in general for the smaller denormalized numbers.

The values for which multiplying by 1.0f + FLT_EPSILON advances by 2 ULPs are just below a power of two, specifically in the interval [0.75 * 2^p … 2^p). If you don't mind doing a multiplication and an addition, x + (x * (FLT_EPSILON * 0.74)) should work for all normal numbers (but still not for zero nor for all the small denormal numbers).

#2


11  

Look at the "nextafter" function, which is part of Standard C (and probably C++, but I didn't check).

#3


0  

I tried it out on my machine, and all three approaches:
1. adding 1 to the integer representation (copied with memcpy)
2. adding FLT_EPSILON
3. multiplying by (1.0f + FLT_EPSILON)
seem to give the same answer.


See the result here:
bash-3.2$ cc float_test.c -o float_test; ./float_test 1.023456 10
Original num: 1.023456
int added = 1.023456 01-eps added = 1.023456 mult by 01*(eps+1) = 1.023456
int added = 1.023456 02-eps added = 1.023456 mult by 02*(eps+1) = 1.023456
int added = 1.023456 03-eps added = 1.023456 mult by 03*(eps+1) = 1.023456
int added = 1.023456 04-eps added = 1.023456 mult by 04*(eps+1) = 1.023456
int added = 1.023457 05-eps added = 1.023457 mult by 05*(eps+1) = 1.023457
int added = 1.023457 06-eps added = 1.023457 mult by 06*(eps+1) = 1.023457
int added = 1.023457 07-eps added = 1.023457 mult by 07*(eps+1) = 1.023457
int added = 1.023457 08-eps added = 1.023457 mult by 08*(eps+1) = 1.023457
int added = 1.023457 09-eps added = 1.023457 mult by 09*(eps+1) = 1.023457
int added = 1.023457 10-eps added = 1.023457 mult by 10*(eps+1) = 1.023457

Code

#include <float.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int main(int argc, char *argv[])
{

    if(argc != 3) {
        printf("Usage: <binary> <floating_pt_num> <num_iter>\n");
        exit(0);
    }

    float f = atof(argv[1]);
    int count = atoi(argv[2]);

    assert(count > 0);

    int i;
    int num;
    float num_float;

    printf("Original num: %f\n", f);
    for(i=1; i<=count; i++) {
        memcpy(&num, &f, 4);
        num += i;
        memcpy(&num_float, &num, 4);
        printf("int added = %f \t%02d-eps added = %f \tmult by %2d*(eps+1) = %f\n", num_float, i, f + i*FLT_EPSILON, i, f*(1.0f + i*FLT_EPSILON));
    }

    return 0;
}
