Handling loss of precision when subtracting two doubles that are close to each other

Time: 2023-02-11 14:18:51

I have a project in which we are to solve the matrix equation AX=B for X, given that A is a tridiagonal matrix. I did this project in C++ and got the program to produce the right matrix X, but when trying to report the error back to the user, A*X-B, I get an erroneous error! This is because I am subtracting A*X and B, whose entries are arbitrarily close to each other. I had two ideas on how to handle this, element by element:

  1. According to this article, http://en.wikipedia.org/wiki/Loss_of_significance, there could be as many as -log2(1-y/x) bits lost in the straight subtraction x-y. Let's scale both x and y by pow(2,bitsLost), subtract the two, and then scale the result back down by dividing by pow(2,bitsLost).

  2. Stressed so much in the numerical methods course this is for: take the arithmetic conjugate! Instead of double difference = x-y; use double difference = (x*x-y*y)/(x+y);


OK, so why haven't you chosen a method and moved on?

I tried all three methods (including straight subtraction) here: http://ideone.com/wfkEUp . I would like to know two things:


  1. Between the "scaling and descaling" method (for which I intentionally chose a power of two) and the arithmetic conjugate method, which one produces less error (in terms of subtracting the large numbers)?

  2. Which method is computationally more efficient? /*For this, I was going to say the scaling method would be more efficient, with linear complexity versus the seemingly quadratic complexity of the conjugate method, but I don't know the complexity of log2()*/

Any and all help would be welcomed!!


P.S.: All three methods seem to return the same double in the sample code...



Let's see some of your code

No problem; here is my Matrix.cpp code:

#include "ExceptionType.h"
#include "Matrix.h"
#include "MatrixArithmeticException.h"
#include <iomanip>
#include <iostream>
#include <vector>

Matrix::Matrix()
{
    //default size for Matrix is 1 row and 1 column, whose entry is 0
    std::vector<long double> rowVector(1,0);
    this->matrixData.assign(1, rowVector);
}

Matrix::Matrix(const std::vector<std::vector<long double> >& data)
{
    this->matrixData = data;
    //validate matrixData
    validateData();
}

//getter functions
//Recall that matrixData is a vector of a vector, whose elements should be accessed like matrixData[row][column].
//Each rowVector should have the same size.
unsigned Matrix::getRowCount() const { return matrixData.size(); }

unsigned Matrix::getColumnCount() const { return matrixData[0].size(); }

//matrix validator should just append zeroes into row vectors that are of smaller dimension than they should be...
void Matrix::validateData()
{
    //fetch the size of the largest-dimension rowVector
    unsigned largestSize = 0;
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        if (largestSize < matrixData[i].size())
            largestSize = matrixData[i].size();
    }
    //make sure that all rowVectors are of that dimension
    for (unsigned i = 0; i < getRowCount(); i++)
    {
        //if we find a rowVector where this isn't the case
        if (matrixData[i].size() < largestSize)
        {
            //add zeroes to it so that it becomes the case
            matrixData[i].insert(matrixData[i].end(), largestSize-matrixData[i].size(), 0);
        }
    }

}
//operators
//+ and - operators should check to see if the size of the first matrix is exactly the same size as that of the second matrix
Matrix Matrix::operator+(const Matrix& B)
{
    //if the sizes coincide
    if ((getRowCount() == B.getRowCount()) && (getColumnCount() == B.getColumnCount()))
    {
        //declare the matrixData
        std::vector<std::vector<long double> > summedData = B.matrixData;    //since we are in the scope of the Matrix, we can access private data members
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < getColumnCount(); j++)
            {
                summedData[i][j] += matrixData[i][j];   //add the elements together
            }
        }
        //return result Matrix
        return Matrix(summedData);
    }
    else
        throw MatrixArithmeticException(DIFFERENT_DIMENSIONS);
}

Matrix Matrix::operator-(const Matrix& B)
{
    //declare negativeB
    Matrix negativeB = B;
    //negate all entries
    for (unsigned i = 0; i < negativeB.getRowCount(); i++)
    {
        for (unsigned j = 0; j < negativeB.getColumnCount(); j++)
        {
            negativeB.matrixData[i][j] = 0-negativeB.matrixData[i][j];
        }
    }
    //simply add the negativeB
    try
    {
        return ((*this)+negativeB);
    }
    catch (MatrixArithmeticException& mistake)
    {
        //report the problem, then rethrow so this function never falls off the end without returning a Matrix
        std::cout << mistake.what() << std::endl;
        throw;
    }
}

Matrix Matrix::operator*(const Matrix& B)
{
    //the columnCount of the left operand must be equal to the rowCount of the right operand
    if (getColumnCount() == B.getRowCount())
    {
        //if it is, declare data with getRowCount() rows and B.getColumnCount() columns
        std::vector<long double> zeroVector(B.getColumnCount(), 0);
        std::vector<std::vector<long double> > data(getRowCount(), zeroVector);
        for (unsigned i = 0; i < getRowCount(); i++)
        {
            for (unsigned j = 0; j < B.getColumnCount(); j++)
            {
                long double sum = 0; //set sum to zero
                for (unsigned k = 0; k < getColumnCount(); k++)
                {
                    //add the product of matrixData[i][k] and B.matrixData[k][j] to sum
                    sum += (matrixData[i][k]*B.matrixData[k][j]);
                }
                data[i][j] = sum;   //assign the sum to data
            }
        }
        return Matrix(data);
    }
    else
    {
        throw MatrixArithmeticException(ROW_COLUMN_MISMATCH); //dimension mismatch
    }
}

std::ostream& operator<<(std::ostream& outputStream, const Matrix& theMatrix)
{
    //Here, you should use the << again, just like you would for ANYTHING ELSE.
    //first, print a newline
    outputStream << "\n";
    //setting precision (optional)
    outputStream.precision(11);
    for (unsigned i = 0; i < theMatrix.getRowCount(); i++)
    {
        //print '['
        outputStream << "[";
        //format stream(optional)
        for (unsigned j = 0; j < theMatrix.getColumnCount(); j++)
        {
            //print numbers
            outputStream << std::setw(17) << theMatrix.matrixData[i][j];
            //print ", "
            if (j < theMatrix.getColumnCount() - 1)
                outputStream << ", ";
        }
        //print ']'
        outputStream << "]\n";
    }
    return outputStream;
}

6 answers

#1


2  

You computed two numbers x and y which are of a limited precision floating point type. This means that they are already rounded somehow, meaning loss of precision while computing the result. If you subtract those numbers afterwards, you compute the difference between those two already rounded numbers.


The formula you wrote gives you the maximum error for computing the difference, but this error is with regard to the stored intermediate results x and y (again: rounded). No method other than x-y will give you a "better" result (in terms of the complete computation, not only the difference). To put it in a nutshell: the difference can't be more accurate using any formula other than x-y.

I'd suggest taking a look at arbitrary-precision arithmetic libraries like GMP, or at a linear algebra library such as Eigen, which can be used with custom multiprecision scalar types. Use such libraries for computing your equation system. Don't use double for the matrix computations. This way you can make sure that the intermediate results x and y (or the matrices Ax and B) are as precise as you want them to be, for example 512 bits, which should be enough for most cases.
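To make the point about x-y concrete, here is a minimal sketch of the three candidates from the question written as standalone functions. The names diffPlain, diffScaled and diffConjugate are made up for illustration and do not come from the question's code. The sketch assumes IEEE 754 doubles and that the scaling step neither overflows nor underflows.

```cpp
#include <cassert>
#include <cmath>

// Plain subtraction: one correctly rounded floating-point operation.
double diffPlain(double x, double y) {
    return x - y;
}

// "Scale and descale": multiply both operands by 2^k, subtract, divide by 2^k.
// Multiplying by a power of two only changes the exponent, so (assuming no
// overflow or underflow) no extra significant bits are gained.
double diffScaled(double x, double y, int k) {
    const double s = std::ldexp(1.0, k);  // s = 2^k
    return (x * s - y * s) / s;
}

// "Conjugate" form: (x^2 - y^2) / (x + y).
// Each square, the sum and the division may round, so this form can only
// add rounding error on top of the stored x and y.
double diffConjugate(double x, double y) {
    return (x * x - y * y) / (x + y);
}
```

Under those assumptions diffScaled should return exactly what diffPlain returns, which is consistent with the P.S. in the question. Since x - y is already the correctly rounded difference of the stored values, the conjugate form cannot improve on it and may be slightly worse.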

#2


1  

Finite precision floating point data types cannot represent all possible real values. There are an infinite number of different values, and so it is easy to see that not all values are representable in a type of finite size.


So it's perfectly plausible that your true solution will be a non-representable value. No amount of trickery can get you an exact solution in the finite data type.


You need to re-calibrate your expectations to match the reality of finite precision floating point data types. The starting point is "What Every Computer Scientist Should Know About Floating-Point Arithmetic".
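A small sketch of what "non-representable" means in practice, assuming IEEE 754 doubles. The helper name representationGap is hypothetical and only serves the illustration: it returns how far a computed sum lands from the double nearest to the value you expected.

```cpp
#include <cassert>
#include <cmath>

// Returns (a + b) - expected, all evaluated in double.
// A nonzero result means that a, b or expected could not be stored exactly,
// so the rounded sum does not coincide with the rounded expected value.
double representationGap(double a, double b, double expected) {
    return (a + b) - expected;
}
```

With typical double arithmetic, representationGap(0.1, 0.2, 0.3) is a tiny nonzero number because none of the three decimal values has a finite binary expansion, while representationGap(0.5, 0.25, 0.75) is zero because all three are sums of powers of two.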

#3


1  

To all the people answering the question: I knew, and figured out by accident, that the cardinality of the set of all possible doubles was finite. I suppose I have no choice but to either try a higher-precision number, or create my own class that represents a HugeDecimal.


#4


0  

You cannot expect to get infinite precision with floating point numbers. You should consider what precision is needed, and then choose the simplest method that satisfies your needs. So if you get the same result then stick with normal subtraction and use an epsilon as suggested in V-X's answer.


How do you end up with an O(n^2) complexity for the conjugate method? You have a fixed set of operations: two multiplications, one addition, one subtraction and one division. Assuming each of these operations is O(1), you get O(n) for applying it to n numbers.

#5


0  

While this may not help you choose a method, a while ago I wrote a tool that may help you choose a precision based on the sorts of values you're expecting:


http://riot.so/floatprecision.html

As other answers have said, you can't expect to get infinite precision with floating point, but you can use tools such as this to obtain the minimum increment and decrement size of a given number, and work out what is the optimal precision to use to get the accuracy you need.

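If you would rather compute the increment and decrement sizes yourself, the standard library's std::nextafter gives the neighbouring representable value. The following is a minimal sketch; the helper names stepUp and stepDown are invented here and are not taken from the linked tool.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between v and the next representable double above it.
double stepUp(double v) {
    return std::nextafter(v, std::numeric_limits<double>::infinity()) - v;
}

// Gap between v and the next representable double below it.
double stepDown(double v) {
    return v - std::nextafter(v, -std::numeric_limits<double>::infinity());
}
```

For example, stepUp(1.0) equals std::numeric_limits<double>::epsilon(), and the gap grows in proportion to the magnitude of v, so a residual entry smaller than the step size around the corresponding entry of B cannot be distinguished from rounding noise.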

#6


0  

Replace the equality check with a check for whether the difference is bigger than some given epsilon (a constant representing the minimal distinguishable difference).
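One way such an epsilon check could look for the A*X versus B comparison is sketched below. The function names and the choice of a relative tolerance with an absolute floor are assumptions made for this example, not something prescribed by the answer; the vectors stand in for one column of A*X and of B.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True when a and b agree to within relTol relative to the larger magnitude,
// with absTol acting as a floor for values close to zero.
bool nearlyEqual(double a, double b, double relTol, double absTol) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= std::max(absTol, relTol * scale);
}

// Compares the computed product A*X with B entry by entry.
// Returns false on a size mismatch or as soon as one entry is out of tolerance.
bool residualIsSmall(const std::vector<double>& ax,
                     const std::vector<double>& b,
                     double relTol, double absTol) {
    if (ax.size() != b.size()) return false;
    for (std::size_t i = 0; i < ax.size(); ++i) {
        if (!nearlyEqual(ax[i], b[i], relTol, absTol)) return false;
    }
    return true;
}
```

A call such as residualIsSmall(ax, b, 1e-12, 1e-300) then reports whether the solution is acceptable, and the tolerances can be tuned to the precision the application actually needs.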
