Fixing the arithmetic error in the distributed version

Time: 2022-12-25 20:31:15

I am inverting a matrix via a Cholesky factorization in a distributed environment, as discussed here. My code works fine, but in order to verify that my distributed project produces correct results, I had to compare it against the serial version. The results are not exactly the same!

For example, the last five cells of the result matrix are:

serial gives:
-250207683.634793 -1353198687.861288 2816966067.598196 -144344843844.616425 323890119928.788757
distributed gives:
-250207683.634692 -1353198687.861386 2816966067.598891 -144344843844.617096 323890119928.788757

I posted about this on the Intel forum, but the answer I got was about getting the same results across all executions of the distributed version, something I already had. They seem (in another thread) to be unable to answer this:

How can I get the same results between serial and distributed execution? Is this possible? This would fix the arithmetic error.

I have tried setting mkl_cbwr_set(MKL_CBWR_AVX); and using mkl_malloc() to align memory, but nothing changed. I get identical results only when I spawn a single process for the distributed version (which makes it almost serial)!
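
A minimal sketch of that setup, for reference; the matrix order n and its contents are placeholders, and the actual factorization calls are elided:

#include <cstdio>
#include "mkl.h"

int main()
{
  // Request conditional numerical reproducibility (CNR) for the AVX code path.
  // This must be called before any other MKL routine in the program.
  if (mkl_cbwr_set(MKL_CBWR_AVX) != MKL_CBWR_SUCCESS)
    std::printf("CNR mode MKL_CBWR_AVX is not supported on this CPU\n");

  const int n = 1000;  // placeholder matrix order
  // 64-byte aligned allocation, as MKL recommends.
  double *a = static_cast<double*>(mkl_malloc(sizeof(double) * n * n, 64));

  // ... fill a and call dpotrf()/dpotri(), or their ScaLAPACK equivalents ...

  mkl_free(a);
  return 0;
}

(As noted above, this pins results across repeated runs of one configuration; it did not make the distributed results match the serial ones.)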

The distributed routines I am calling: pdpotrf() and pdpotri().

The serial routines I am calling: dpotrf() and dpotri().
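
For reference, the serial pair through the LAPACKE interface looks roughly like this (a sketch; invert_spd is a made-up helper name, and the matrix is assumed to be column-major with its lower triangle stored):

#include "mkl_lapacke.h"  // LAPACKE interface to dpotrf()/dpotri(); use <lapacke.h> outside MKL

// Invert a symmetric positive definite matrix in place via Cholesky.
// 'a' holds the lower triangle of an n x n column-major matrix.
lapack_int invert_spd(double *a, lapack_int n)
{
  lapack_int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', n, a, n);  // a = L*L^T
  if (info != 0) return info;                                        // not SPD, or bad argument
  return LAPACKE_dpotri(LAPACK_COL_MAJOR, 'L', n, a, n);             // a = (L*L^T)^-1
}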

2 Solutions

#1 (score: 4)

Your differences seem to appear at about the 12th significant figure. Since floating-point arithmetic is not truly associative (that is, f-p arithmetic does not guarantee that a+(b+c) == (a+b)+c), and since parallel execution does not, in general, apply operations in a deterministic order, these small differences are typical of parallelised numerical codes when compared to their serial equivalents. Indeed, you may observe the same order of difference when running on a different number of processors, 4 vs 8, say.
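
A tiny self-contained illustration of that non-associativity (the values are chosen so the rounding is easy to see):

#include <cstdio>

int main()
{
  double a = 1.0, b = 1e20, c = -1e20;
  // (a + b) + c: the 1.0 is absorbed when added to 1e20, so the result is 0
  // a + (b + c): b and c cancel exactly first, so the result is 1
  std::printf("%.1f %.1f\n", (a + b) + c, a + (b + c));
  return 0;
}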

Unfortunately, the easy way to get deterministic results is to stick to serial execution. Getting deterministic results from parallel execution requires a major effort to be very specific about the order of execution of operations, right down to the last + or *, which almost certainly rules out the use of most numeric libraries and leads you to painstaking manual coding of large numeric routines.
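
To make "very specific about the order of execution" concrete, here is a sketch of the idea for a single global sum with MPI (fixed_order_sum is a made-up helper): rather than letting MPI_Reduce combine partial sums in an implementation-defined order, gather them and accumulate in a fixed rank order. This pins the result for a given process count and decomposition; it still will not match the serial result in general.

#include <mpi.h>
#include <vector>

// Deterministic alternative to MPI_Reduce(..., MPI_SUM, ...): gather every
// rank's partial sum and accumulate left-to-right in rank order on rank 0.
double fixed_order_sum(double partial, MPI_Comm comm)
{
  int rank, size;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  std::vector<double> partials(size);
  MPI_Gather(&partial, 1, MPI_DOUBLE, partials.data(), 1, MPI_DOUBLE, 0, comm);

  double total = 0.0;
  if (rank == 0)
    for (int r = 0; r < size; r++)  // fixed summation order
      total += partials[r];
  MPI_Bcast(&total, 1, MPI_DOUBLE, 0, comm);
  return total;
}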

In most cases that I've encountered, the accuracy of the input data, often derived from sensors, does not warrant worrying about the 12th or later significant figure. I don't know what your numbers represent, but for many scientists and engineers agreement to the 4th or 5th s.f. is equality enough for all practical purposes. It's a different matter for mathematicians ...

#2 (score: 2)

As the other answer mentions, getting exactly the same results from serial and distributed execution is not guaranteed. One common technique with HPC/distributed workloads is to validate the solution. There are a number of techniques, from calculating relative error to more complex validation schemes, like the one used by HPL. Here is a simple C++ function that calculates relative error. As @HighPerformanceMark notes in his answer, the analysis of this sort of numerical error is incredibly complex; this is a very simple method, and there is a lot of info available online about the topic.

#include <iostream>
#include <cmath>

// Relative error of the distributed answer x against the serial answer a.
double calc_error(double a, double x)
{
  return std::abs(x - a) / std::abs(a);
}

int main(void)
{
  // The last five cells of the result matrix, from the question.
  double sans[] = {-250207683.634793, -1353198687.861288, 2816966067.598196, -144344843844.616425, 323890119928.788757};
  double pans[] = {-250207683.634692, -1353198687.861386, 2816966067.598891, -144344843844.617096, 323890119928.788757};
  double err[5];
  std::cout << "Serial Answer,Distributed Answer, Error" << std::endl;
  for (int it = 0; it < 5; it++) {
    err[it] = calc_error(sans[it], pans[it]);
    std::cout << sans[it] << "," << pans[it] << "," << err[it] << "\n";
  }
  return 0;
}

Which produces this output:

Serial Answer,Distributed Answer, Error
-2.50208e+08,-2.50208e+08,4.03665e-13
-1.3532e+09,-1.3532e+09,7.24136e-14
2.81697e+09,2.81697e+09,2.46631e-13
-1.44345e+11,-1.44345e+11,4.65127e-15
3.2389e+11,3.2389e+11,0

As you can see, the error in every case is on the order of 10^-13 or less, and in one case it is zero. Depending on the problem you are trying to solve, error of this order of magnitude could be considered acceptable. Hopefully this helps to illustrate one way of validating a distributed solution against a serial one, or at least gives one way to show how far apart the parallel and serial algorithms are.

When validating answers for big problems and parallel algorithms, it can also be valuable to perform several runs of the parallel algorithm, saving the results of each run. You can then check whether the result and/or error varies from run to run or settles over time.
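
A sketch of that idea, reusing calc_error from the snippet above (max_error is a made-up helper): compute the worst-case relative error of each run against the serial reference and log it per run, so any drift across runs becomes visible.

#include <algorithm>
#include <cmath>

double calc_error(double a, double x)
{
  return std::abs(x - a) / std::abs(a);
}

// Worst-case relative error of one distributed run against the serial reference.
double max_error(const double *serial, const double *dist, int n)
{
  double worst = 0.0;
  for (int i = 0; i < n; i++)
    worst = std::max(worst, calc_error(serial[i], dist[i]));
  return worst;  // log this once per run and compare across runs
}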

Showing that a parallel algorithm produces error within acceptable thresholds over 1000 runs (just an example; the more data the better for this sort of thing) for various problem sizes is one way to assess the validity of a result.

In the past, when I have performed benchmark testing, I noticed wildly varying behavior for the first several runs, before the servers had "warmed up". At the time I never bothered to check whether the error in the result stabilized over time the same way performance did, but it would be interesting to see.
