What algorithm should I use for high-performance large integer division?

Date: 2022-07-19 16:52:29

I am encoding large integers into an array of size_t. I already have the other operations working (add, subtract, multiply), as well as division by a single digit. But I would like to match the time complexity of my multiplication algorithm if possible (currently Toom-Cook).

I gather there are linear time algorithms for taking various notions of multiplicative inverse of my divisor. This means I could theoretically achieve division in the same time complexity as my multiplication, because the linear-time operation is "insignificant" by comparison anyway.

My question is, how do I actually do that? What type of multiplicative inverse is best in practice? Modulo 64^digitcount? When I multiply the multiplicative inverse by my dividend, can I skip computing the part of the data that would be thrown away due to integer truncation? Can anyone provide C or C++ pseudocode or give a precise explanation of how this should be done?

Or is there a dedicated division algorithm that is even better than the inverse-based approach?

Edit: I dug up where I got the "inverse" approach mentioned above. On page 312 of "The Art of Computer Programming, Volume 2: Seminumerical Algorithms", Knuth provides "Algorithm R", which is a high-precision reciprocal. He says its time complexity is less than that of multiplication. It is, however, nontrivial to convert it to C and test it, and it is unclear how much memory overhead etc. it will consume until I code it up, which would take a while. I'll post it if no one beats me to it.

2 answers

#1


The GMP library is usually a good reference for good algorithms. Their documented algorithms for division mainly depend on choosing a very large base, so that you're dividing a 4 digit number by a 2 digit number, and then proceed via long division.

Long division will require computing 2 digit by 1 digit quotients; this can either be done recursively, or by precomputing an inverse and estimating the quotient as you would with Barrett reduction.

When dividing a 2n-bit number by an n-bit number, the recursive version costs O(M(n) log(n)), where M(n) is the cost of multiplying n-bit numbers.

The version using Barrett reduction will cost O(M(n)) if you use Newton's algorithm to compute the inverse, but according to GMP's documentation, the hidden constant is a lot larger, so this method is only preferable for very large divisions.
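
As a rough illustration of the Newton step mentioned above, here is a minimal single-word sketch that computes floor(2^k / y) using only multiplications, shifts and additions. The name `reciprocal_newton`, the fixed k = 30 and the restriction 0 < y < 2^30 are assumptions made for this example; it is not GMP's implementation. A bignum version would apply the same recurrence to limb arrays and double the working precision on each step.

```cpp
#include <cstdint>

// Sketch: returns floor(2^30 / y) without using a division instruction.
// Assumption for this example: 0 < y < 2^30, so every product fits in 64 bits.
std::uint64_t reciprocal_newton(std::uint32_t y) {
    const int k = 30;
    const std::uint64_t one = std::uint64_t{1} << k;   // fixed-point 1.0

    int bits = 0;                                       // bit length of y
    while ((y >> bits) != 0) ++bits;

    // Initial guess 2^(k - bits): never above 2^k / y, at most a factor 2 below it.
    std::uint64_t m = one >> bits;

    for (;;) {
        // Newton step for the reciprocal: m <- m * (2 - y*m / 2^k).
        // Starting from below, the iterates never exceed 2^k / y.
        std::uint64_t next = (m * (2 * one - y * m)) >> k;
        if (next == m) break;                           // no further progress
        m = next;
    }

    // Truncation can leave m slightly too small; repair with a few increments.
    while ((m + 1) * y <= one) ++m;
    return m;
}
```

The relative error roughly squares on each step, which is why the total cost of the bignum version is dominated by the last, full-precision multiplications.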

In more detail, the core algorithm behind most division algorithms is an "estimated quotient with reduction" calculation, computing (q,r) so that

x = qy + r

but without the restriction that 0 <= r < y. The typical loop is

  • Estimate the quotient q of x/y
  • Compute the corresponding reduction r = x - qy
  • Optionally adjust the quotient so that the reduction r is in some desired interval
  • If r is too big, then repeat with r in place of x.

The quotient of x/y will be the sum of all the qs produced, and the final value of r will be the true remainder.
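
The loop above can be sketched on plain 64-bit integers. This is only an illustration under stated assumptions: `estimate_quotient` and `divide` are hypothetical names, y must be nonzero, and the estimator deliberately looks at only the leading 8 bits of y so that several iterations are needed. A bignum version would estimate from the leading limbs instead.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical estimator: uses only the leading (at most 8) bits of y.
// Rounding the truncated divisor up guarantees the result never exceeds x / y.
std::uint64_t estimate_quotient(std::uint64_t x, std::uint64_t y) {
    int shift = 0;
    while ((y >> shift) >= 256) ++shift;
    std::uint64_t y_hi = (y >> shift) + 1;
    return (x >> shift) / y_hi;
}

// Returns {quotient, remainder} of x / y. Assumes y != 0.
std::pair<std::uint64_t, std::uint64_t> divide(std::uint64_t x, std::uint64_t y) {
    std::uint64_t quotient = 0;
    std::uint64_t r = x;
    while (r >= y) {                          // "r is too big": repeat with r in place of x
        std::uint64_t q = estimate_quotient(r, y);
        if (q == 0) q = 1;                    // adjustment step: r >= y, so the quotient is at least 1
        r -= q * y;                           // reduction r = x - q*y (q*y <= r, no underflow)
        quotient += q;                        // the final quotient is the sum of all estimates
    }
    return {quotient, r};
}
```

Because the estimate is always an underestimate, r stays non-negative and the adjustment only has to handle an estimate of zero.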

Schoolbook long division, for example, is of this form: the third step covers those cases where the digit you guessed was too big or too small, and you adjust it to get the right value.

The divide and conquer approach estimates the quotient of x/y by computing x'/y', where x' and y' are the leading digits of x and y. There is a lot of room for optimization by adjusting their sizes, but IIRC you get the best results if x' has twice as many digits as y'.

The multiply-by-inverse approach is, IMO, the simplest if you stick to integer arithmetic. The basic method is

  • Estimate the inverse of y with m = floor(2^k / y)
  • Estimate x/y with q = 2^(i+j-k) floor(floor(x / 2^i) m / 2^j)

In fact, practical implementations can tolerate additional error in m if it means you can use a faster reciprocal implementation.
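
A minimal single-word sketch of the multiply-by-inverse method, assuming 32-bit inputs so that every intermediate fits in 64 bits. The function name `divide_by_reciprocal` and the choice k = 32, i = 0, j = k are assumptions made for this example. Here m is obtained with a hardware division purely for brevity; in a bignum setting it would come from a reciprocal routine and be reused across steps.

```cpp
#include <cstdint>
#include <utility>

// Returns {quotient, remainder} of x / y. Assumes y != 0.
std::pair<std::uint64_t, std::uint64_t> divide_by_reciprocal(std::uint32_t x, std::uint32_t y) {
    const int k = 32;
    std::uint64_t m = (std::uint64_t{1} << k) / y;   // m = floor(2^k / y)
    std::uint64_t q = (std::uint64_t{x} * m) >> k;   // q = floor(x * m / 2^k), with i = 0 and j = k
    std::uint64_t r = x - q * y;                     // reduction; q never exceeds the true quotient
    while (r >= y) {                                 // q is short by at most 1 for these sizes
        r -= y;
        ++q;
    }
    return {q, r};
}
```

The final correction loop is what lets the method tolerate a slightly inaccurate m: an underestimated quotient only costs a few extra subtractions.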

The error is a pain to analyze, but if I recall the way to do it, you want to choose i and j so that x ~ 2^(i+j) due to how errors accumulate, and you want to choose x / 2^i ~ m^2 to minimize the overall work.

The ensuing reduction will have r ~ max(x/m, y), which gives a rule of thumb for choosing k: you want the size of m to be about the number of bits of quotient you compute per iteration, or equivalently the number of bits you want to remove from x per iteration.

#2


I do not know the multiplicative inverse algorithm, but it sounds like a modification of Montgomery reduction or Barrett reduction.

I do bigint divisions a bit differently.

See bignum division. In particular, take a look at the approximation divider and the 2 links there. One is my fixed-point divider and the others are fast multiplication algorithms (like Karatsuba and Schönhage-Strassen on NTT) with measurements, plus a link to my very fast NTT implementation for a 32-bit base.

I'm not sure if multiplying by the inverse is the way to go.

It is mostly used for modulo operations where the divisor is constant. I'm afraid that for arbitrary divisions the time and operations needed to obtain the bigint inverse can be bigger than the standard division itself, but as I am not familiar with it I could be wrong.

The most common divider I have seen in implementations is Newton–Raphson division, which is very similar to the approximation divider in the link above.

Approximation/iterative dividers usually rely on multiplication, which determines their speed.

For small enough numbers, binary long division and division with 32/64-bit digits are usually fast enough, if not the fastest, because they have small overhead. In what follows, let n be the maximum value processed (not the number of digits!).

Binary division example:

The complexity is O(log32(n).log2(n)) = O(log^2(n)), where log32(n) stands for the number of 32-bit words and log2(n) for the number of bits.
It loops through all significant bits. In each iteration you need to compare, subtract, add and bit-shift. Each of those operations can be done in log32(n) steps, and there are log2(n) iterations.

Here is an example of binary division from one of my bigint templates (C++):

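template <DWORD N> void uint<N>::div(uint &c,uint &d,uint a,uint b)
    {
    // c = a / b (quotient), d = a % b (remainder); division by zero is not checked
    int j,sh;
    c=DWORD(0); d=1;                    // d holds the weight 2^sh of the current quotient bit
    sh=a.bits()-b.bits();               // align the top bit of b with the top bit of a
    if (sh<0) sh=0; else { b<<=sh; d<<=sh; }
    for (;;)
        {
        j=geq(a,b);                     // 0: a<b, 1: a>b, 2: a==b
        if (j)
            {
            c+=d;                       // set this quotient bit
            sub(a,a,b);                 // a -= b
            if (j==2) break;            // a is now zero, nothing left to divide
            }
        if (!sh) break;                 // all bit positions processed
        b>>=1; d>>=1; sh--;
        }
    d=a;                                // what is left of a is the remainder
    }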

N is the number of 32 bit DWORDs used to store a bigint number.

  • c = a / b
  • d = a % b
  • geq(a,b) is a comparison: a >= b, greater or equal (done in log32(n)=N)
    It returns 0 for a < b, 1 for a > b, 2 for a == b
  • sub(c,a,b) is c = a - b

The speed comes from the fact that this does not use multiplication (if you do not count the bit shifts).
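
For readers without the bigint template at hand, the same shift-and-subtract structure can be sketched on a plain 64-bit integer. The names `bit_length` and `binary_divide` are hypothetical and the divisor is assumed to be nonzero; the bigint version replaces each primitive operation with its multi-word counterpart.

```cpp
#include <cstdint>
#include <utility>

// Number of significant bits in v (0 for v == 0).
int bit_length(std::uint64_t v) {
    int n = 0;
    while (v != 0) { ++n; v >>= 1; }
    return n;
}

// Returns {a / b, a % b} using only compare, subtract, add and shift. Assumes b != 0.
std::pair<std::uint64_t, std::uint64_t> binary_divide(std::uint64_t a, std::uint64_t b) {
    int sh = bit_length(a) - bit_length(b);
    if (sh < 0) return {0, a};                 // a < b: quotient 0, remainder a
    std::uint64_t q = 0;
    std::uint64_t d = std::uint64_t{1} << sh;  // weight of the current quotient bit
    b <<= sh;                                  // align the top bits of a and b
    for (;;) {
        if (a >= b) { q += d; a -= b; }        // this quotient bit is 1
        if (sh == 0) break;                    // all bit positions processed
        b >>= 1; d >>= 1; --sh;
    }
    return {q, a};                             // what is left of a is the remainder
}
```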

If you use digits with a big base like 2^32 (ALU-sized blocks), then you can rewrite the whole thing in a polynomial-like style using the built-in 32-bit ALU operations.
This is usually even faster than binary long division. The idea is to process each DWORD as a single digit, or to recursively halve the width of the arithmetic used until it matches the CPU's capabilities.
See division by half-bitwidth arithmetics.
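
To show what "each DWORD as a single digit" looks like in code, here is a minimal sketch of the simplest case: dividing a base-2^32 number by one nonzero digit, using a 64-bit intermediate. The little-endian `std::vector` layout and the name `div_small` are assumptions made for this example; the full multi-digit divisor case needs a quotient-digit estimate on top of this.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Divides the little-endian base-2^32 number `a` in place by the single digit d
// and returns the remainder. Assumes d != 0.
std::uint32_t div_small(std::vector<std::uint32_t>& a, std::uint32_t d) {
    std::uint64_t rem = 0;
    for (std::size_t i = a.size(); i-- > 0;) {           // most significant digit first
        std::uint64_t cur = (rem << 32) | a[i];          // two-digit partial dividend
        a[i] = static_cast<std::uint32_t>(cur / d);      // fits in one digit because rem < d
        rem = cur % d;
    }
    return static_cast<std::uint32_t>(rem);
}
```

Each loop iteration retires 32 bits of the dividend at once, compared with one bit per iteration in the binary version.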

On top of all that, keep two things in mind while computing with bignums:

  • If you have optimized basic operations, the effective complexity can drop even further, because sub-results get smaller with each iteration (which changes the cost of the basic operations). NTT-based multiplications are a nice example of that.
  • The overhead can mess things up.

Because of this, the measured runtime sometimes does not follow the big-O complexity, so you should always measure the thresholds and use the faster approach for the bit count in use, to get the best performance and optimize what you can.
