gcc intrinsic for extended division/multiplication

Modern CPUs can perform extended multiplication of two native-size words and store the low and high halves of the result in separate registers. Similarly, when performing division, they store the quotient and the remainder in two different registers instead of discarding the unwanted part.

Is there some sort of portable gcc intrinsic which would take the following signature:

void extmul(size_t a, size_t b, size_t *lo, size_t *hi);

Or something like that, and for division:

void extdiv(size_t a, size_t b, size_t *q, size_t *r);

I know I could do it myself with inline assembly and shoehorn portability into it by scattering #ifdefs through the code, or I could emulate the multiplication part using partial sums (which would be significantly slower), but I would like to avoid that for readability. Surely there exists some built-in function to do this?
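For reference, the partial-sum emulation mentioned above might look roughly like the sketch below. It assumes 64-bit operands split into 32-bit halves; the name mul64_portable is hypothetical and not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: 64x64 -> 128-bit unsigned multiply built from
   four 32x32 -> 64-bit partial products. */
void mul64_portable(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* contributes to bits   0..63  */
    uint64_t p1 = a_lo * b_hi;   /* contributes to bits  32..95  */
    uint64_t p2 = a_hi * b_lo;   /* contributes to bits  32..95  */
    uint64_t p3 = a_hi * b_hi;   /* contributes to bits  64..127 */

    /* Sum everything that lands in bits 32..63; this cannot overflow
       64 bits because each term is below 2^32. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    *lo = (mid << 32) | (uint32_t)p0;
    *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

This needs four multiplications and several additions where the hardware needs a single instruction, which is the slowdown I would like to avoid.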

2 Solutions

#1

Since version 4.6, gcc lets you use __int128. This works on most 64-bit hardware.

To get the 128-bit result of a 64x64-bit multiplication, just use

void extmul(size_t a, size_t b, size_t *lo, size_t *hi) {
    /* unsigned on purpose: the full product can exceed the signed __int128 range */
    unsigned __int128 result = (unsigned __int128)a * b;
    *lo = (size_t)result;
    *hi = (size_t)(result >> 64);
}

On x86_64 gcc is smart enough to compile this to

   0:   48 89 f8                mov    %rdi,%rax
   3:   49 89 d0                mov    %rdx,%r8
   6:   48 f7 e6                mul    %rsi
   9:   49 89 00                mov    %rax,(%r8)
   c:   48 89 11                mov    %rdx,(%rcx)
   f:   c3                      retq

No native 128 bit support or similar required, and after inlining only the mul instruction remains.

Edit: On a 32-bit arch this works in a similar way: you need to replace __int128 by uint64_t and the shift width by 32. The optimization works even on older versions of gcc.
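A minimal sketch of that 32-bit variant, assuming the operands are uint32_t (the name extmul32 is hypothetical):

```c
#include <stdint.h>

/* Hypothetical 32-bit counterpart of extmul: 32x32 -> 64-bit multiply. */
void extmul32(uint32_t a, uint32_t b, uint32_t *lo, uint32_t *hi) {
    uint64_t result = (uint64_t)a * b;  /* widen one operand before multiplying */
    *lo = (uint32_t)result;             /* low 32 bits */
    *hi = (uint32_t)(result >> 32);     /* high 32 bits */
}
```

The cast has to be applied to an operand rather than to the product; otherwise the multiplication is done in 32 bits and the high half is lost before the widening happens.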

#2

For those wondering about the other half of the question (division): gcc does not provide an intrinsic for that, because the processor's division instructions do not behave the way the C standard requires.

This is true both with 128-bit dividends on 64-bit x86 targets and with 64-bit dividends on 32-bit x86 targets. The problem is that DIV raises a divide overflow exception whenever the quotient does not fit in the destination register, in cases where the standard says the result should be truncated. For example, (unsigned long long) (((unsigned __int128) 1 << 64) / 1) should evaluate to 0, but would cause a divide overflow exception if evaluated with DIV.

(Thanks to @ross-ridge for this info)
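As a rough sketch of what remains possible in plain C: for the same-width signature from the question, writing / and % on the same operands is usually enough, since an optimizing gcc typically folds the pair into one division instruction (worth checking in the generated assembly rather than assuming). The second function reproduces the truncation example above; div128_trunc is a hypothetical name, and it assumes a 64-bit target where unsigned __int128 is available.

```c
#include <stddef.h>

/* Same-width division: quotient and remainder of a / b (b must be nonzero). */
void extdiv(size_t a, size_t b, size_t *q, size_t *r) {
    *q = a / b;
    *r = a % b;
}

/* Hypothetical helper: divide the 128-bit value hi:lo by b and truncate
   the quotient to 64 bits, as the C conversion rules require.
   b must be nonzero. */
unsigned long long div128_trunc(unsigned long long hi, unsigned long long lo,
                                unsigned long long b) {
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    return (unsigned long long)(n / b);
}
```

With hi = 1, lo = 0 and b = 1 the second function returns 0, which is exactly the case a bare DIV could not handle.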
