I don't want to optimize anything, I swear, I just want to ask this question out of curiosity. I know that on most hardware there's an assembly instruction for bit-shifting (e.g. shl, shr), which is a single instruction. But does it matter (nanosecond-wise, or CPU-cycle-wise) how many bits you shift? In other words, is either of the following faster on any CPU?
x << 1;
and
x << 10;
And please don't hate me for this question. :)
9 Answers
#1
83
Potentially depends on the CPU.
However, all modern CPUs (x86, ARM) use a "barrel shifter" -- a hardware module specifically designed to perform arbitrary shifts in constant time.
So the bottom line is... no. No difference.
#2
62
Some embedded processors only have a "shift-by-one" instruction. On such processors, the compiler would change x << 3 into ((x << 1) << 1) << 1.
I think the Motorola MC68HCxx was one of the more popular families with this limitation. Fortunately, such architectures are now quite rare; most now include a barrel shifter with a variable shift size.
The Intel 8051, which has many modern derivatives, also cannot shift an arbitrary number of bits.
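The expansion described above can be sketched in C. This is a hand-written illustration of what such a compiler might emit (shl3_by_ones is a made-up name), not output from any real toolchain:

```c
#include <stdint.h>

/* Hypothetical expansion for hardware with only a shift-by-one
   instruction: x << 3 is performed as three single-bit shifts. */
uint16_t shl3_by_ones(uint16_t x) {
    x = (uint16_t)(x << 1);
    x = (uint16_t)(x << 1);
    x = (uint16_t)(x << 1);
    return x;   /* same value as (uint16_t)(x << 3) */
}
```

On such a CPU the cost grows linearly with the shift count, which is exactly why the question's x << 10 would be slower than x << 1 there.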
#3
28
There are several cases to consider here.
-
Many high-speed MPUs have a barrel shifter, a multiplexer-like electronic circuit which performs any shift in constant time.
-
If the MPU has only a 1-bit shift, x << 10 would normally be slower, as it is mostly done by 10 single-bit shifts, or by byte copying with 2 shifts.
-
But there is a known common case where x << 10 would be even faster than x << 1. If x is 16 bits, only the lower 6 bits of it matter (all others will be shifted out), so the MPU needs to load only the lower byte, thus making only a single access cycle to 8-bit memory, while x << 1 needs two access cycles. If the access cycle is slower than the shift (and clearing the lower byte), x << 10 will be faster. This may apply to microcontrollers with fast onboard program ROM while accessing slow external data RAM.
-
As an addition to case 3, the compiler may care about the number of significant bits in x << 10 and optimize further operations to lower-width ones, like replacing a 16x16 multiplication with a 16x8 one (as the lower byte is always zero).
Note, some microcontrollers have no shift-left instruction at all; they use add x,x instead.
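That add x,x trick can be sketched in C (shl_by_add is a made-up name for illustration): repeated self-addition doubles the value, which is exactly a 1-bit left shift per step.

```c
#include <stdint.h>

/* Sketch: left shift via repeated addition, as a microcontroller
   without a shift-left instruction would do it (add x,x == x << 1). */
uint16_t shl_by_add(uint16_t x, unsigned n) {
    while (n--)
        x = (uint16_t)(x + x);  /* one "add x,x" per bit of shift count */
    return x;
}
```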
#4
9
On ARM, this can be done as a side effect of another instruction. So potentially, there's no latency at all for either of them.
#5
9
Here's my favorite CPU, in which x<<2 takes twice as long as x<<1 :)
#6
7
That depends both on the CPU and the compiler. Even if the underlying CPU has arbitrary bit shifts with a barrel shifter, this will only happen if the compiler takes advantage of that resource.
Keep in mind that shifting by anything outside the width in bits of the data is "undefined behavior" in C and C++. Right shift of signed data is also "implementation-defined". Rather than worrying too much about speed, be concerned that you are getting the same answer on different implementations.
Quoting from ANSI C section 3.3.7:
3.3.7 Bitwise shift operators
Syntax
shift-expression:
    additive-expression
    shift-expression << additive-expression
    shift-expression >> additive-expression
Constraints
Each of the operands shall have integral type.
Semantics
The integral promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width in bits of the promoted left operand, the behavior is undefined.
The result of E1 << E2 is E1 left-shifted E2 bit positions; vacated bits are filled with zeros. If E1 has an unsigned type, the value of the result is E1 multiplied by the quantity, 2 raised to the power E2, reduced modulo ULONG_MAX+1 if E1 has type unsigned long, UINT_MAX+1 otherwise. (The constants ULONG_MAX and UINT_MAX are defined in the header <limits.h>.)
The result of E1 >> E2 is E1 right-shifted E2 bit positions. If E1 has an unsigned type or if E1 has a signed type and a nonnegative value, the value of the result is the integral part of the quotient of E1 divided by the quantity, 2 raised to the power E2 . If E1 has a signed type and a negative value, the resulting value is implementation-defined.
So:
x = y << z;
"<<": y × 2^z (undefined if an overflow occurs);
x = y >> z;
">>": implementation-defined for signed (most often the result of the arithmetic shift: y / 2^z).
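The rules quoted above can be captured in a minimal C sketch (shl_checked is a made-up name): the unsigned left shift is well-defined only when the count is less than the type's width.

```c
/* Well-defined unsigned left shift: the count must be strictly less
   than the bit width of unsigned. The result is y * 2^z, reduced
   modulo (UINT_MAX + 1). */
unsigned shl_checked(unsigned y, unsigned z) {
    /* A count >= the width (e.g. y << 32 where unsigned is 32 bits)
       would be undefined behavior, so callers must keep z in range. */
    return y << z;
}
```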
#7
7
It is conceivable that, on an 8-bit processor, x<<1 could actually be much slower than x<<10 for a 16-bit value.
For example, a reasonable translation of x<<1 may be:
byte1 = (byte1 << 1) | (byte2 >> 7)
byte2 = (byte2 << 1)
whereas x<<10 would be simpler:
byte1 = (byte2 << 2)
byte2 = 0
Notice how x<<1 shifts more often and even farther than x<<10. Furthermore, the result of x<<10 doesn't depend on the content of byte1. This could speed up the operation additionally.
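The two expansions above can be checked in C. Here hi/lo model byte1/byte2 of a 16-bit value held in two 8-bit halves; the function names are made up for illustration:

```c
#include <stdint.h>

/* x << 1 on a 16-bit value stored as two bytes: both bytes move,
   and one bit carries from the low byte into the high byte. */
uint16_t shl1_bytes(uint8_t hi, uint8_t lo) {
    uint8_t new_hi = (uint8_t)((hi << 1) | (lo >> 7));
    uint8_t new_lo = (uint8_t)(lo << 1);
    return (uint16_t)((new_hi << 8) | new_lo);
}

/* x << 10: the old high byte is discarded entirely; the low byte
   shifted by 2 becomes the new high byte, and the new low byte is 0. */
uint16_t shl10_bytes(uint8_t hi, uint8_t lo) {
    (void)hi;                          /* result never depends on byte1 */
    uint8_t new_hi = (uint8_t)(lo << 2);
    return (uint16_t)(new_hi << 8);
}
```

Both match the full 16-bit shifts, and shl10_bytes touches only one input byte, which is the point of the answer.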
#8
5
On some generations of Intel CPUs (P2 or P3? Not AMD though, if I remember right), the bitshift operations are ridiculously slow. Bitshift by 1 bit should always be fast though, since it can just use addition. Another question to consider is whether bitshifts by a constant number of bits are faster than variable-length shifts. Even if the opcodes are the same speed, on x86 the nonconstant righthand operand of a bitshift must occupy the CL register, which imposes additional constraints on register allocation and may slow the program down that way too.
#9
3
As always, it depends on the surrounding code context: e.g. are you using x<<1 as an array index? Or adding it to something else? In either case, small shift counts (1 or 2) can often optimize away even more than if the compiler ends up having to just shift. Not to mention the whole throughput vs. latency vs. front-end bottlenecks tradeoff. Performance of a tiny fragment is not one-dimensional.
A hardware shift instruction is not a compiler's only option for compiling x<<1, but the other answers mostly assume that.
x << 1 is exactly equivalent to x+x for unsigned, and for 2's complement signed integers. Compilers always know what hardware they're targeting while they're compiling, so they can take advantage of tricks like this.
On Intel Haswell, add has 4 per clock throughput, but shl with an immediate count has only 2 per clock throughput. (See http://agner.org/optimize/ for instruction tables, and other links in the x86 tag wiki.) SIMD vector shifts are 1 per clock (2 in Skylake), but SIMD vector integer adds are 2 per clock (3 in Skylake). Latency is the same, though: 1 cycle.
There's also a special shift-by-one encoding of shl where the count is implicit in the opcode. 8086 didn't have immediate-count shifts, only by-one and by cl register. This is mostly relevant for right-shifts, because you can just add for left shifts, unless you're shifting a memory operand. But if the value is needed later, it's better to load into a register first. Anyway, shl eax,1 or add eax,eax is one byte shorter than shl eax,10, and code size can directly (decode / front-end bottlenecks) or indirectly (L1I code cache misses) affect performance.
More generally, small shift counts can sometimes be optimized into a scaled index in an addressing mode on x86. Most other architectures in common use these days are RISC, and don't have scaled-index addressing modes, but x86 is a common enough architecture for this to be worth mentioning. (e.g. if you're indexing an array of 4-byte elements, there's room to increase the scale factor by 1 for int arr[]; arr[x<<1].)
Needing to copy+shift is common in situations where the original value of x is still needed. But most x86 integer instructions operate in-place. (The destination is one of the sources for instructions like add or shl.) The x86-64 System V calling convention passes args in registers, with the first arg in edi and the return value in eax, so a function that returns x<<10 also makes the compiler emit copy+shift code.
The LEA instruction lets you shift-and-add (with a shift count of 0 to 3, because it uses addressing-mode machine encoding). It puts the result in a separate register.
gcc and clang both optimize these functions the same way, as you can see on the Godbolt compiler explorer:
int shl1(int x) { return x<<1; }
lea eax, [rdi+rdi] # 1 cycle latency, 1 uop
ret
int shl2(int x) { return x<<2; }
lea eax, [4*rdi] # longer encoding: needs a disp32 of 0 because there's no base register, only scaled-index.
ret
int times5(int x) { return x * 5; }
lea eax, [rdi + 4*rdi]
ret
int shl10(int x) { return x<<10; }
mov eax, edi # 1 uop, 0 or 1 cycle latency
shl eax, 10 # 1 uop, 1 cycle latency
ret
LEA with 2 components has 1 cycle latency and 2-per-clock throughput on recent Intel and AMD CPUs (Sandybridge-family and Bulldozer/Ryzen). On Intel, it's only 1 per clock throughput with 3c latency for lea eax, [rdi + rsi + 123]. (Related: Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture? goes into this in detail.)
Anyway, copy+shift by 10 needs a separate mov instruction. It might be zero latency on many recent CPUs, but it still takes front-end bandwidth and code size. (Can x86's MOV really be "free"? Why can't I reproduce this at all?)
Also related: How to multiply a register by 37 using only 2 consecutive leal instructions in x86?.
The compiler is also free to transform the surrounding code so there isn't an actual shift, or it's combined with other operations.
For example, if(x<<1) { } could use an and to check all bits except the high bit. On x86, you'd use a test instruction, like test eax, 0x7fffffff / jz .false instead of shl eax,1 / jz. This optimization works for any shift count, and it also works on machines where large-count shifts are slow (like Pentium 4) or non-existent (some microcontrollers).
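In C terms, the transformation is valid because the truth value of x<<1 for a 32-bit unsigned x depends only on bits 0..30. A sketch, with a made-up function name:

```c
#include <stdint.h>

/* Branching on (x << 1) != 0 only asks whether any bit below the top
   bit is set, so a mask test gives the same answer with no shift. */
int shl1_nonzero(uint32_t x) {
    return (x & 0x7fffffffu) != 0;  /* same truth value as (x << 1) != 0 */
}
```

Note that the top bit alone contributes nothing: 0x80000000 << 1 wraps to 0, and the mask test correctly reports 0 for it too.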
Many ISAs have bit-manipulation instructions beyond just shifting. e.g. PowerPC has a lot of bit-field extract / insert instructions. Or ARM has shifts of source operands as part of any other instruction. (So shift/rotate instructions are just a special form of move, using a shifted source.)
Remember, C is not assembly language. Always look at optimized compiler output when you're tuning your source code to compile efficiently.