Why is memcmp(a, b, size)
so much faster than:
为什么memcmp(a,b,size)比以下快得多:
for(i = 0; i < nelements; i++) {
if a[i] != b[i] return 0;
}
return 1;
Is memcmp a CPU instruction or something? It must be pretty deep because I got a massive speedup using memcmp
over the loop.
memcmp是CPU指令还是什么?它必须非常深,因为我在循环中使用memcmp获得了大量的加速。
4 个解决方案
#1
28
memcmp
is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
memcmp通常在汇编中实现,以利用许多特定于体系结构的特性,这可以使它比C中的简单循环快得多。
As a "builtin"
GCC supports memcmp
(as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp
will be recognized as __builtin_memcmp
. Instead of emitting a call
to the memcmp
library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
GCC支持memcmp(以及许多其他功能)作为内置。在GCC的某些版本/配置中,对memcmp的调用将被识别为__builtin_memcmp。 GCC不会发出对memcmp库函数的调用,而是发出一些指令作为函数的优化内联版本。
On x86, this leverages the use of the cmpsb
instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe
prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp
does).
在x86上,这利用了cmpsb指令,该指令将一个内存位置的字节串与另一个内存位置进行比较。这与重复前缀相结合,因此比较字符串直到它们不再相等,或者计数耗尽。 (正是memcmp的作用)。
Given the following code:
给出以下代码:
int test(const void* s1, const void* s2, int count)
{
return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4
on Cygwin generates the following assembly:
Cygwin上的gcc版本3.4.4生成以下程序集:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Special case: If length parameter to memcmp is
; zero, don't compare any bytes.
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat this while equal ZF is set
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Reference:
参考:
cmpsb
instruction- cmpsb指令
As a library function
Highly-optimized versions of memcmp
exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
许多C标准库中都存在高度优化的memcmp版本。这些通常会利用特定于体系结构的指令来并行处理大量数据。
In Glibc, there are versions of memcmp
for x86_64 that can take advantage of the following instruction set extensions:
在Glibc中,有x86_64的memcmp版本可以利用以下指令集扩展:
-
SSE2 -
sysdeps/x86_64/memcmp.S
- SSE2 - sysdeps / x86_64 / memcmp.S
-
SSE4 -
sysdeps/x86_64/multiarch/memcmp-sse4.S
- SSE4 - sysdeps / x86_64 / multiarch / memcmp-sse4.S
-
SSSE3 -
sysdeps/x86_64/multiarch/memcmp-ssse3.S
- SSSE3 - sysdeps / x86_64 / multiarch / memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S
:
很酷的部分是glibc将检测(在运行时)CPU的最新指令集,并执行为其优化的版本。请参阅sysdeps / x86_64 / multiarch / memcmp.S中的以下代码段:
ENTRY(memcmp)
.type memcmp, @gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
In the Linux kernel
Linux does not seem to have an optimized version of memcmp
for x86_64, but it does for memcpy
, in arch/x86/lib/memcpy_64.S
. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c
) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.
Linux似乎没有针对x86_64的memcmp的优化版本,但是对于memcpy,它在arch / x86 / lib / memcpy_64.S中。请注意,它使用备用基础结构(arch / x86 / kernel / alternative.c),不仅在运行时决定使用哪个版本,而且实际上修补自身仅在启动时做出此决定。
#2
0
It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
它通常是一个编译器内在函数,它被转换为快速汇编,并带有用于比较内存块的专用指令。
内在的memcmp
#3
0
Is memcmp a CPU instruction or something?
memcmp是CPU指令还是什么?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.
它至少是一个非常高度优化的编译器提供的内部函数。可能是一台或两台机器指令,具体取决于您未指定的平台。
#4
-2
Yes, on intel hardware, there's a single assembly instruction for such a loop. The runtime will use that. (I don't exactly remember, it was something like rep cmps[b|w]
, depending also on the datasize)
是的,在intel硬件上,有一个用于这种循环的汇编指令。运行时将使用它。 (我不记得,它有点像rep cmps [b | w],还取决于数据量)
#1
28
memcmp
is often implemented in assembly to take advantage of a number of architecture-specific features, which can make it much faster than a simple loop in C.
memcmp通常在汇编中实现,以利用许多特定于体系结构的特性,这可以使它比C中的简单循环快得多。
As a "builtin"
GCC supports memcmp
(as well as a ton of other functions) as builtins. In some versions / configurations of GCC, a call to memcmp
will be recognized as __builtin_memcmp
. Instead of emitting a call
to the memcmp
library function, GCC will emit a handful of instructions to act as an optimized inline version of the function.
GCC支持memcmp(以及许多其他功能)作为内置。在GCC的某些版本/配置中,对memcmp的调用将被识别为__builtin_memcmp。 GCC不会发出对memcmp库函数的调用,而是发出一些指令作为函数的优化内联版本。
On x86, this leverages the use of the cmpsb
instruction, which compares a string of bytes at one memory location to another. This is coupled with the repe
prefix, so the strings are compared until they are no longer equal, or a count is exhausted. (Exactly what memcmp
does).
在x86上,这利用了cmpsb指令,该指令将一个内存位置的字节串与另一个内存位置进行比较。这与重复前缀相结合,因此比较字符串直到它们不再相等,或者计数耗尽。 (正是memcmp的作用)。
Given the following code:
给出以下代码:
int test(const void* s1, const void* s2, int count)
{
return memcmp(s1, s2, count) == 0;
}
gcc version 3.4.4
on Cygwin generates the following assembly:
Cygwin上的gcc版本3.4.4生成以下程序集:
; (prologue)
mov esi, [ebp+arg_0] ; Move first pointer to esi
mov edi, [ebp+arg_4] ; Move second pointer to edi
mov ecx, [ebp+arg_8] ; Move length to ecx
cld ; Clear DF, the direction flag, so comparisons happen
; at increasing addresses
cmp ecx, ecx ; Special case: If length parameter to memcmp is
; zero, don't compare any bytes.
repe cmpsb ; Compare bytes at DS:ESI and ES:EDI, setting flags
; Repeat this while equal ZF is set
setz al ; Set al (return value) to 1 if ZF is still set
; (all bytes were equal).
; (epilogue)
Reference:
参考:
cmpsb
instruction- cmpsb指令
As a library function
Highly-optimized versions of memcmp
exist in many C standard libraries. These will usually take advantage of architecture-specific instructions to work with lots of data in parallel.
许多C标准库中都存在高度优化的memcmp版本。这些通常会利用特定于体系结构的指令来并行处理大量数据。
In Glibc, there are versions of memcmp
for x86_64 that can take advantage of the following instruction set extensions:
在Glibc中,有x86_64的memcmp版本可以利用以下指令集扩展:
-
SSE2 -
sysdeps/x86_64/memcmp.S
- SSE2 - sysdeps / x86_64 / memcmp.S
-
SSE4 -
sysdeps/x86_64/multiarch/memcmp-sse4.S
- SSE4 - sysdeps / x86_64 / multiarch / memcmp-sse4.S
-
SSSE3 -
sysdeps/x86_64/multiarch/memcmp-ssse3.S
- SSSE3 - sysdeps / x86_64 / multiarch / memcmp-ssse3.S
The cool part is that glibc will detect (at run-time) the newest instruction set your CPU has, and execute the version optimized for it. See this snippet from sysdeps/x86_64/multiarch/memcmp.S
:
很酷的部分是glibc将检测(在运行时)CPU的最新指令集,并执行为其优化的版本。请参阅sysdeps / x86_64 / multiarch / memcmp.S中的以下代码段:
ENTRY(memcmp)
.type memcmp, @gnu_indirect_function
LOAD_RTLD_GLOBAL_RO_RDX
HAS_CPU_FEATURE (SSSE3)
jnz 2f
leaq __memcmp_sse2(%rip), %rax
ret
2: HAS_CPU_FEATURE (SSE4_1)
jz 3f
leaq __memcmp_sse4_1(%rip), %rax
ret
3: leaq __memcmp_ssse3(%rip), %rax
ret
END(memcmp)
In the Linux kernel
Linux does not seem to have an optimized version of memcmp
for x86_64, but it does for memcpy
, in arch/x86/lib/memcpy_64.S
. Note that is uses the alternatives infrastructure (arch/x86/kernel/alternative.c
) for not only deciding at runtime which version to use, but actually patching itself to only make this decision once at boot-up.
Linux似乎没有针对x86_64的memcmp的优化版本,但是对于memcpy,它在arch / x86 / lib / memcpy_64.S中。请注意,它使用备用基础结构(arch / x86 / kernel / alternative.c),不仅在运行时决定使用哪个版本,而且实际上修补自身仅在启动时做出此决定。
#2
0
It's usually a compiler intrinsic that is translated into fast assembly with specialized instructions for comparing blocks of memory.
它通常是一个编译器内在函数,它被转换为快速汇编,并带有用于比较内存块的专用指令。
内在的memcmp
#3
0
Is memcmp a CPU instruction or something?
memcmp是CPU指令还是什么?
It is at least a very highly optimized compiler-provided intrinsic function. Possibly a single machine instruction, or two, depending on the platform, which you haven't specified.
它至少是一个非常高度优化的编译器提供的内部函数。可能是一台或两台机器指令,具体取决于您未指定的平台。
#4
-2
Yes, on intel hardware, there's a single assembly instruction for such a loop. The runtime will use that. (I don't exactly remember, it was something like rep cmps[b|w]
, depending also on the datasize)
是的,在intel硬件上,有一个用于这种循环的汇编指令。运行时将使用它。 (我不记得,它有点像rep cmps [b | w],还取决于数据量)