I need a comparison function for blocks of memory for doing binary searches on arrays of bytes in the D programming language. It does not need to have any useful semantics. It only needs to be fast and be a valid comparison function (one that produces a total ordering). The blocks of memory to be compared are already known to be the same length.
我需要一个内存块的比较函数,用于在D编程语言中对字节数组进行二进制搜索。它不需要任何有用的语义。它只需要快速并且是一个有效的比较函数(一个生成总排序的函数)。要比较的内存块的长度已知相同。
C's memcmp
is actually pretty slow because it tries to preserve useful string comparison semantics, which I don't need. Below is the best I've come up with so far. Does anyone know of anything better, preferably w/o dipping into non-portable CPU-specific instructions?
C的memcmp实际上非常慢,因为它试图保留有用的字符串比较语义,我不需要它。下面是我迄今为止所提出的最好的建议。有没有人知道更好的方法,最好是使用非便携式cpu专用指令?
// Faster than C's memcmp because it doesn't preserve any meaningful
// semantics. It's just a completely arbitrary, but really fast,
// comparison function.
int memoryCompare(const(void)* lhs, const(void)* rhs, size_t n) {
for(; n >= uint.sizeof; n -= uint.sizeof) {
if( *(cast(uint*) lhs) < *(cast(uint*) rhs)) {
return -1;
} else if( *(cast(uint*) lhs) > *(cast(uint*) rhs)) {
return 1;
}
lhs += uint.sizeof;
rhs += uint.sizeof;
}
for(; n >= ubyte.sizeof; n -= ubyte.sizeof) {
if( *(cast(ubyte*) lhs) < *(cast(ubyte*) rhs)) {
return -1;
} else if( *(cast(ubyte*) lhs) > *(cast(ubyte*) rhs)) {
return 1;
}
lhs += ubyte.sizeof;
rhs += ubyte.sizeof;
}
return 0;
}
Edit: I've read up on SSE and I don't want to use it for 3 reasons:
编辑:我读过SSE,我不想用它有三个原因:
- It's not portable.
- 这是不可移植的。
- It requires programming in ASM.
- 它需要ASM编程。
- The comparison instructions assume your data is floating points, which might be problematic if some data happens to match the pattern for a NaN.
- 比较指令假定您的数据是浮点数,如果某些数据碰巧与NaN的模式匹配,那么这可能会有问题。
6 个解决方案
#1
3
You could try:
你可以试试:
- check if uint is the largest type that fits in your target CPU (ulongs might fit a native register better)
- 检查uint是否是适合您的目标CPU的最大类型(ulongs可能更适合本地寄存器)
- use 2 pointers of that type
- 使用两种类型的指针
- use 2 local variables using *p++ (dont dereference the pointers 2 times for 1 value)
- 使用*p++ 2个局部变量(不要删除指针2次,值为1)
- devide the counter of the 1st loop up front (use while (counter--))
- 将第1圈的计数器向前偏移(使用while (counter-))
- unroll the 2nd loop by replacing it with a switch (if sizeof(type that fits in a register) is known and will be always the same.)
- 将第二个循环展开,用一个开关替换它(如果sizeof(适用于寄存器的类型)是已知的,并且将始终是相同的。)
Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:
编辑:如果第一个循环是瓶颈,展开可能就是答案。结合将条件的数量减半,在相等的情况下,展开4次,我得到如下结果:
uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint l;
uint r;
int count = (n / uint.sizeof) / 4;
while (count--) {
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
}
Of course that leaves (n / uint.sizeof) % 4
iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.
当然这还剩下(n / ui .sizeof) % 4个迭代要做,你可以通过插入一个开关来混合到这个循环中,我留给读者一个坏笑作为练习。
#2
2
I dont know much about it but there are vector instructions which can apply instructions to many bytes at once. You can use those results to do a blazing fast memcmp. I dont know which instructions you would need but if you look up Larrabee New Instructions or check out this article you may find what you are looking for http://www.ddj.com/architect/216402188
我对它不太了解,但有一些向量指令,可以一次将指令应用到多个字节。您可以使用这些结果来做一个非常快速的memcmp。我不知道您需要哪个指令,但是如果您查找Larrabee新指令或查看本文,您可能会找到您想要的http://www.ddj.com/architectural t/216402188
NOTE: This CPU isnt out ATM AFAIK
注意:这个CPU不在ATM机上
-Edit- Right now i am positive there is an instruction sets (try looking at SSE or SSE2) which may compare 16 bytes at once if they are aligned.
- edit——现在我确定有一个指令集(试试SSE或SSE2),如果它们是对齐的,它们可以同时比较16个字节。
You can try this pure c++ code.
你可以试试这个纯c++代码。
template<class T>
int memoryCompare(const T* lhs, const T * rhs, size_t n) {
const T* endLHS = lhs + n/sizeof(T);
while(lhs<endLHS) {
int i = *lhs - *rhs;
if(i != 0) return i > 0 ? 1 : -1;
lhs++;
rhs++;
}
//more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
return 0;
}
The advantage here is your increasing the pointer so you can dereference it and not apply an index (its ptr_offset[index] vs ptr_offset). The above uses a template so you can use 64 bits on 64 bit machines. and CMPs in assembly are really just subtracts checking the N and Z flags. Instead of comparing N and decreasing N i just compare in my version.
这里的优点是增加了指针,因此可以取消引用,而不应用索引(它的ptr_offset[index] vs ptr_offset)。上面使用的模板可以在64位机器上使用64位。而装配中的CMPs实际上只是对N和Z标志进行减缩。而不是比较N和N的减少,我只是比较我的版本。
#3
1
Well a lot depends upon your system and data. There is only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? etc etc
这很大程度上取决于你的系统和数据。我们只能做出这么多假设。你在用什么处理器?它必须是直接的C代码吗?CPU寄存器有多宽?CPU的缓存结构是什么?等等
It also depends on how different your data is. If it is very unlikely that the first byte from each buffer is the same then the speed of the function is pretty meaningless as, logically, it won't get to the rest of the function. If it is likely that the first n-1 bytes are usually the sme then it becomes more important.
这也取决于你的数据有多不同。如果每个缓冲区的第一个字节不太可能相同,那么函数的速度就毫无意义了,因为,逻辑上,它不会到达函数的其余部分。如果第一个n-1字节通常是sme,那么它就变得更重要了。
All in you are unlikely to see much change regardless of how you do the test.
无论你如何做测试,你都不太可能看到很多变化。
Anyway this is a little implementation of my own, it may, or may not, be faster than your own (or, given i just made it up as i went along, it may, or may not, work ;)):
无论如何,这是我自己的一个小实现,它可能,也可能不会,比你自己的要快(或者,考虑到我是边走边编的,它可能,也可能不工作;):
int memoryCompare(const void* lhs, const void* rhs, size_t n)
{
uint_64 diff = 0
// Test the first few bytes until we are 32-bit aligned.
while( (n & 0x3) != 0 && diff != 0 )
{
diff = (uint_8*)lhs - (uint_8*)rhs;
n--;
((uint_8*)lhs)++;
((uint_8*)rhs)++;
}
// Test the next set of 32-bit integers using comparisons with
// aligned data.
while( n > sizeof( uint_32 ) && diff != 0 )
{
diff = (uint_32*)lhs - (uint_32*)rhs;
n -= sizeof( uint_32 );
((uint_32*)lhs)++;
((uint_32*)rhs)++;
}
// now do final bytes.
while( n > 0 && diff != 0 )
{
diff = (uint_8*)lhs - (uint_8*)rhs;
n--;
((uint_8*)lhs)++;
((uint_8*)rhs)++;
}
return (int)*diff / abs( diff ));
}
#4
1
Does the answer to this question help you at all?
这个问题的答案对你有帮助吗?
If the compiler has support for implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time besting it.
如果编译器支持将memcmp()实现为一个内部/内建函数,那么您似乎很难胜过它。
I have to admit to knowing little to nothing about D, so I have no idea whether the D compiler has support for intrinsics.
我不得不承认,我对D几乎一无所知,所以我不知道D编译器是否支持intrinsic。
#5
1
I think memcmp is specified to do a byte-by-byte comparison, regardless of data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.
我认为memcmp被指定做逐字节的比较,而不考虑数据类型。您确定编译器的实现保留了字符串语义吗?这是不应该的。
#6
1
If you trust your compiler's optimisation you might try a few modifications to acidzombie24s suggestion:
如果你相信你的编译器的优化,你可以尝试一些修改acidzombie24的建议:
template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
const T* endLHS = &lhs[n];
while(lhs<endLHS) {
//A good optimiser will keep these values in register
//and may even be clever enough to just retest the flags
//before incrementing the pointers iff we loop again.
//gcc -O3 did the optimisation very well.
if (*lhs > *rhs) return 1;
if (*lhs++ < *rhs++) return -1;
}
//more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
return 0;
}
Here's the entire gcc -O3 optimised inner loop in x86 assembler code for a C version of the above, only passing char array pointers:
下面是针对上述C版本的x86汇编代码中优化的整个gcc -O3内部循环,只传递char数组指针:
Loop:
incl %eax ; %eax is lhs
incl %edx ; %edx is rhs
cmpl %eax, %ebx ; %ebx is endLHS
jbe ReturnEq
movb (%edx), %cl
cmpb %cl, (%eax)
jg ReturnGT
jge Loop
ReturnLT:
...
#1
3
You could try:
你可以试试:
- check if uint is the largest type that fits in your target CPU (ulongs might fit a native register better)
- 检查uint是否是适合您的目标CPU的最大类型(ulongs可能更适合本地寄存器)
- use 2 pointers of that type
- 使用两种类型的指针
- use 2 local variables using *p++ (dont dereference the pointers 2 times for 1 value)
- 使用*p++ 2个局部变量(不要删除指针2次,值为1)
- devide the counter of the 1st loop up front (use while (counter--))
- 将第1圈的计数器向前偏移(使用while (counter-))
- unroll the 2nd loop by replacing it with a switch (if sizeof(type that fits in a register) is known and will be always the same.)
- 将第二个循环展开,用一个开关替换它(如果sizeof(适用于寄存器的类型)是已知的,并且将始终是相同的。)
Edit: if the first loop is the bottleneck, unrolling might be the answer. Combined with halving the number of conditions in case of equal values, for unrolling 4 times I get something like:
编辑:如果第一个循环是瓶颈,展开可能就是答案。结合将条件的数量减半,在相等的情况下,展开4次,我得到如下结果:
uint* lp = (uint*)lhs;
uint* rp = (uint*)rhs;
uint l;
uint r;
int count = (n / uint.sizeof) / 4;
while (count--) {
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
if( (l = *lp++) != (r = *rp++) {
return (l < r) ? -1 : 1;
}
}
Of course that leaves (n / uint.sizeof) % 4
iterations left to do, which you can mix into this loop by interleaving a switch, I left that as exercise for the reader evil grin.
当然这还剩下(n / ui .sizeof) % 4个迭代要做,你可以通过插入一个开关来混合到这个循环中,我留给读者一个坏笑作为练习。
#2
2
I dont know much about it but there are vector instructions which can apply instructions to many bytes at once. You can use those results to do a blazing fast memcmp. I dont know which instructions you would need but if you look up Larrabee New Instructions or check out this article you may find what you are looking for http://www.ddj.com/architect/216402188
我对它不太了解,但有一些向量指令,可以一次将指令应用到多个字节。您可以使用这些结果来做一个非常快速的memcmp。我不知道您需要哪个指令,但是如果您查找Larrabee新指令或查看本文,您可能会找到您想要的http://www.ddj.com/architectural t/216402188
NOTE: This CPU isnt out ATM AFAIK
注意:这个CPU不在ATM机上
-Edit- Right now i am positive there is an instruction sets (try looking at SSE or SSE2) which may compare 16 bytes at once if they are aligned.
- edit——现在我确定有一个指令集(试试SSE或SSE2),如果它们是对齐的,它们可以同时比较16个字节。
You can try this pure c++ code.
你可以试试这个纯c++代码。
template<class T>
int memoryCompare(const T* lhs, const T * rhs, size_t n) {
const T* endLHS = lhs + n/sizeof(T);
while(lhs<endLHS) {
int i = *lhs - *rhs;
if(i != 0) return i > 0 ? 1 : -1;
lhs++;
rhs++;
}
//more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
return 0;
}
The advantage here is your increasing the pointer so you can dereference it and not apply an index (its ptr_offset[index] vs ptr_offset). The above uses a template so you can use 64 bits on 64 bit machines. and CMPs in assembly are really just subtracts checking the N and Z flags. Instead of comparing N and decreasing N i just compare in my version.
这里的优点是增加了指针,因此可以取消引用,而不应用索引(它的ptr_offset[index] vs ptr_offset)。上面使用的模板可以在64位机器上使用64位。而装配中的CMPs实际上只是对N和Z标志进行减缩。而不是比较N和N的减少,我只是比较我的版本。
#3
1
Well a lot depends upon your system and data. There is only so many assumptions we can make. What processor are you using? Does it have to be straight C code? How wide are the CPU registers? What is the cache structure of the CPU? etc etc
这很大程度上取决于你的系统和数据。我们只能做出这么多假设。你在用什么处理器?它必须是直接的C代码吗?CPU寄存器有多宽?CPU的缓存结构是什么?等等
It also depends on how different your data is. If it is very unlikely that the first byte from each buffer is the same then the speed of the function is pretty meaningless as, logically, it won't get to the rest of the function. If it is likely that the first n-1 bytes are usually the sme then it becomes more important.
这也取决于你的数据有多不同。如果每个缓冲区的第一个字节不太可能相同,那么函数的速度就毫无意义了,因为,逻辑上,它不会到达函数的其余部分。如果第一个n-1字节通常是sme,那么它就变得更重要了。
All in you are unlikely to see much change regardless of how you do the test.
无论你如何做测试,你都不太可能看到很多变化。
Anyway this is a little implementation of my own, it may, or may not, be faster than your own (or, given i just made it up as i went along, it may, or may not, work ;)):
无论如何,这是我自己的一个小实现,它可能,也可能不会,比你自己的要快(或者,考虑到我是边走边编的,它可能,也可能不工作;):
int memoryCompare(const void* lhs, const void* rhs, size_t n)
{
uint_64 diff = 0
// Test the first few bytes until we are 32-bit aligned.
while( (n & 0x3) != 0 && diff != 0 )
{
diff = (uint_8*)lhs - (uint_8*)rhs;
n--;
((uint_8*)lhs)++;
((uint_8*)rhs)++;
}
// Test the next set of 32-bit integers using comparisons with
// aligned data.
while( n > sizeof( uint_32 ) && diff != 0 )
{
diff = (uint_32*)lhs - (uint_32*)rhs;
n -= sizeof( uint_32 );
((uint_32*)lhs)++;
((uint_32*)rhs)++;
}
// now do final bytes.
while( n > 0 && diff != 0 )
{
diff = (uint_8*)lhs - (uint_8*)rhs;
n--;
((uint_8*)lhs)++;
((uint_8*)rhs)++;
}
return (int)*diff / abs( diff ));
}
#4
1
Does the answer to this question help you at all?
这个问题的答案对你有帮助吗?
If the compiler has support for implementing memcmp() as an intrinsic/builtin function, it seems that you would have a hard time besting it.
如果编译器支持将memcmp()实现为一个内部/内建函数,那么您似乎很难胜过它。
I have to admit to knowing little to nothing about D, so I have no idea whether the D compiler has support for intrinsics.
我不得不承认,我对D几乎一无所知,所以我不知道D编译器是否支持intrinsic。
#5
1
I think memcmp is specified to do a byte-by-byte comparison, regardless of data type. Are you sure your compiler's implementation preserves string semantics? It shouldn't.
我认为memcmp被指定做逐字节的比较,而不考虑数据类型。您确定编译器的实现保留了字符串语义吗?这是不应该的。
#6
1
If you trust your compiler's optimisation you might try a few modifications to acidzombie24s suggestion:
如果你相信你的编译器的优化,你可以尝试一些修改acidzombie24的建议:
template<class T> int memoryCompare(const T* lhs, const T * rhs, size_t n) {
const T* endLHS = &lhs[n];
while(lhs<endLHS) {
//A good optimiser will keep these values in register
//and may even be clever enough to just retest the flags
//before incrementing the pointers iff we loop again.
//gcc -O3 did the optimisation very well.
if (*lhs > *rhs) return 1;
if (*lhs++ < *rhs++) return -1;
}
//more code for the remaining bytes or call memoryCompare<char>(lhs, rhs, n%sizeof(T));
return 0;
}
Here's the entire gcc -O3 optimised inner loop in x86 assembler code for a C version of the above, only passing char array pointers:
下面是针对上述C版本的x86汇编代码中优化的整个gcc -O3内部循环,只传递char数组指针:
Loop:
incl %eax ; %eax is lhs
incl %edx ; %edx is rhs
cmpl %eax, %ebx ; %ebx is endLHS
jbe ReturnEq
movb (%edx), %cl
cmpb %cl, (%eax)
jg ReturnGT
jge Loop
ReturnLT:
...