In a program I wrote, 20% of the time is being spent on finding out the minimum of 3 numbers in an inner loop, in this routine:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}
Is there any way to speed this up? I am ok with assembly code too for x86/x86_64.
Edit: In reply to some of the comments:
* Compiler being used is gcc 4.3.3
* As far as assembly is concerned, I am only a beginner there. I asked for assembly here, to learn how to do this. :)
* I have a quad-core Intel 64 running, so MMX/SSE etc. are supported.
* It's hard to post the loop here, but I can tell you it's a heavily optimized implementation of the levenshtein algorithm.
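For context, a typical single-row Levenshtein implementation looks roughly like the sketch below. This is illustrative only, not my actual code (the function and variable names here are made up); the point is that min() sits in the innermost loop and runs once per cell of the edit-distance matrix:

```c
#include <stdlib.h>
#include <string.h>

static inline unsigned int
min3(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) m = c;
    return m;
}

/* Levenshtein distance using a single rolling row; the min3() call in
 * the inner loop is the hot spot being discussed. */
unsigned int levenshtein(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    unsigned int *row = malloc((m + 1) * sizeof *row);
    for (size_t j = 0; j <= m; j++)
        row[j] = (unsigned int)j;
    for (size_t i = 1; i <= n; i++) {
        unsigned int prev = row[0];          /* value of row[i-1][j-1] */
        row[0] = (unsigned int)i;
        for (size_t j = 1; j <= m; j++) {
            unsigned int cur  = row[j];
            unsigned int cost = (s[i-1] == t[j-1]) ? 0 : 1;
            row[j] = min3(row[j] + 1,        /* deletion     */
                          row[j-1] + 1,      /* insertion    */
                          prev + cost);      /* substitution */
            prev = cur;
        }
    }
    unsigned int d = row[m];
    free(row);
    return d;
}
```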
This is what the compiler is giving me for the non-inlined version of min:
.globl min
.type min, @function
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %eax
movl 16(%ebp), %ecx
cmpl %edx, %eax
jbe .L2
movl %edx, %eax
.L2:
cmpl %ecx, %eax
jbe .L3
movl %ecx, %eax
.L3:
popl %ebp
ret
.size min, .-min
.ident "GCC: (Ubuntu 4.3.3-5ubuntu4) 4.3.3"
.section .note.GNU-stack,"",@progbits
The inlined version is inside the -O2-optimized code; even my markers (mrk = 0xfefefefe before and after the call to min()) are getting optimized away by gcc, so I couldn't get hold of it.
Update: I tested the changes suggested by Nils, ephemient, however there's no perceptible performance boost I get by using the assembly versions of min(). However, I get a 12.5% boost by compiling the program with -march=i686, which I guess is because the whole program is getting the benefits of the new faster instructions that gcc is generating with this option. Thanks for your help guys.
P.S. - I used the ruby profiler to measure performance (my C program is a shared library loaded by a ruby program), so I could get time spent only for the top-level C function called by the ruby program, which ends up calling min() down the stack. Please see this question.
10 Answers
#1
11
Make sure you are using an appropriate -march setting, first off. GCC defaults to not using any instructions that were not supported on the original i386 - allowing it to use newer instruction sets can make a BIG difference at times! On -march=core2 -O2 I get:
min:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
movl 12(%ebp), %ecx
movl 16(%ebp), %eax
cmpl %edx, %ecx
leave
cmovbe %ecx, %edx
cmpl %eax, %edx
cmovbe %edx, %eax
ret
The use of cmov here may help you avoid branch delays - and you get it without any inline asm just by passing in -march. When inlined into a larger function this is likely to be even more efficient, possibly just four assembly operations. If you need something faster than this, see if you can get the SSE vector operations to work in the context of your overall algorithm.
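In plain C you can often get the same cmov codegen without any asm by writing the min as nested ternaries and letting the compiler choose. This is just a sketch (min3_ternary is an illustrative name), and whether cmov is actually emitted depends on the compiler version and -march target, so check the generated assembly:

```c
/* Branch-free formulation: GCC with a suitable -march will typically
 * compile this to two cmp/cmov pairs, matching the asm shown above.
 * No guarantees -- inspect the output of gcc -S to confirm. */
static inline unsigned int
min3_ternary(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = (a < b) ? a : b;
    return (m < c) ? m : c;
}
```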
#2
11
Assuming your compiler isn't out to lunch, this should compile down to two compares and two conditional moves. It isn't possible to do much better than that.
If you post the assembly that your compiler is actually generating, we can see if there's anything unnecessary that's slowing it down.
The number one thing to check is that the routine is actually getting inlined. The compiler isn't obligated to do so, and if it's generating a function call, that will be hugely expensive for such a simple operation.
If the call really is getting inlined, then loop unrolling may be beneficial, as DigitalRoss said, or vectorization may be possible.
Edit: If you want to vectorize the code, and are using a recent x86 processor, you will want to use the SSE4.1 pminud instruction (intrinsic: _mm_min_epu32), which takes two vectors of four unsigned ints each and produces a vector of four unsigned ints. Each element of the result is the minimum of the corresponding elements in the two inputs.
I also note that your compiler used branches instead of conditional moves; you should probably try a version that uses conditional moves first and see if that gets you any speedup before you go off to the races on a vector implementation.
#3
6
My take on an x86 assembler implementation, GCC syntax. Should be trivial to translate to another inline assembler syntax:
int inline least (int a, int b, int c)
{
    int result;
    __asm__ ("mov %1, %0\n\t"
             "cmp %0, %2\n\t"
             "cmovle %2, %0\n\t"
             "cmp %0, %3\n\t"
             "cmovle %3, %0\n\t"
             /* "=&r": early-clobber, because %0 is written before the
              * inputs %2 and %3 are read */
             : "=&r" (result)
             : "r" (a), "r" (b), "r" (c)
    );
    return result;
}
New and improved version:
int inline least (int a, int b, int c)
{
    __asm__ ("cmp %0, %1\n\t"
             "cmovle %1, %0\n\t"
             "cmp %0, %2\n\t"
             "cmovle %2, %0\n\t"
             : "+r" (a)
             : "%r" (b), "r" (c)
    );
    return a;
}
NOTE: It may or may not be faster than C code.
This depends on a lot of factors. Usually cmov wins if the branches are not predictable (on some x86 architectures); OTOH, inline assembler is always a problem for the optimizer, so the optimization penalty for the surrounding code may outweigh all gains.
Btw Sudhanshu, it would be interesting to hear how this code performs with your testdata.
#4
5
The SSE2 instruction extensions contain an integer min instruction (pminsw) that can choose 8 minimums at a time. See _mm_min_epi16 in http://www.intel.com/software/products/compilers/clin/docs/ug_cpp/comm1046.htm
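As a sketch (min8_epi16 is an illustrative name), the SSE2 variant computes eight minimums per instruction, but note that it works on *signed* 16-bit lanes, so values must fit in [-32768, 32767]:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Eight signed 16-bit minimums in one pminsw instruction. Distances in
 * a typical Levenshtein computation easily fit in 16 bits. */
void min8_epi16(const short a[8], const short b[8], short out[8])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_min_epi16(va, vb));
}
```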
#5
5
This drop-in replacement clocks in about 1.5% faster on my AMD Phenom:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    asm("cmp %1,%0\n"
        "cmova %1,%0\n"
        "cmp %2,%0\n"
        "cmova %2,%0\n"
        : "+r" (a) : "r" (b), "r" (c));
    return a;
}
Results may vary; some x86 processors don't handle CMOV very well.
#6
1
First, look at the disassembly. That'll tell you a lot. For example, as written, there are 2 if-statements (which means there are 2 possible branch mispredictions), but my guess is that a decent modern C compiler will have some clever optimization that can do it without branching. I'd be curious to find out.
Second, if your libc has special built-in min/max functions, use them. GNU libc has fmin/fmax for floating-point, for example, and they claim that "On some processors these functions can use special machine instructions to perform these operations faster than the equivalent C code". Maybe there's something similar for uints.
Finally, if you're doing this to a bunch of numbers in parallel, there are probably vector instructions to do this, which could provide significant speedup. But I've even seen non-vector code be faster when using vector units. Something like "load one uint into a vector register, call vector min function, get result out" looks dumb but might actually be faster.
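One more branch-free option worth benchmarking is the classic sign-mask bit trick (min2_bits/min3_bits are illustrative names; modern compilers often produce equivalent or better code from the plain ternary form, so measure before committing to it):

```c
/* Two-way branchless min: when a < b, the mask -(a < b) is all-ones and
 * b ^ (a ^ b) == a is selected; otherwise the mask is zero and b wins. */
static inline unsigned int min2_bits(unsigned int a, unsigned int b)
{
    return b ^ ((a ^ b) & -(unsigned int)(a < b));
}

/* Three-way min by chaining the two-way version. */
static inline unsigned int
min3_bits(unsigned int a, unsigned int b, unsigned int c)
{
    return min2_bits(min2_bits(a, b), c);
}
```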
#7
0
If you are only doing one comparison you might want to unroll the loop manually.
First, see if you can get the compiler to unroll the loop for you, and if you can't, do it yourself. This will at least reduce the loop control overhead...
#8
0
Yes, post assembly, but my naive optimization is:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    unsigned int m = a;
    if (m > b) m = b;
    if (m > c) return c;
    return m;
}
#9
0
You could try something like this to save on declaration and unnecessary comparisons:
static inline unsigned int
min(unsigned int a, unsigned int b, unsigned int c)
{
    if (a < b)
    {
        if (a < c)
            return a;
        else
            return c;
    }
    if (b < c)
        return b;
    else
        return c;
}
#10
0
These are all good answers. At the risk of being accused of not answering the question, I would also look at the other 80% of the time. Stackshots are my favorite way to find code worth optimizing, especially if it is function calls that you find out you don't absolutely need.