This question already has an answer here:
- What is the fastest way to convert float to int on x86
We're doing a great deal of floating-point to integer conversions in our project. Basically, something like this:
for(int i = 0; i < HUGE_NUMBER; i++)
int_array[i] = float_array[i];
The default C function which performs the conversion turns out to be quite time consuming.
Is there any workaround (maybe a hand-tuned function) that can speed up the process a little bit? We don't care much about precision.
16 Answers
#1
15
Most of the other answers here just try to eliminate loop overhead.
Only deft_code's answer gets to the heart of what is likely the real problem -- that converting floating point to integers is shockingly expensive on an x86 processor. deft_code's solution is correct, though he gives no citation or explanation.
Here is the source of the trick, with some explanation and also versions specific to whether you want to round up, down, or toward zero: Know your FPU
Sorry to provide a link, but really anything written here, short of reproducing that excellent article, is not going to make things clear.
#2
14
inline int float2int( double d )
{
    union Cast
    {
        double d;
        long l;
    };
    volatile Cast c;
    c.d = d + 6755399441055744.0;
    return c.l;
}

// this is the same thing (use one or the other --
// identical signatures can't coexist) but it's
// not always optimizer safe
inline int float2int( double d )
{
    d += 6755399441055744.0;
    return reinterpret_cast<int&>(d);
}

for(int i = 0; i < HUGE_NUMBER; i++)
    int_array[i] = float2int(float_array[i]);
The double parameter is not a mistake! There is a way to do this trick with floats directly, but it gets ugly trying to cover all the corner cases. In its current form this function rounds the float to the nearest whole number; if you want truncation instead, use 6755399441055743.5 (0.5 less).
#3
8
I ran some tests on different ways of doing float-to-int conversion. The short answer is to assume your customer has SSE2-capable CPUs and set the /arch:SSE2 compiler flag. This will allow the compiler to use the SSE scalar instructions which are twice as fast as even the magic-number technique.
Otherwise, if you have long strings of floats to grind, use the SSE2 packed ops.
#4
3
There's an FISTTP instruction in the SSE3 instruction set which does what you want, but as to whether or not it could be utilized and produce faster results than libc, I have no idea.
#5
2
Is the time large enough that it outweighs the cost of starting a couple of threads?
Assuming you have a multi-core processor or multiple processors on your box that you could take advantage of, this would be a trivial task to parallelize across multiple threads.
#6
2
The key is to avoid the _ftol() function, which is needlessly slow. Your best bet for long lists of data like this is to use the SSE2 instruction cvtps2dq, which converts four packed floats to four packed int32s in a single operation. (If your source data is doubles, the counterpart cvtpd2dq converts two packed doubles to two int32s, so you do it twice and shuffle the results together to fill a register of four int32s.) You don't need assembly to do this; MSVC exposes the relevant instructions as compiler intrinsics -- _mm_cvtps_epi32() and _mm_cvtpd_epi32() respectively, if my memory serves me correctly.
If you do this it is very important that your float and int arrays be 16-byte aligned so that the SSE2 load/store intrinsics can work at maximum efficiency. Also, I recommend you software-pipeline a little and process sixteen floats at once in each loop, e.g. (assuming that the "functions" here are actually calls to compiler intrinsics):
for(int i = 0; i < HUGE_NUMBER; i+=16)
{
    //int_array[i] = float_array[i];
    __m128 a = sse_load4(float_array+i+0);
    __m128 b = sse_load4(float_array+i+4);
    __m128 c = sse_load4(float_array+i+8);
    __m128 d = sse_load4(float_array+i+12);
    a = sse_convert4(a);
    b = sse_convert4(b);
    c = sse_convert4(c);
    d = sse_convert4(d);
    sse_write4(int_array+i+0, a);
    sse_write4(int_array+i+4, b);
    sse_write4(int_array+i+8, c);
    sse_write4(int_array+i+12, d);
}
The reason for this is that the SSE instructions have a long latency, so if you follow a load into xmm0 immediately with a dependent operation on xmm0 then you will have a stall. Having multiple registers "in flight" at once hides the latency a little. (Theoretically a magic all-knowing compiler could alias its way around this problem but in practice it doesn't.)
Failing this SSE juju, you can supply the /QIfist option to MSVC, which will cause it to issue the single opcode fist instead of a call to _ftol; this means it will simply use whichever rounding mode happens to be set in the CPU, without making sure it is ANSI C's specific truncate op. The Microsoft docs say /QIfist is deprecated because their floating point code is fast now, but a disassembler will show you that this is unjustifiedly optimistic. Even /fp:fast simply results in a call to _ftol_sse2, which, though faster than the egregious _ftol, is still a function call followed by a latent SSE op, and thus unnecessarily slow.
I'm assuming you're on x86 arch, by the way -- if you're on PPC there are equivalent VMX operations, or you can use the magic-number-multiply trick mentioned above followed by a vsel (to mask out the non-mantissa bits) and an aligned store.
#7
1
You might be able to load all of the floats into the SSE unit of your processor using some magic assembly code, then do the equivalent code to convert the values to ints, then read them back out. I'm not sure this would be any faster, though. I'm not an SSE guru, so I don't know how to do this. Maybe someone else can chime in.
#8
1
In Visual C++ 2008, the compiler generates SSE2 calls by itself if you do a release build with maxed-out optimization options; look at a disassembly to verify (though some conditions have to be met, so play around with your code).
#9
1
See this Intel article for speeding up integer conversions:
http://software.intel.com/en-us/articles/latency-of-floating-point-to-integer-conversions/
According to Microsoft, the /QIfist compiler option is deprecated in VS 2005 because integer conversion has been sped up. They neglect to say how it has been sped up, but looking at the disassembly listing might give a clue.
http://msdn.microsoft.com/en-us/library/z8dh4h17(vs.80).aspx
#10
1
Most C compilers generate calls to _ftol or something similar for every float-to-int conversion. Putting in a reduced floating-point conformance switch (like fp:fast) might help -- IF you understand AND accept the other effects of this switch. Other than that, put the thing in a tight assembly or SSE intrinsic loop, IF you are OK with AND understand the different rounding behavior. For large loops like your example, you should write a function that sets up the floating-point control word once, then does the bulk rounding with only fistp instructions, and then resets the control word -- IF you are OK with an x86-only code path, but at least you will not change the rounding. Read up on the fld and fistp FPU instructions and the FPU control word.
#11
0
What compiler are you using? In Microsoft's more recent C/C++ compilers, there is an option under C/C++ -> Code Generation -> Floating point model, which has options: fast, precise, strict. I think precise is the default, and it works by emulating FP operations to some extent. If you are using an MS compiler, how is this option set? Does it help to set it to "fast"? In any case, what does the disassembly look like?
As thirtyseven said above, the CPU can convert float<->int in essentially one instruction, and it doesn't get any faster than that (short of a SIMD operation).

Also note that modern CPUs use the same FP unit for both single (32-bit) and double (64-bit) FP numbers, so unless you are trying to save memory by storing a lot of floats, there's really no reason to favor float over double.
#13
0
I'm surprised by your result. What compiler are you using? Are you compiling with optimization turned all the way up? Have you confirmed using valgrind and Kcachegrind that this is where the bottleneck is? What processor are you using? What does the assembly code look like?
The conversion itself should be compiled to a single instruction. A good optimizing compiler should unroll the loop so that several conversions are done per test-and-branch. If that's not happening, you can unroll the loop by hand:
int i;
for(i = 0; i < HUGE_NUMBER-3; i += 4) {
    int_array[i]   = float_array[i];
    int_array[i+1] = float_array[i+1];
    int_array[i+2] = float_array[i+2];
    int_array[i+3] = float_array[i+3];
}
for(; i < HUGE_NUMBER; i++)
    int_array[i] = float_array[i];
If your compiler is really pathetic, you might need to help it with the common subexpressions, e.g.,
int *ip = int_array+i;
float *fp = float_array+i;
ip[0] = fp[0];
ip[1] = fp[1];
ip[2] = fp[2];
ip[3] = fp[3];
Do report back with more info!
#14
0
If you do not care very much about the rounding semantics, you can use the lrint() function. This allows for more freedom in rounding, and it can be much faster.

Technically, it's a C99 function, but your compiler probably exposes it in C++. A good compiler will also inline it to one instruction (a modern g++ will).

lrint documentation
#15
0
An excellent trick, but for rounding only: using 6755399441055743.5 (0.5 less) to do truncation won't work.
6755399441055744 = 2^52 + 2^51; the overflowing decimals fall off the end of the mantissa, leaving the integer that you want in bits 51-0 of the FPU register.
In IEEE 754, 6755399441055744.0 =

sign  exponent     mantissa
0     10000110011  1000000000000000000000000000000000000000000000000000
6755399441055743.5 will, however, also compile to the same bit pattern: 0 10000110011 1000000000000000000000000000000000000000000000000000. The 0.5 overflows off the end (rounding up), which is why the trick works in the first place.
To do truncation you would have to add 0.5 to your double and then do this; the guard digits should take care of rounding to the correct result done this way. Also watch out for 64-bit gcc Linux, where long rather annoyingly means a 64-bit integer.
#16
-1
If you have very large arrays (bigger than a few MB--the size of the CPU cache), time your code and see what the throughput is. You're probably saturating the memory bus, not the FP unit. Look up the maximum theoretical bandwidth for your CPU and see how close to it you are.
If you're being limited by the memory bus, extra threads will just make it worse. You need better hardware (e.g. faster memory, different CPU, different motherboard).
In response to Larry Gritz's comment...
You are correct: the FPU is a major bottleneck (and using the xs_CRoundToInt trick allows one to come very close to saturating the memory bus).
Here are some test results for a Core 2 (Q6600) processor. The theoretical main-memory bandwidth for this machine is 3.2 GB/s (L1 and L2 bandwidths are much higher). The code was compiled with Visual Studio 2008. Similar results for 32-bit and 64-bit, and with /O2 or /Ox optimizations.
WRITING ONLY...
 1866359 ticks with 33554432 array elements (33554432 touched). Bandwidth: 1.91793 GB/s
 154749 ticks with 262144 array elements (33554432 touched). Bandwidth: 23.1313 GB/s
 108816 ticks with 8192 array elements (33554432 touched). Bandwidth: 32.8954 GB/s
USING CASTING...
 5236122 ticks with 33554432 array elements (33554432 touched). Bandwidth: 0.683625 GB/s
 2014309 ticks with 262144 array elements (33554432 touched). Bandwidth: 1.77706 GB/s
 1967345 ticks with 8192 array elements (33554432 touched). Bandwidth: 1.81948 GB/s
USING xs_CRoundToInt...
 1490583 ticks with 33554432 array elements (33554432 touched). Bandwidth: 2.40144 GB/s
 1079530 ticks with 262144 array elements (33554432 touched). Bandwidth: 3.31584 GB/s
 1008407 ticks with 8192 array elements (33554432 touched). Bandwidth: 3.5497 GB/s
(Windows) source code:
// floatToIntTime.cpp : Defines the entry point for the console application.
//
#include <windows.h>
#include <iostream>
using namespace std;
double const _xs_doublemagic = double(6755399441055744.0);
inline int xs_CRoundToInt(double val, double dmr=_xs_doublemagic) {
    val = val + dmr;
    return ((int*)&val)[0];
}

static size_t const N = 256*1024*1024/sizeof(double);
int I[N];
double F[N];

static size_t const L1CACHE = 128*1024/sizeof(double);
static size_t const L2CACHE = 4*1024*1024/sizeof(double);
static size_t const Sz[]    = {N, L2CACHE/2, L1CACHE/2};
static size_t const NIter[] = {1, N/(L2CACHE/2), N/(L1CACHE/2)};

int main(int argc, char *argv[])
{
    __int64 freq;
    QueryPerformanceFrequency((LARGE_INTEGER*)&freq);

    cout << "WRITING ONLY..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = 13;
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    cout << "USING CASTING..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = (int)F[n];
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    cout << "USING xs_CRoundToInt..." << endl;
    for (int t=0; t<3; t++) {
        __int64 t0,t1;
        QueryPerformanceCounter((LARGE_INTEGER*)&t0);
        size_t const niter = NIter[t];
        size_t const sz = Sz[t];
        for (size_t i=0; i<niter; i++) {
            for (size_t n=0; n<sz; n++) {
                I[n] = xs_CRoundToInt(F[n]);
            }
        }
        QueryPerformanceCounter((LARGE_INTEGER*)&t1);
        double bandwidth = 8*niter*sz / (((double)(t1-t0))/freq) / 1024/1024/1024;
        cout << " " << (t1-t0) << " ticks with " << sz
             << " array elements (" << niter*sz << " touched). "
             << "Bandwidth: " << bandwidth << " GB/s" << endl;
    }

    return 0;
}