I have two arrays, char* c and float* f, and I need to perform this operation:
// Compute float mask
float* f;
char* c;
char c_thresh;
int n;
for ( int i = 0; i < n; ++i )
{
if ( c[i] < c_thresh ) f[i] = 0.0f;
else f[i] = 1.0f;
}
I am looking for a fast way to do it: without conditionals, and using SSE (4.2 or AVX) if possible.
If using float instead of char can result in faster code, I can change my code to use floats only:
// Compute float mask
float* f;
float* c;
float c_thresh;
int n;
for ( int i = 0; i < n; ++i )
{
if ( c[i] < c_thresh ) f[i] = 0.0f;
else f[i] = 1.0f;
}
Thanks
6 Answers
#1
5
Pretty easy: just do the comparison, convert the bytes to dwords, and AND with 1.0f. (Not tested, and this isn't meant to be copy-and-paste code anyway; it's meant to show how you do it.)
movd xmm0, [c] ; read 4 bytes from c
pcmpgtb xmm0, threshold ; compare (note: comparison is >, not >=, so adjust threshold)
pmovsxbd xmm0, xmm0 ; sign-extend bytes to dwords (zero-extension would lose the all-ones mask)
pand xmm0, one ; AND all four elements with 1.0f
movdqa [f], xmm0 ; save result
Should be pretty easy to convert to intrinsics.
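For illustration, here is one possible intrinsics translation of the asm above (my own sketch, not from the answer; the function name mask4 is made up). A byte-to-dword widening that preserves the 0/-1 comparison mask is needed; pmovsxbd (SSE4.1) would do it directly, but this sketch substitutes an SSE2 unpack pair, which is equivalent for a 0/-1 byte mask:

```cpp
#include <emmintrin.h>  // SSE2
#include <cstring>

// Compute f[i] = (c[i] >= c_thresh) ? 1.0f : 0.0f for 4 elements.
// Assumes signed char, and that c_thresh > CHAR_MIN (the comparison is >,
// not >=, so the threshold is adjusted by one).
void mask4(float* f, const char* c, char c_thresh) {
    int bits;
    std::memcpy(&bits, c, 4);                                   // movd: read 4 bytes
    __m128i m = _mm_cvtsi32_si128(bits);
    m = _mm_cmpgt_epi8(m, _mm_set1_epi8((char)(c_thresh - 1))); // bytes -> 0x00 or 0xFF
    m = _mm_unpacklo_epi8(m, m);                                // bytes -> words (0 or 0xFFFF)
    m = _mm_unpacklo_epi16(m, m);                               // words -> dwords (0 or -1)
    m = _mm_and_si128(m, _mm_set1_epi32(0x3f800000));           // AND with bit pattern of 1.0f
    _mm_storeu_si128((__m128i*)f, m);                           // store 4 floats
}
```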
#2
5
The following code uses SSE2 (I think).
It performs 16 byte comparisons in one instruction (_mm_cmpgt_epi8). It assumes that char is signed; if your char is unsigned, it requires additional fiddling (flipping the most significant bit of each char).
The only non-standard thing it does is use the magic number 3f80 to represent the floating-point constant 1.0. The magic number is actually 0x3f800000, but the fact that the 16 LSBs are zero makes it possible to do the bit fiddling more efficiently (using 16-bit masks instead of 32-bit ones).
// load (assuming the pointer is aligned)
__m128i input = *(const __m128i*)c;
// compare
__m128i cmp = _mm_cmpgt_epi8(input, _mm_set1_epi8(c_thresh - 1));
// convert to 16-bit
__m128i c0 = _mm_unpacklo_epi8(cmp, cmp);
__m128i c1 = _mm_unpackhi_epi8(cmp, cmp);
// convert ffff to 3f80
c0 = _mm_and_si128(c0, _mm_set1_epi16(0x3f80));
c1 = _mm_and_si128(c1, _mm_set1_epi16(0x3f80));
// convert to 32-bit and write (assuming the pointer is aligned)
__m128i* result = (__m128i*)f;
result[0] = _mm_unpacklo_epi16(_mm_setzero_si128(), c0);
result[1] = _mm_unpackhi_epi16(_mm_setzero_si128(), c0);
result[2] = _mm_unpacklo_epi16(_mm_setzero_si128(), c1);
result[3] = _mm_unpackhi_epi16(_mm_setzero_si128(), c1);
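The fragment above handles exactly 16 chars. As a sketch, it could be wrapped into a full loop with a scalar tail like this (the function name is mine; unaligned loads/stores are used so the alignment assumption can be dropped):

```cpp
#include <emmintrin.h>  // SSE2

// Compute f[i] = (c[i] >= c_thresh) ? 1.0f : 0.0f for n elements.
// Assumes signed char and c_thresh > CHAR_MIN, as in the answer above.
void byte_mask_to_float(float* f, const char* c, char c_thresh, int n) {
    const __m128i thresh = _mm_set1_epi8((char)(c_thresh - 1));
    const __m128i one16  = _mm_set1_epi16(0x3f80);
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i input = _mm_loadu_si128((const __m128i*)(c + i));
        __m128i cmp   = _mm_cmpgt_epi8(input, thresh);     // bytes: 0x00 or 0xFF
        __m128i c0 = _mm_unpacklo_epi8(cmp, cmp);          // words: 0x0000 or 0xFFFF
        __m128i c1 = _mm_unpackhi_epi8(cmp, cmp);
        c0 = _mm_and_si128(c0, one16);                     // 0x0000 or 0x3f80
        c1 = _mm_and_si128(c1, one16);
        // interleave with zeros: each dword becomes 0x00000000 or 0x3f800000 (= 1.0f)
        _mm_storeu_si128((__m128i*)(f + i),      _mm_unpacklo_epi16(_mm_setzero_si128(), c0));
        _mm_storeu_si128((__m128i*)(f + i + 4),  _mm_unpackhi_epi16(_mm_setzero_si128(), c0));
        _mm_storeu_si128((__m128i*)(f + i + 8),  _mm_unpacklo_epi16(_mm_setzero_si128(), c1));
        _mm_storeu_si128((__m128i*)(f + i + 12), _mm_unpackhi_epi16(_mm_setzero_si128(), c1));
    }
    for (; i < n; ++i)  // scalar tail for the last n % 16 elements
        f[i] = (c[i] >= c_thresh) ? 1.0f : 0.0f;
}
```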
#3
3
By switching to floats you can auto-vectorize the loop in GCC and not worry about intrinsics. The following code does what you want and auto-vectorizes.
void foo(float *f, float *c, float c_thresh, const int n) {
for (int i = 0; i < n; ++i) {
f[i] = (float)(c[i] >= c_thresh);
}
}
Compiled with
g++ -O3 -Wall -pedantic -march=native main.cpp -ftree-vectorizer-verbose=1
You can see the results and edit/compile the code yourself at coliru. However, MSVC2013 did not vectorize the loop.
#4
2
AVX version:
void floatSelect(float* f, const char* c, size_t n, char c_thresh) {
for (size_t i = 0; i < n; ++i) {
if (c[i] < c_thresh) f[i] = 0.0f;
else f[i] = 1.0f;
}
}
void vecFloatSelect(float* f, const char* c, size_t n, char c_thresh) {
const auto thresh = _mm_set1_epi8(c_thresh);
const auto zeros = _mm256_setzero_ps();
const auto ones = _mm256_set1_ps(1.0f);
const auto shuffle0 = _mm_set_epi8(3, -1, -1, -1, 2, -1, -1, -1, 1, -1, -1, -1, 0, -1, -1, -1);
const auto shuffle1 = _mm_set_epi8(7, -1, -1, -1, 6, -1, -1, -1, 5, -1, -1, -1, 4, -1, -1, -1);
const auto shuffle2 = _mm_set_epi8(11, -1, -1, -1, 10, -1, -1, -1, 9, -1, -1, -1, 8, -1, -1, -1);
const auto shuffle3 = _mm_set_epi8(15, -1, -1, -1, 14, -1, -1, -1, 13, -1, -1, -1, 12, -1, -1, -1);
const size_t nVec = (n / 16) * 16;
for (size_t i = 0; i < nVec; i += 16) {
const auto chars = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c + i));
const auto mask = _mm_cmplt_epi8(chars, thresh);
const auto floatMask0 = _mm_shuffle_epi8(mask, shuffle0);
const auto floatMask1 = _mm_shuffle_epi8(mask, shuffle1);
const auto floatMask2 = _mm_shuffle_epi8(mask, shuffle2);
const auto floatMask3 = _mm_shuffle_epi8(mask, shuffle3);
const auto floatMask01 = _mm256_set_m128i(floatMask1, floatMask0);
const auto floatMask23 = _mm256_set_m128i(floatMask3, floatMask2);
const auto floats0 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask01));
const auto floats1 = _mm256_blendv_ps(ones, zeros, _mm256_castsi256_ps(floatMask23));
_mm256_storeu_ps(f + i, floats0);
_mm256_storeu_ps(f + i + 8, floats1);
}
floatSelect(f + nVec, c + nVec, n % 16, c_thresh);
}
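Any of the vectorized variants in this thread can be validated against the scalar reference. A minimal test harness (entirely my own, not part of the answer; the names checkMask and floatSelectRef are made up) might look like:

```cpp
#include <cstdlib>
#include <vector>

// Scalar reference, same semantics as floatSelect in the answer above.
void floatSelectRef(float* f, const char* c, size_t n, char c_thresh) {
    for (size_t i = 0; i < n; ++i)
        f[i] = (c[i] < c_thresh) ? 0.0f : 1.0f;
}

// Run a candidate implementation on random data and compare it
// element-by-element against the scalar reference.
template <typename Fn>
bool checkMask(Fn candidate, size_t n, char c_thresh) {
    std::vector<char> c(n);
    std::vector<float> got(n), want(n);
    for (size_t i = 0; i < n; ++i) c[i] = (char)(std::rand() % 256);
    floatSelectRef(want.data(), c.data(), n, c_thresh);
    candidate(got.data(), c.data(), n, c_thresh);
    for (size_t i = 0; i < n; ++i)
        if (got[i] != want[i]) return false;
    return true;
}
```

For example, checkMask(vecFloatSelect, 1000, 10) should return true; odd sizes like 17 also exercise the scalar-tail path.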
#5
1
What about:
f[i] = (c[i] >= c_thresh);
At least this removes the conditional.
#6
1
Converting to
f[i] = (float)(c[i] >= c_thresh);
will also be auto-vectorized by the Intel Compiler (others have noted this is true for gcc as well).
In case you need to auto-vectorize some branchy loop in general, you could also try #pragma ivdep or #pragma simd (the latter is part of the Intel Cilk Plus and OpenMP 4.0 standards). These pragmas auto-vectorize the given code in a portable way for SSE, AVX, and future vector extensions (like AVX-512). They are supported by the Intel Compiler (all known versions), by the Cray and PGI compilers (ivdep only), probably by the upcoming GCC 4.9 release, and partially by MSVC (ivdep only) starting from VS2012.
For the given example I didn't change anything (I kept the if and the char*), just added #pragma ivdep:
void foo(float *f, char*c, char c_thresh, const int n) {
#pragma ivdep
for ( int i = 0; i < n; ++i )
{
if ( c[i] < c_thresh ) f[i] = 0.0f;
else f[i] = 1.0f;
}
}
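Since the OpenMP 4.0 simd pragma is mentioned above, a portable variant might be sketched like this (the name foo_simd is mine; on gcc/clang the pragma takes effect with -fopenmp or -fopenmp-simd, and is harmlessly ignored otherwise):

```cpp
// Same loop as above, with the OpenMP 4.0 simd pragma instead of ivdep.
// The pragma asserts the loop is safe to vectorize; the ternary keeps
// the body branch-free for the vectorizer.
void foo_simd(float *f, const char *c, char c_thresh, const int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        f[i] = (c[i] < c_thresh) ? 0.0f : 1.0f;
}
```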
On my Core i5 with no AVX support (SSE3 only), for n = 32M (32000000), with c[i] generated randomly and c_thresh equal to 0 (using signed char), the given code provides roughly a 5x speed-up due to vectorization with ICL.
A full test (with an additional correctness check) is available here (it's coliru, i.e. gcc 4.8 only, no ICL/Cray; that's why it doesn't vectorize in the coliru environment).
Further performance optimization should be possible via prefetching, alignment, and type-conversion pragmas/optimizations. For this simple case, adding the restrict keyword (or __restrict__, depending on the compiler used) may also work instead of ivdep/simd, while for more general cases the simd/ivdep pragmas are the most powerful.
Note: in fact, #pragma ivdep "instructs the compiler to ignore assumed cross-iteration dependencies" (roughly speaking, those that would lead to data races if you parallelized the same loop). Compilers are very conservative in these assumptions for well-known reasons. In the given case there are obviously no write-after-read or read-after-write dependencies. If needed, one could validate the absence of such dependencies, at least on a given workload, with dynamic tools like Advisor XE Correctness analysis, as shown in my comments below.