What is the fastest way to pack 32 0/1 values into the bits of a single 32-bit variable?

Time: 2022-03-14 03:12:12

I'm working on an x86 or x86_64 machine. I have an array unsigned int a[32], all of whose elements have the value 0 or 1. I want to set the single variable unsigned int b so that ((b >> i) & 1) == a[i] holds for all 32 elements of a. I'm working with GCC on Linux (shouldn't matter much, I guess).


What's the fastest way to do this in C?


The fastest way on recent x86 processors is probably to make use of the MOVMSKB family of instructions which extract the MSBs of a SIMD word and pack them into a normal integer register.


I fear SIMD intrinsics are not really my thing but something along these lines ought to work if you've got an AVX2 equipped processor:


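#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>
uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());  // nonzero bytes become 0xFF
    return _mm256_movemask_epi8(tmp);                      // collect the 32 byte MSBs
}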

This assumes sizeof(bool) == 1. On older SSE2-only systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
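
For the SSE2 case mentioned above, a minimal sketch might look like the following. It assumes each element is one byte holding 0 or 1, so it applies to a byte array rather than directly to the unsigned int a[32] from the question. The name bitpack_sse2 and the scalar fallback are illustrative additions, not part of the original answer.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Packs 32 bytes (each 0 or 1) so that bit i of the result equals array[i]. */
uint32_t bitpack_sse2(const uint8_t array[32]) {
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* Unaligned loads of the low and high 16 bytes. */
    __m128i lo = _mm_loadu_si128((const __m128i *)array);
    __m128i hi = _mm_loadu_si128((const __m128i *)(array + 16));
    /* Turn each nonzero byte into 0xFF so that its MSB is set. */
    lo = _mm_cmpgt_epi8(lo, zero);
    hi = _mm_cmpgt_epi8(hi, zero);
    /* movemask collects the 16 byte MSBs into bits 0..15. */
    uint32_t lo_bits = (uint32_t)_mm_movemask_epi8(lo);
    uint32_t hi_bits = (uint32_t)_mm_movemask_epi8(hi);
    return lo_bits | (hi_bits << 16);
#else
    /* Portable fallback for builds without SSE2 (assumption: added for illustration). */
    uint32_t b = 0;
    for (unsigned i = 0; i < 32; ++i)
        b |= (uint32_t)(array[i] & 1u) << i;
    return b;
#endif
}
```

Because movemask puts the lowest byte in bit 0, this variant produces the bit order the question asks for.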




Other answers contain an obvious loop implementation.


Here's a first variant:


unsigned int result=0;
for(unsigned i = 0; i < 32; ++i)
    result = (result<<1) + a[i];

On modern x86 CPUs, I believe a register shift of any distance takes constant time, in which case this solution won't be better. Your CPU might not be so nice; this code minimizes the cost of long-distance shifts by doing 32 1-bit shifts, which every CPU can do (you can always add result to itself to get the same effect). The obvious loop implementation shown by others does the equivalent of about 500 1-bit shifts (0 + 1 + ... + 31 = 496), by virtue of shifting a distance equal to the loop index. (See @Jongware's measurements of the differences in the comments; apparently long shifts on x86 are not unit time.)


Let us try something more radical.


Assume you can pack m booleans into an int somehow (trivially you can do this for m==1), and that you have two instance variables i1 and i2 containing such m packed bits.


Then the following code packs m*2 booleans into an int:
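
A minimal sketch of such a combining step, assuming i1 supplies the higher-order group and that 2*m does not exceed the width of unsigned int (the helper name combine is hypothetical):

```c
/* Combines two groups of m packed bits into one group of 2*m bits.
   Assumes 2*m <= 32 and that i1 and i2 have no bits set above bit m-1. */
unsigned int combine(unsigned int i1, unsigned int i2, unsigned int m) {
    return (i1 << m) + i2;  /* "|" works equally well, since the groups do not overlap */
}
```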



Using this we can pack 2^n bits as follows:


 unsigned int a2[16],a4[8],a8[4],a16[2], a32[1]; // each "aN" will hold N bits of the answer

 a2[0]=(a1[0]<<1)+a1[1];  // the original bits are a1[k]; can be scalar variables or ints
 a2[1]=(a1[2]<<1)+a1[3];  //  yes, you can use "|" instead of "+"





Assuming our friendly compiler resolves an[k] into a (scalar) direct memory access (if not, you can simply replace the variable an[k] with an_k), the above code does (abstractly) 63 fetches, 31 writes, 31 shifts and 31 adds. (There's an obvious extension to 64 bits).
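
Written out in full, the tree might look like the sketch below. It follows the two lines shown above, so a1[0] ends up in the most significant bit; the function name pack32_tree is an illustrative choice.

```c
/* Packs 32 0/1 values with a balanced tree of shift-and-add steps.
   a1[0] ends up in the most significant bit, a1[31] in bit 0. */
unsigned int pack32_tree(const unsigned int a1[32]) {
    unsigned int a2[16], a4[8], a8[4], a16[2];

    /* Level 1: 16 pairs of single bits. */
    a2[0]  = (a1[0]  << 1) + a1[1];
    a2[1]  = (a1[2]  << 1) + a1[3];
    a2[2]  = (a1[4]  << 1) + a1[5];
    a2[3]  = (a1[6]  << 1) + a1[7];
    a2[4]  = (a1[8]  << 1) + a1[9];
    a2[5]  = (a1[10] << 1) + a1[11];
    a2[6]  = (a1[12] << 1) + a1[13];
    a2[7]  = (a1[14] << 1) + a1[15];
    a2[8]  = (a1[16] << 1) + a1[17];
    a2[9]  = (a1[18] << 1) + a1[19];
    a2[10] = (a1[20] << 1) + a1[21];
    a2[11] = (a1[22] << 1) + a1[23];
    a2[12] = (a1[24] << 1) + a1[25];
    a2[13] = (a1[26] << 1) + a1[27];
    a2[14] = (a1[28] << 1) + a1[29];
    a2[15] = (a1[30] << 1) + a1[31];

    /* Level 2: 8 pairs of 2-bit groups. */
    a4[0] = (a2[0]  << 2) + a2[1];
    a4[1] = (a2[2]  << 2) + a2[3];
    a4[2] = (a2[4]  << 2) + a2[5];
    a4[3] = (a2[6]  << 2) + a2[7];
    a4[4] = (a2[8]  << 2) + a2[9];
    a4[5] = (a2[10] << 2) + a2[11];
    a4[6] = (a2[12] << 2) + a2[13];
    a4[7] = (a2[14] << 2) + a2[15];

    /* Level 3: 4 pairs of 4-bit groups. */
    a8[0] = (a4[0] << 4) + a4[1];
    a8[1] = (a4[2] << 4) + a4[3];
    a8[2] = (a4[4] << 4) + a4[5];
    a8[3] = (a4[6] << 4) + a4[7];

    /* Level 4: 2 pairs of 8-bit groups. */
    a16[0] = (a8[0] << 8) + a8[1];
    a16[1] = (a8[2] << 8) + a8[3];

    /* Level 5: the final 32-bit result. */
    return (a16[0] << 16) + a16[1];
}
```

To get the order the question asks for (a[0] in bit 0), swap the two operands in every step so that the higher-indexed element is the one being shifted.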


On modern x86 CPUs, I believe a register shift of any distance takes constant time. If not, this code still keeps the cost of long-distance shifts low: the five levels shift by 1, 2, 4, 8 and 16 bits, which is the equivalent of 16 1-bit shifts per level, or 80 in total.


On an x64 machine, other than the fetches of the original booleans a1[k], I'd expect the compiler to be able to schedule all the rest of the scalars into registers, leaving 32 memory fetches, 31 shifts and 31 adds. It's pretty hard to avoid the fetches (if the original booleans are scattered around), and the shifts/adds match the obvious simple loop. But there is no loop, so we avoid 32 increment/compare/index operations.


If the starting booleans are really in an array, with each bit occupying the bottom bit of an otherwise zeroed byte:


bool a1[32];

then we can abuse our knowledge of memory layout to fetch several at a time:


a4[0]=((unsigned int *)a1)[0]; // picks up 4 bools in one fetch
a4[1]=((unsigned int *)a1)[1];
// ... a4[2] through a4[6] likewise
a4[7]=((unsigned int *)a1)[7];




Here our cost is 8 fetches of (sets of 4) booleans, 7 shifts and 7 adds. Again, no loop overhead. (Again there is an obvious generalization to 64 bits).


To get faster than this, you probably have to drop into assembler and use some of the many wonderful and weird instructions available there (the vector registers probably have scatter/gather ops that might work nicely).


As always, these solutions need to be performance tested.




If sizeof(bool) == 1, you can pack 8 bools at a time into 8 bits (more with 128-bit multiplications) using a multiplication trick, provided the machine has fast multiplication.


Suppose the bools a[0] to a[7] have their least significant bits named a through h respectively. Loading those 8 consecutive bools as one 64-bit word on a little-endian machine gives us the bits in reversed order, with a[0] in the lowest byte. Now we do a multiplication (dots are zero bits):


  |  a7  ||  a6  ||  a5  ||  a4  ||  a3  ||  a2  ||  a1  ||  a0  |
  ↑......h.↑.....g..↑....f...↑...e....↑..d.....↑.c......↑b.......a
x 1000000001000000001000000001000000001000000001000000001000000001
= abcdefghxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

The arrows mark the positions of the set bits in the magic number. After the multiplication, the 8 least significant bits of the bools sit in the top byte; we just need to shift them down and mask out the remaining bits.


So by using the magic number 0b1000000001000000001000000001000000001000000001000000001000000001 or 0x8040201008040201 we have the following code


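#include <stdbool.h>
#include <stdint.h>

static inline uint32_t pack8b(const bool *a)
{
    uint64_t t = *((const uint64_t *)a);  // requires 8-byte alignment
    return (uint32_t)((0x8040201008040201ULL * t) >> 56) & 0xFF;
}
uint32_t pack32b(const bool *a)  // unsigned, so the << 24 cannot overflow
{
    return (pack8b(a) << 24) | (pack8b(a + 8) << 16) | (pack8b(a + 16) << 8) | pack8b(a + 24);
}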

Of course you need to make sure that the bool array is correctly 8-byte aligned. You can also unroll the code and optimize it, for example by shifting each product directly to its final position instead of shifting right by 56 bits and then left again.
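
If alignment cannot be guaranteed, one option is to load the word through memcpy, which compilers typically turn into a single load. The sketch below assumes one-byte elements holding 0 or 1 and a little-endian machine; the names pack8_mul and pack32_mul are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Packs 8 bytes (each 0 or 1); p[0] lands in bit 7, p[7] in bit 0.
   Assumes a little-endian machine. */
static inline uint32_t pack8_mul(const uint8_t *p) {
    uint64_t t;
    memcpy(&t, p, sizeof t);  /* no alignment or aliasing requirement */
    /* The eight wanted bits end up in the top byte of the product. */
    return (uint32_t)((0x8040201008040201ULL * t) >> 56);
}

/* Packs 32 bytes; p[0] lands in bit 31, p[31] in bit 0. */
static inline uint32_t pack32_mul(const uint8_t *p) {
    return (pack8_mul(p) << 24) | (pack8_mul(p + 8) << 16) |
           (pack8_mul(p + 16) << 8) | pack8_mul(p + 24);
}
```

Note that this keeps the most-significant-first order of the multiplication trick, which is the reverse of what the question asks for.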


Sorry I overlooked the question and saw doynax's bool array as well as misread "32 0/1 values" and thought they're 32 bools. Of course the same technique can also be used to pack 4 uint32_t at the same time with 128-bit multiplication, or 2 at a time with normal 64-bit multiplication, but it's a lot less efficient than packing bytes


On newer x86 CPUs with BMI2 the PEXT instruction can be used. The pack8b function above can be replaced with


_pext_u64(*((uint64_t*)a), 0x0101010101010101ULL);

And to pack 2 uint32_t as the question requires use


_pext_u64(*((uint64_t*)a), (1ULL << 32) | 1ULL);
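
A hedged sketch of how the byte-wise PEXT variant could be wrapped into a full 32-element pack. PEXT gathers the selected bits starting from the lowest one, so here p[0] lands in bit 0, which is the order the question asks for. The software routine pext64_soft is an illustrative stand-in used when the build does not target BMI2; the function names and the little-endian assumption are mine.

```c
#include <stdint.h>
#include <string.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Software stand-in for PEXT: gathers the bits of src selected by mask
   into the low bits of the result, lowest selected bit first. */
static inline uint64_t pext64_soft(uint64_t src, uint64_t mask) {
    uint64_t out = 0;
    unsigned k = 0;
    for (; mask != 0; mask &= mask - 1) {       /* clear the lowest set bit */
        uint64_t lowest = mask & (~mask + 1);   /* isolate the lowest set bit */
        if (src & lowest)
            out |= (uint64_t)1 << k;
        ++k;
    }
    return out;
}

/* Packs 32 bytes (each 0 or 1) so that bit i of the result equals p[i].
   Assumes a little-endian machine. */
static inline uint32_t pack32_pext(const uint8_t *p) {
    uint32_t b = 0;
    for (unsigned i = 0; i < 4; ++i) {
        uint64_t t;
        memcpy(&t, p + 8 * i, sizeof t);        /* alignment-safe 8-byte load */
#if defined(__BMI2__) && defined(__x86_64__)
        uint64_t packed = _pext_u64(t, 0x0101010101010101ULL);
#else
        uint64_t packed = pext64_soft(t, 0x0101010101010101ULL);
#endif
        b |= (uint32_t)packed << (8 * i);
    }
    return b;
}
```

With GCC or clang the intrinsic path is only taken when compiling with -mbmi2 (or a -march setting that implies it).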



I would probably go for this:


#include <stdio.h>

unsigned a[32] =
{
    1, 0, 0, 1, 1, 1, 0 ,0, 1, 0, 0, 0, 1, 1, 0, 0
    , 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1
};

int main()
{
    unsigned b = 0;

    for(unsigned i = 0; i < sizeof(a) / sizeof(*a); ++i)
        b |= a[i] << i;

    printf("b: %u\n", b);
}

Compiler optimization may well unroll that but just in case you can always try:


int main()
{
    unsigned b = 0;

    b |= a[0];
    b |= a[1] << 1;
    b |= a[2] << 2;
    b |= a[3] << 3;
    // ... etc
    b |= a[31] << 31;

    printf("b: %u\n", b);
}



To determine what the fastest way is, time all of the various suggestions. Here is one that may well end up as "the" fastest (using standard C, no processor-dependent SSE or the like):


unsigned int bits[32][2] = {
    {0, 0x80000000}, {0, 0x40000000}, /* ... one entry per bit ... */ {0, 0x00000001}
};
unsigned int b = 0;
for (unsigned i = 0; i < 32; i++)
     b |= bits[i][a[i]];

The first value in the array is to be the leftmost bit: the highest possible value.
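
Rather than typing out all 32 entries, the table can also be filled at run time. This is a minimal sketch under that assumption; the names init_bits and pack32_table are illustrative, and the table follows the leftmost-bit-first convention described above.

```c
/* bits[i][0] is always 0; bits[i][1] is the mask for element i,
   with element 0 mapped to the most significant bit. */
static unsigned int bits[32][2];

/* Fills the lookup table; call once before pack32_table. */
void init_bits(void) {
    for (unsigned i = 0; i < 32; ++i) {
        bits[i][0] = 0;
        bits[i][1] = 0x80000000u >> i;
    }
}

/* ORs together the table entry selected by each element. */
unsigned int pack32_table(const unsigned int a[32]) {
    unsigned int b = 0;
    for (unsigned i = 0; i < 32; ++i)
        b |= bits[i][a[i]];   /* a[i] must be 0 or 1 */
    return b;
}
```

The lookup replaces the shift with a memory access, so whether it wins depends on how the target CPU handles variable shifts versus cached loads.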


Proof-of-concept testing with some rough timings shows this is indeed not orders of magnitude better than the straightforward loop with b |= (a[i]<<(31-i)):


Ira                   3618 ticks
naive, unrolled       5620 ticks
Ira, 1-shifted       10044 ticks
Galik                10265 ticks
Jongware, using adds 12536 ticks
Jongware             12682 ticks
naive                13373 ticks

(Relative timings, with the same compiler options.)


(The 'adds' routine is mine with indexing replaced with a pointer-to and an explicit add for both indexed arrays. It is 10% slower, meaning my compiler is efficiently optimizing indexed access. Good to know.)




unsigned b=0;
for(int i=31; i>=0; --i){



Your problem is a good opportunity to use -->, also called the downto operator:


unsigned int a[32];
unsigned int b = 0;
for (unsigned int i = 32; i --> 0;) {
    b += b + a[i];
}

The advantage of using --> is it works with both signed and unsigned loop index types.


This approach is portable and readable. It might not produce the fastest code, but clang does unroll the loop and produces decent performance; see https://godbolt.org/g/6xgwLJ
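
Wrapped in a function, the same loop could look like this (the name pack32_downto is an illustrative choice). Because it walks from a[31] down to a[0] and doubles b at each step, a[0] ends up in bit 0, as the question requires.

```c
/* Packs 32 0/1 values so that bit i of the result equals a[i]. */
unsigned int pack32_downto(const unsigned int a[32]) {
    unsigned int b = 0;
    for (unsigned int i = 32; i-- > 0;) {
        b += b + a[i];   /* same effect as b = (b << 1) | a[i] */
    }
    return b;
}
```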




