How to count the number of set bits in a 32-bit integer?

Date: 2021-09-11 02:59:49

8 bits representing the number 7 look like this:

00000111

Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?

50 answers

#1


740  

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed).

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.

If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.

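For example, if set bits are expected to be sparse, Brian Kernighan's trick (shown in several answers below) runs one loop iteration per set bit rather than per bit position. A minimal sketch (the function name is mine):

int sparse_popcount(unsigned int x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   // clears the lowest set bit
        ++count;
    }
    return count;
}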

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

int numberOfSetBits(int i)
{
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.


This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

References:

https://graphics.stanford.edu/~seander/bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)

#2


176  

Also consider the built-in functions of your compilers.

On the GNU compiler for example you can just use:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a cpu instruction to do the same job faster.

The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have had popcount for years.
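As an illustration (the wrapper name and structure are mine, not a standard API), you can use the intrinsic where available and fall back to the bit-hack elsewhere:

#include <cstdint>

// Illustrative wrapper: GCC/clang builtin where available, SWAR fallback otherwise.
inline unsigned popcount32(std::uint32_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(x);
#else
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    return (((x + (x >> 4)) & 0x0F0F0F0Fu) * 0x01010101u) >> 24;
#endif
}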


On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt or -msse4.2 to also enable the vector instructions that were added in the same generation. See GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.


Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well for x86 with gcc/clang/icc.

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

#3


163  

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

These rely on specific data type sizes, so they're not that portable. But since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

#4


90  

From Hacker's Delight, p. 66, Figure 5-2

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in ~20-ish instructions (arch dependent), no branching.

Hacker's Delight is delightful! Highly recommended.

#5


66  

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;
}

It works because you can count the total number of set bits by dividing the number into two halves, counting the number of set bits in both halves and then adding them up. This is also known as the divide-and-conquer paradigm. Let's go into detail:

v = v - ((v >> 1) & 0x55555555); 

The count of set bits in a two-bit field can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits:

 -----------------------------------------------------
 |   v    |  x = (v >> 1) & 0b0101   |     v - x     |
 -----------------------------------------------------
   0b00              0b00                  0b00
   0b01              0b00                  0b01
   0b10              0b01                  0b01
   0b11              0b01                  0b10

This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit value is >= 2 (0b10), the AND produces 0b01, else it produces 0b00; subtracting that from v leaves the bit count.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits, now we sum up that count in every 4 bits.

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above result, giving us the total count of set bits in 4 bits. The last statement is the most tricky.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are counting the set bits in groups of 4 instead. We know, because of our previous operations, that every nibble holds the count of set bits in it. Let's look at an example. Suppose we have the byte 0b01000010. It means the first nibble holds a count of 4 and the second a count of 2. Now we add those nibbles together.

0b01000010 + 0b00000100

This gives us the count of set bits of a byte in its low nibble (here 0b01000110, whose low nibble is 0b0110 = 6), and therefore we mask out the high four bits of every byte in the number (discarding them).

0b01000110 & 0x0F = 0b00000110

Now every byte has the count of set bits in it. We need to add them all up together. The trick is to multiply the result by 0x01010101, which has an interesting property. If our number has four bytes, A B C D, it will result in a new number with these bytes A+B+C+D, B+C+D, C+D, D (from most to least significant). A 4-byte number can have a maximum of 32 bits set, which can be represented as 0b00100000, so the per-byte sums fit in a byte without overflow.

All we need now is the first byte which has the sum of all set bits in all the bytes, and we get it by >> 24. This algorithm was designed for 32 bit words but can be easily modified for 64 bit words.

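As a concrete check (example values mine), suppose the per-byte counts are A=1, B=2, C=3, D=4:

v              = 0x01020304   // per-byte counts: A=1, B=2, C=3, D=4
v * 0x01010101 = 0x0A090704   // top byte = A+B+C+D = 10
result >> 24   = 10           // total set bits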

#6


53  

I got bored, and timed a billion iterations of three approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds:

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}
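
(The table body is elided above. One way to fill it at startup, as a sketch of mine, assuming wordbits is declared without the literal initializer:)

static void init_wordbits(void)   // illustrative initializer, not part of the original
{
    for (unsigned int i = 0; i < 65536; ++i) {
        unsigned int v = i, c = 0;
        while (v) { v &= v - 1; ++c; }   // count the bits of i
        wordbits[i] = (unsigned char)c;
    }
}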

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else then use the first approach. If you care, but not enough to spend 64 KB of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

#7


51  

If you happen to be using Java, the built-in method Integer.bitCount will do that.

#8


28  

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one sizeof(unsigned long) == 8

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That latter number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc(); 

256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 billion bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20%. Now at 800ms for 32 billion bits.

#9


28  

unsigned int count_bit(unsigned int x)
{
  x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
  x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
  x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
  return x;
}

Let me explain this algorithm.

This algorithm is based on divide and conquer. Suppose there is an 8-bit integer 213 (11010101 in binary). The algorithm works like this (each time, it merges two neighbouring blocks):

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
|  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
|    0 0 1 1    |    0 0 1 0    |  <- second time merge
|        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
+-------------------------------+

#10


21  

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2  
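
(A direct C rendering of the pseudocode, as a sketch; the name is mine, and using unsigned sidesteps the negative-input question raised in a later answer:)

int popcount_div2(unsigned int n)
{
    int count = 0;
    while (n > 0) {
        if (n % 2 == 1)       // low bit is set
            count += 1;
        n /= 2;               // same as n >>= 1 for unsigned
    }
    return count;
}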

I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue though that "best" should have an element of clarity.

#11


19  

For a happy medium between a 2^32-entry lookup table and iterating through each bit individually:

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/CountBits

#12


19  

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.

unsigned int bitCount(unsigned int x)
{
  x = (((x >> 1) & 0b01010101010101010101010101010101)
       + x       & 0b01010101010101010101010101010101);
  x = (((x >> 2) & 0b00110011001100110011001100110011)
       + x       & 0b00110011001100110011001100110011); 
  x = (((x >> 4) & 0b00001111000011110000111100001111)
       + x       & 0b00001111000011110000111100001111); 
  x = (((x >> 8) & 0b00000000111111110000000011111111)
       + x       & 0b00000000111111110000000011111111); 
  x = (((x >> 16)& 0b00000000000000001111111111111111)
       + x       & 0b00000000000000001111111111111111); 
  return x;
}

The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

#13


16  

It's not the fastest or best solution, but I found the same question on my way, and I started to think and think. Finally I realized that it can be done like this if you approach the problem from the mathematical side and draw a graph: you find that it's a function which has some periodic part, and then you notice the difference between the periods... so here you go:

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}
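
A worked trace (mine): x % 4 extracts the low two bits and x / 4 shifts them away, so for x = 13 (0b1101, three set bits):

f(13) = f(13/4) + f(13%4) = f(3) + f(1)
      = 2 + 1
      = 3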

#14


15  

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

#15


10  

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.

#16


9  

A few open questions:

  1. What if the number is negative?
  2. If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows:

count = 0
while n != 0
  if ((n % 2) == 1 || (n % 2) == -1)
    count += 1
  n /= 2
return count

Now, to overcome the second problem, we can write the algorithm like this:

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

For a complete reference see:

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

#17


7  

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and it has a constant execution time. The code is also very short.

But if speed is the major factor and not the code size, then I think the following can be faster:

static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };

static int bitCountOfByte( int value ){
    return BIT_COUNT[ value & 0xFF ];
}

static int bitCountOfInt( int value ){
    return bitCountOfByte( value )
         + bitCountOfByte( value >> 8 )
         + bitCountOfByte( value >> 16 )
         + bitCountOfByte( value >> 24 );
}

I don't think this will be faster for a 64-bit value, but for a 32-bit value it can be faster.

#18


7  

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same amount of cycles.) It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound on the number of cycles or instructions, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, so log_2(32) = 5, 5 x 3 = 15 instructions is a quasi-lowerbound.

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;
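
Usage is by macro expansion rather than a function call; a small illustration (values mine):

unsigned int x = 0xE7, y;   /* 0b11100111 */
BitCount(x, y);             /* y == 6 */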

Here is a secret to the first and most complex step:

input output
AB    CD             Note
00    00             = AB
01    01             = AB
10    01             = AB - (A >> 1) & 0x1
11    10             = AB - (A >> 1) & 0x1

so if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.

  • Don Gillies

#19


7  

If you're using C++, another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
        // return the least significant bit plus the result of calling ourselves with
        // .. the shifted value
        return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
        return val & 0x1;
}

usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

you could of course further expand this template to use different types (even auto-detecting bit size) but I've kept it simple for clarity.

edit: forgot to mention - this is good because it should work in any C++ compiler, and it basically just unrolls the loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find)

#20


7  

I think the Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, then it will only go once through the loop.

int countSetBits(unsigned int n) {
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n > 0; n = n & (n - 1))
        c++;
    return c;
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

#21


7  

I use the code below, which is more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

Logic : n & (n-1) resets the last set bit of n.

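For example (illustration mine):

n         = 0b101100
n - 1     = 0b101011
n & (n-1) = 0b101000   // lowest set bit cleared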

P.S.: I know this is not an O(1) solution, albeit an interesting one.

#22


6  

I'm particularly fond of this example from the fortune file:

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777)  \
                             - (((x)>>2)&0x33333333) \
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!

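Why it works, briefly: BX_(x) leaves each 4-bit nibble holding the count of bits that were set in it, the next step folds nibble pairs into byte sums, and % 255 adds the bytes together (since 256 ≡ 1 mod 255, much like casting out nines). A quick sanity check, assuming <assert.h> (test values mine):

assert(BITCOUNT(0u)          == 0);
assert(BITCOUNT(7u)          == 3);
assert(BITCOUNT(0xFFFFFFFFu) == 32);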

#23


6  

Java JDK1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

Check also:

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

// Beginning with the value 1, rotate left 16 times
n = 1;
for (int i = 0; i < 16; i++) {
    n = Integer.rotateLeft(n, 1);
    System.out.println(n);
}

#24


6  

I found an implementation of bit counting over an array using SIMD instructions (SSSE3 and AVX2). It has 2-2.5 times better performance than using the __popcnt64 intrinsic function.

SSSE3 version:

#include <smmintrin.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum =  _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}
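
A usage sketch (mine): the loop consumes 16 bytes per step, so this assumes size is a multiple of 16; a real caller would pad the buffer or handle the tail scalar-wise.

#include <vector>

std::vector<uint8_t> buf(1024, 0xFF);               // 1024 bytes, every bit set
uint64_t ones = BitCount(buf.data(), buf.size());   // 1024 * 8 == 8192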

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

#25


5  

There are many algorithms to count the set bits, but I think the best one is the faster one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
     % 0x1f;

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 
     0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

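A quick check of option 1 (values mine): the mask isolates one count per 4-bit group, and % 0xf sums those base-16 digits, since 16 ≡ 1 (mod 15); this is also why the option is limited to 14-bit inputs (a count of 15 would wrap to 0).

unsigned int v = 0x2AAA;  // 0b10101010101010, a 14-bit value with 7 bits set
unsigned int c = (unsigned int)((v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
// c == 7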

#26


5  

Here is a portable module ( ANSI-C ) which can benchmark each of your algorithms on any architecture.

Your CPU has 9-bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte-wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure out a way to make the "Hacker's Delight" algorithm portable, feel free to add it in.

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Weight of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTable: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String representations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

And the implementation:

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo random numbers to benchmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d psudoe-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif

#27


5  

private int get_bits_set(int v)
{
    int c; // c accumulates the total bits set in v
    for (c = 0; v != 0; c++)
    {
        v &= v - 1; // clear the least significant bit set
    }
    return c;
}

#28


4  

32-bit or not? I just came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.

public static int bitCount( int n){
    int count = 0;
    for (int i=n; i!=0; i = i >>> 1){   // >>> (unsigned shift) so negative n terminates too
        count += i & 1;
    }
    return count;
}

I think this one is more intuitive than the solutions with the constant 0x33333333, no matter how fast they are. It depends on your definition of "best algorithm".

#29


4  

Fast C# solution using a pre-calculated table of byte bit counts, with branching on input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS = 
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

#30


4  

I always use this in competitive programming. It's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}

#1


740  

This is known as the 'Hamming Weight', 'popcount' or 'sideways addition'.

这就是所谓的“汉明重量”、“popcount”或“横向加法”。

The 'best' algorithm really depends on which CPU you are on and what your usage pattern is.

“最佳”算法实际上取决于您所使用的CPU以及您的使用模式是什么。

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. The parallel instructions (like x86's popcnt, on CPUs where it's supported) will almost certainly be fastest. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed).

一些cpu有一个内置的指令来执行它,而另一些cpu则有并行指令,这些指令作用于位向量。并行指令(像x86的popcnt,在它支持的cpu上)几乎肯定是最快的。一些其他的体系结构可能有一个缓慢的指令,用一个微编码的循环来实现,这个循环测试每个周期(需要引用)。

A pre-populated table lookup method can be very fast if your CPU has a large cache and/or you are doing lots of these instructions in a tight loop. However it can suffer because of the expense of a 'cache miss', where the CPU has to fetch some of the table from main memory.

如果您的CPU有一个大的缓存并且/或者您在一个紧密的循环中做了大量的这些指令,那么一个预填充的表查找方法可能非常快。然而,由于“缓存缺失”,CPU不得不从主内存中取出一些表,因此它可能会遭受损失。

If you know that your bytes will be mostly 0's or mostly 1's then there are very efficient algorithms for these scenarios.

如果你知道你的字节大部分是0,或者大部分是1,那么对于这些场景,有非常高效的算法。

I believe a very good general purpose algorithm is the following, known as 'parallel' or 'variable-precision SWAR algorithm'. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and >>> in Java):

我认为一个非常好的通用算法是以下的,被称为“并行”或“可变精度的SWAR算法”。我已经用C-like的伪语言表达了这一点,您可能需要调整它来为特定的语言工作(例如,在Java中使用uint32_t为c++和>>>):

int numberOfSetBits(int i)
{
     // Java: use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;
}

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it.

这是所讨论的任何算法中最坏的情况,因此将有效地处理您抛出的任何使用模式或值。


This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

这个bitwise-SWAR算法可以同时在多个矢量元素上并行执行,而不是在单个整数寄存器中,在cpu上使用SIMD进行加速,但是不能使用popcount指令。(例如,必须在任何CPU上运行的x86-64代码,而不仅仅是Nehalem或稍后。)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

然而,使用向量指令的最佳方式通常是使用可变洗牌来在每个字节的时间并行地查找4位。(4位索引在向量寄存器中保存的16个输入表)。

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

在Intel cpu上,硬件64位popcnt指令可以比SSSE3 PSHUFB并行实现的性能好大约2倍,但前提是编译器正确。否则,SSE就会大大领先。更新的编译器版本知道了关于Intel的popcnt错误依赖问题。

References:

引用:

https://graphics.stanford.edu/~seander/bithacks.html

https://graphics.stanford.edu/ ~ seander / bithacks.html

https://en.wikipedia.org/wiki/Hamming_weight

https://en.wikipedia.org/wiki/Hamming_weight

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://gurmeet.net/puzzles/fast-bit-counting-routines/

http://aggregate.ee.engr.uky.edu/MAGIC/#Population%20Count%20(Ones%20Count)

http://aggregate.ee.engr.uky.edu/MAGIC/人口% 20计数% 20(% 20计数)

#2


176  

Also consider the built-in functions of your compilers.

还要考虑编译器的内置函数。

On the GNU compiler for example you can just use:

例如,在GNU编译器上,您可以使用:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case the compiler will generate a call to a function. In the best case the compiler will emit a cpu instruction to do the same job faster.

在最坏的情况下,编译器会生成一个函数调用。在最好的情况下,编译器会发出cpu指令以更快地完成相同的工作。

The GCC intrinsics even work across multiple platforms. Popcount will become mainstream in the x86 architecture, so it makes sense to start using the intrinsic now. Other architectures have the popcount for years.

GCC的特性甚至可以跨多个平台工作。Popcount将在x86体系结构中成为主流,所以现在开始使用它的固有特性是有意义的。其他体系结构也有多年的流行。


On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt or -msse4.2 to also enable the vector instructions that were added in the same generation. See GCC x86 options. -march=nehalem (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

在x86上,您可以告诉编译器它可以假定支持popcnt指令,使用-mpopcnt或-msse4.2,也可以支持在同一代中添加的向量指令。看到GCC x86选项。-march=nehalem(或-march=任何你想要你的代码假设和调整的CPU)可能是一个不错的选择。在较旧的CPU上运行生成的二进制文件将导致非法指令错误。

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

为了使二进制文件对您构建的机器进行优化,可以使用-march=native(与gcc、clang或ICC)。

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it's really an intrinsic for the hardware instruction and requires hardware support.

MSVC提供了x86 popcnt指令的内在特性,但与gcc不同,它实际上是硬件指令的内在要求,需要硬件支持。


Using std::bitset<>::count() instead of a built-in

使用std::bitset<>::count()而不是内置的。

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

理论上,任何知道如何有效地对目标CPU进行popcount的编译器都应该通过ISO c++ std::bitset<>来公开该功能。在实践中,您可能会更好地使用bithack和/shift/添加一些针对某些目标cpu的情况。

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, although technically there is a separate feature bit for popcnt.)

对于硬件popcount是可选扩展(如x86)的目标架构,并非所有编译器都有std::在可用时利用它的bitset。例如,MSVC没有办法在编译时启用popcnt支持,并且总是使用一个表查找,即使是/Ox /arch:AVX(这意味着SSE4.2,尽管从技术上讲,popcnt有一个单独的特性位)。

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

但是,至少你得到了一个可以在任何地方工作的便携设备,并且在gcc/clang中有正确的目标选项,你可以得到支持它的架构的硬件popcount。

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
{
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();
}

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

在Godbolt编译器浏览器上看到来自gcc、clang、icc和MSVC的asm。

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

x86-64 gcc -O3 -std=gnu++11 -mpopcnt发出:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax
    ret
unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax
    ret
unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi
    ret

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

PowerPC64 gcc -O3 -std=gnu++11发出(对于int arg版本):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount
    blr

This source isn't x86-specific or GNU-specific at all, but only compiles well for x86 with gcc/clang/icc.

这个源不是x86特定的或特定于gnui的,但是只对x86和gcc/clang/icc进行了良好的编译。

Also note that gcc's fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn't wonderful for ARM, for example.

还请注意,没有单指令popcount的架构的回退是一个字节-at- time表查找。举个例子,这对ARM来说并不好。

#3


163  

In my opinion, the "best" solution is the one that can be read by another programmer (or the original programmer two years later) without copious comments. You may well want the fastest or cleverest solution which some have already provided but I prefer readability over cleverness any time.

在我看来,“最好”的解决方案是在没有大量评论的情况下,可以由另一个程序员(或两年后的原始程序员)阅读的解决方案。你可能想要一些已经提供的最快的或者最聪明的解决方案,但是我更喜欢在任何时候都比聪明更容易阅读。

unsigned int bitCount (unsigned int value) {
    unsigned int count = 0;
    while (value > 0) {           // until all bits are zero
        if ((value & 1) == 1)     // check lower bit
            count++;
        value >>= 1;              // shift bits, removing lower bit
    }
    return count;
}

If you want more speed (and assuming you document it well to help out your successors), you could use a table lookup:

如果你想要更快的速度(并且假设你能很好地记录它来帮助你的继任者),你可以使用一个表格查找:

// Lookup table for fast calculation of bits set in 8-bit unsigned char.

static unsigned char oneBitsInUChar[] = {
//  0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F (<- n)
//  =====================================================
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, // 0n
    1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, // 1n
    : : :
    4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8, // Fn
};

// Function for fast calculation of bits set in 16-bit unsigned short.

unsigned char oneBitsInUShort (unsigned short x) {
    return oneBitsInUChar [x >>    8]
         + oneBitsInUChar [x &  0xff];
}

// Function for fast calculation of bits set in 32-bit unsigned int.

unsigned char oneBitsInUInt (unsigned int x) {
    return oneBitsInUShort (x >>     16)
         + oneBitsInUShort (x &  0xffff);
}

Although these rely on specific data type sizes so they're not that portable. But, since many performance optimisations aren't portable anyway, that may not be an issue. If you want portability, I'd stick to the readable solution.

尽管它们依赖于特定的数据类型,所以它们并不是那么便携。但是,由于许多性能优化是不可移植的,这可能不是问题。如果您想要可移植性,我将坚持可读的解决方案。

#4


90  

From Hacker's Delight, p. 66, Figure 5-2

从黑客的喜悦,p. 66,图5-2。

int pop(unsigned x)
{
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    x = x + (x >> 16);
    return x & 0x0000003F;
}

Executes in ~20-ish instructions (arch dependent), no branching.

Hacker's Delight is delightful! Highly recommended.

#5


66  

I think the fastest way—without using lookup tables and popcount—is the following. It counts the set bits with just 12 operations.

int popcount(int v) {
    v = v - ((v >> 1) & 0x55555555);                // put count of each 2 bits into those 2 bits
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // put count of each 4 bits into those 4 bits  
    return (((v + (v >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24; // sum the per-byte counts, then take the top byte
}

It works because you can count the total number of set bits by dividing the number into two halves, counting the set bits in each half, and then adding the two counts together. This is also known as the divide-and-conquer paradigm. Let's get into the details...

v = v - ((v >> 1) & 0x55555555); 

The count of set bits in a two-bit field can be 0b00, 0b01 or 0b10. Let's try to work this out on 2 bits:

 ------------------------------------------------
 |   v    |  x = (v >> 1) & 0b01   |   v - x    |
 ------------------------------------------------
    0b00             0b00               0b00
    0b01             0b00               0b01
    0b10             0b01               0b01
    0b11             0b01               0b10

This is what was required: the last column shows the count of set bits in every two-bit pair. If the two-bit value is >= 2 (0b10), then (v >> 1) & 0b01 produces 0b01; otherwise it produces 0b00.

v = (v & 0x33333333) + ((v >> 2) & 0x33333333); 

This statement should be easy to understand. After the first operation we have the count of set bits in every two bits; now we sum up those counts in every 4 bits.

v & 0b00110011         //masks out even two bits
(v >> 2) & 0b00110011  // masks out odd two bits

We then sum up the above results, giving us the total count of set bits in each 4-bit group. The last statement is the trickiest.

c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;

Let's break it down further...

v + (v >> 4)

It's similar to the second statement; we are summing the counts in groups of 4 bits instead. We know, because of our previous operations, that every nibble now holds the count of set bits of its original four bits. Let's look at an example. Suppose we have the byte 0b01000010. It means the first nibble holds the count 4 and the second nibble holds the count 2. Now we add those nibbles together.

0b01000010 + 0b00000100 = 0b01000110

This puts the count of set bits for the whole byte into its low nibble: the low nibble of 0b01000110 is 0b0110 = 6 = 4 + 2. The high nibble now contains garbage, so we mask each byte to keep only its low nibble (discarding the rest).

0b01000110 & 0x0F = 0b00000110

Now every byte holds its count of set bits. We need to add them all up. The trick is to multiply the result by 0x01010101, which has an interesting property: if our number has four bytes, A B C D, the multiplication yields a new number whose bytes are A+B+C+D, B+C+D, C+D, D. A 4-byte number can have at most 32 bits set, and 32 = 0b00100000 fits in a single byte, so the per-byte sums cannot overflow.

All we need now is the top byte, which holds the sum of the set bits of all the bytes, and we extract it with >> 24. This algorithm was designed for 32-bit words but can easily be modified for 64-bit words.

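For example, a 64-bit variant might look like the following sketch (my adaptation, assuming uint64_t from <stdint.h>; the masks are widened to 64 bits and the final shift becomes 56 to pull out the top byte):

#include <stdint.h>

int popcount64(uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);                           // 2-bit counts
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL); // 4-bit counts
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // byte counts
    return (int)((v * 0x0101010101010101ULL) >> 56);                      // sum bytes, take top byte
}
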
#6


53  

I got bored, and timed a billion iterations of several approaches. Compiler is gcc -O3. CPU is whatever they put in the 1st gen Macbook Pro.

Fastest is the following, at 3.7 seconds:

static unsigned char wordbits[65536] = { bitcounts of ints between 0 and 65535 };
static int popcount( unsigned int i )
{
    return( wordbits[i&0xFFFF] + wordbits[i>>16] );
}
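
The { bitcounts of ints between 0 and 65535 } initializer is shorthand; the answer doesn't show how the table gets filled. One way to populate it at startup (my sketch, not part of the original benchmark) is:

static void init_wordbits(void)
{
    for (unsigned int i = 0; i < 65536; ++i) {
        unsigned int n = i;
        unsigned char c = 0;
        while (n) { c += n & 1; n >>= 1; }  // a naive count is fine for one-time init
        wordbits[i] = c;
    }
}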

Second place goes to the same code but looking up 4 bytes instead of 2 halfwords. That took around 5.5 seconds.

Third place goes to the bit-twiddling 'sideways addition' approach, which took 8.6 seconds.

Fourth place goes to GCC's __builtin_popcount(), at a shameful 11 seconds.

The counting one-bit-at-a-time approach was waaaay slower, and I got bored of waiting for it to complete.

So if you care about performance above all else, then use the first approach. If you care, but not enough to spend 64KB of RAM on it, use the second approach. Otherwise use the readable (but slow) one-bit-at-a-time approach.

It's hard to think of a situation where you'd want to use the bit-twiddling approach.

Edit: Similar results here.

#7


51  

If you happen to be using Java, the built-in method Integer.bitCount will do that.

#8


28  

This is one of those questions where it helps to know your micro-architecture. I just timed two variants under gcc 4.3.3 compiled with -O3 using C++ inlines to eliminate function call overhead, one billion iterations, keeping the running sum of all counts to ensure the compiler doesn't remove anything important, using rdtsc for timing (clock cycle precise).

inline int pop2(unsigned x, unsigned y)
{
    x = x - ((x >> 1) & 0x55555555);
    y = y - ((y >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    y = (y & 0x33333333) + ((y >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F;
    x = x + (x >> 8);
    y = y + (y >> 8);
    x = x + (x >> 16);
    y = y + (y >> 16);
    return (x+y) & 0x000000FF;
}

The unmodified Hacker's Delight took 12.2 gigacycles. My parallel version (counting twice as many bits) runs in 13.0 gigacycles. 10.5s total elapsed for both together on a 2.4GHz Core Duo. 25 gigacycles = just over 10 seconds at this clock frequency, so I'm confident my timings are right.

This has to do with instruction dependency chains, which are very bad for this algorithm. I could nearly double the speed again by using a pair of 64-bit registers. In fact, if I was clever and added x+y a little sooner I could shave off some shifts. The 64-bit version with some small tweaks would come out about even, but count twice as many bits again.

With 128 bit SIMD registers, yet another factor of two, and the SSE instruction sets often have clever short-cuts, too.

There's no reason for the code to be especially transparent. The interface is simple, the algorithm can be referenced on-line in many places, and it's amenable to comprehensive unit test. The programmer who stumbles upon it might even learn something. These bit operations are extremely natural at the machine level.

OK, I decided to bench the tweaked 64-bit version. For this one, sizeof(unsigned long) == 8.

inline int pop2(unsigned long x, unsigned long y)
{
    x = x - ((x >> 1) & 0x5555555555555555);
    y = y - ((y >> 1) & 0x5555555555555555);
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
    y = (y & 0x3333333333333333) + ((y >> 2) & 0x3333333333333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
    y = (y + (y >> 4)) & 0x0F0F0F0F0F0F0F0F;
    x = x + y; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x + (x >> 32); 
    return x & 0xFF;
}

That looks about right (I'm not testing carefully, though). Now the timings come out at 10.70 gigacycles / 14.1 gigacycles. That latter number summed 128 billion bits and corresponds to 5.9s elapsed on this machine. The non-parallel version speeds up a tiny bit because I'm running in 64-bit mode and it likes 64-bit registers slightly better than 32-bit registers.

Let's see if there's a bit more OOO pipelining to be had here. This was a bit more involved, so I actually tested a bit. Each term alone sums to 64, all combined sum to 256.

inline int pop4(unsigned long x, unsigned long y, 
                unsigned long u, unsigned long v)
{
  enum { m1 = 0x5555555555555555, 
         m2 = 0x3333333333333333, 
         m3 = 0x0F0F0F0F0F0F0F0F, 
         m4 = 0x000000FF000000FF };

    x = x - ((x >> 1) & m1);
    y = y - ((y >> 1) & m1);
    u = u - ((u >> 1) & m1);
    v = v - ((v >> 1) & m1);
    x = (x & m2) + ((x >> 2) & m2);
    y = (y & m2) + ((y >> 2) & m2);
    u = (u & m2) + ((u >> 2) & m2);
    v = (v & m2) + ((v >> 2) & m2);
    x = x + y; 
    u = u + v; 
    x = (x & m3) + ((x >> 4) & m3);
    u = (u & m3) + ((u >> 4) & m3);
    x = x + u; 
    x = x + (x >> 8);
    x = x + (x >> 16);
    x = x & m4; 
    x = x + (x >> 32);
    return x & 0x000001FF;
}

I was excited for a moment, but it turns out gcc is playing inline tricks with -O3 even though I'm not using the inline keyword in some tests. When I let gcc play tricks, a billion calls to pop4() takes 12.56 gigacycles, but I determined it was folding arguments as constant expressions. A more realistic number appears to be 19.6gc for another 30% speed-up. My test loop now looks like this, making sure each argument is different enough to stop gcc from playing tricks.

   hitime b4 = rdtsc(); 
   for (unsigned long i = 10L * 1000*1000*1000; i < 11L * 1000*1000*1000; ++i) 
      sum += pop4 (i,  i^1, ~i, i|1); 
   hitime e4 = rdtsc(); 

256 billion bits summed in 8.17s elapsed. Works out to 1.02s for 32 million bits as benchmarked in the 16-bit table lookup. Can't compare directly, because the other bench doesn't give a clock speed, but looks like I've slapped the snot out of the 64KB table edition, which is a tragic use of L1 cache in the first place.

Update: decided to do the obvious and create pop6() by adding four more duplicated lines. Came out to 22.8gc, 384 billion bits summed in 9.5s elapsed. So there's another 20%; it's now at 800ms for 32 billion bits.

#9


28  

unsigned int count_bit(unsigned int x)
{
  x = (x & 0x55555555) + ((x >> 1) & 0x55555555);
  x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
  x = (x & 0x0F0F0F0F) + ((x >> 4) & 0x0F0F0F0F);
  x = (x & 0x00FF00FF) + ((x >> 8) & 0x00FF00FF);
  x = (x & 0x0000FFFF) + ((x >> 16)& 0x0000FFFF);
  return x;
}

Let me explain this algorithm.

This algorithm is based on the divide-and-conquer paradigm. Suppose there is an 8-bit integer 213 (11010101 in binary). The algorithm works like this (each time it merges two neighbouring blocks):

+-------------------------------+
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |  <- x
|  1 0  |  0 1  |  0 1  |  0 1  |  <- first time merge
|    0 0 1 1    |    0 0 1 0    |  <- second time merge
|        0 0 0 0 0 1 0 1        |  <- third time ( answer = 00000101 = 5)
+-------------------------------+

#10


21  

Why not iteratively divide by 2?

count = 0
while n > 0
  if (n % 2) == 1
    count += 1
  n /= 2  

I agree that this isn't the fastest, but "best" is somewhat ambiguous. I'd argue, though, that "best" should have an element of clarity.

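For what it's worth, a direct C rendering of that pseudocode (my sketch) would be:

int bitcount(unsigned int n) {
    int count = 0;
    while (n > 0) {
        if (n % 2 == 1)    // low bit is set
            count += 1;
        n /= 2;            // same as n >>= 1 for unsigned n
    }
    return count;
}
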
#11


19  

For a happy medium between a 2^32 lookup table and iterating through each bit individually:

int bitcount(unsigned int num){
    int count = 0;
    static int nibblebits[] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    for(; num != 0; num >>= 4)
        count += nibblebits[num & 0x0f];
    return count;
}

From http://ctips.pbwiki.com/CountBits

#12


19  

The Hacker's Delight bit-twiddling becomes so much clearer when you write out the bit patterns.

unsigned int bitCount(unsigned int x)
{
  // note: (x & mask) needs its own parentheses, because + binds tighter than &
  x = (((x >> 1) & 0b01010101010101010101010101010101)
       + (x      & 0b01010101010101010101010101010101));
  x = (((x >> 2) & 0b00110011001100110011001100110011)
       + (x      & 0b00110011001100110011001100110011));
  x = (((x >> 4) & 0b00001111000011110000111100001111)
       + (x      & 0b00001111000011110000111100001111));
  x = (((x >> 8) & 0b00000000111111110000000011111111)
       + (x      & 0b00000000111111110000000011111111));
  x = (((x >> 16)& 0b00000000000000001111111111111111)
       + (x      & 0b00000000000000001111111111111111));
  return x;
}

The first step adds the even bits to the odd bits, producing a sum of bits in each two. The other steps add high-order chunks to low-order chunks, doubling the chunk size all the way up, until we have the final count taking up the entire int.

#13


16  

It's not the fastest or best solution, but I found the same question in my way, and I started to think and think. Finally I realized that it can be done like this if you approach the problem from the mathematical side and draw a graph: you find that it's a function which has a periodic part, and then you realize the difference between the periods... so here you go:

unsigned int f(unsigned int x)
{
    switch (x) {
        case 0:
            return 0;
        case 1:
            return 1;
        case 2:
            return 1;
        case 3:
            return 2;
        default:
            return f(x/4) + f(x%4);
    }
}

#14


15  

This can be done in O(k), where k is the number of bits set.

int NumberOfSetBits(int n)
{
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}

#15


10  

The function you are looking for is often called the "sideways sum" or "population count" of a binary number. Knuth discusses it in pre-Fascicle 1A, pp11-12 (although there was a brief reference in Volume 2, 4.6.3-(7).)

The locus classicus is Peter Wegner's article "A Technique for Counting Ones in a Binary Computer", from the Communications of the ACM, Volume 3 (1960) Number 5, page 322. He gives two different algorithms there, one optimized for numbers expected to be "sparse" (i.e., have a small number of ones) and one for the opposite case.

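As a rough illustration (my sketch, not Wegner's original code): the sparse-friendly algorithm does one iteration per set bit by clearing the lowest set bit each time, and the dense case can reuse the same loop on the complement:

/* One iteration per set bit: good when few bits are set. */
int count_sparse(unsigned int n) {
    int c = 0;
    for (; n != 0; n &= n - 1)  /* n & (n - 1) clears the lowest set bit */
        ++c;
    return c;
}

/* Good when most bits are set: count the zero bits via the complement. */
int count_dense(unsigned int n) {
    return 32 - count_sparse(~n);  /* assumes a 32-bit unsigned int */
}
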
#16


9  

A few open questions:

  1. What if the number is negative?
  2. If the number is 1024, then the "iteratively divide by 2" method will iterate 10 times.

We can modify the algorithm to support negative numbers as follows:

count = 0
while n != 0
  if ((n % 2) == 1 || (n % 2) == -1)
    count += 1
  n /= 2
return count

Now, to overcome the second problem, we can write the algorithm like this:

int bit_count(int num)
{
    int count=0;
    while(num)
    {
        num=(num)&(num-1);
        count++;
    }
    return count;
}

For a complete reference, see:

http://goursaha.freeoda.com/Miscellaneous/IntegerBitCount.html

#17


7  

What do you mean by "best algorithm"? The shortest code or the fastest code? Your code looks very elegant and has a constant execution time. The code is also very short.

But if speed is the major factor and not the code size, then I think the following can be faster:

       static final int[] BIT_COUNT = { 0, 1, 1, ... 256 values with a bitsize of a byte ... };
        static int bitCountOfByte( int value ){
            return BIT_COUNT[ value & 0xFF ];
        }

        static int bitCountOfInt( int value ){
            return bitCountOfByte( value ) 
                 + bitCountOfByte( value >> 8 ) 
                 + bitCountOfByte( value >> 16 ) 
                 + bitCountOfByte( value >> 24 );
        }

I think this will not be faster for a 64-bit value, but a 32-bit value can be faster.

#18


7  

I wrote a fast bitcount macro for RISC machines in about 1990. It does not use advanced arithmetic (multiplication, division, %), memory fetches (way too slow), or branches (way too slow), but it does assume the CPU has a 32-bit barrel shifter (in other words, >> 1 and >> 32 take the same number of cycles). It assumes that small constants (such as 6, 12, 24) cost nothing to load into the registers, or are stored in temporaries and reused over and over again.

With these assumptions, it counts 32 bits in about 16 cycles/instructions on most RISC machines. Note that 15 instructions/cycles is close to a lower bound, because it seems to take at least 3 instructions (mask, shift, operator) to cut the number of addends in half, and log_2(32) = 5 halvings are needed, so 5 x 3 = 15 instructions is a quasi-lower bound.

#define BitCount(X,Y)           \
                Y = X - ((X >> 1) & 033333333333) - ((X >> 2) & 011111111111); \
                Y = ((Y + (Y >> 3)) & 030707070707); \
                Y =  (Y + (Y >> 6)); \
                Y = (Y + (Y >> 12) + (Y >> 24)) & 077;

Here is the secret to the first and most complex step:

input  output
 AB      CD      Note
 00      00      = AB
 01      01      = AB
 10      01      = AB - (A >> 1) & 0x1
 11      10      = AB - (A >> 1) & 0x1

So if I take the 1st column (A) above, shift it right 1 bit, and subtract it from AB, I get the output (CD). The extension to 3 bits is similar; you can check it with an 8-row boolean table like mine above if you wish.

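A quick way to sanity-check the macro (my sketch, assuming the BitCount macro above is in scope) is to compare it against a naive loop over a range of inputs:

#include <assert.h>

static int naive_count(unsigned int x) {
    int c = 0;
    for (; x; x >>= 1)
        c += x & 1;
    return c;
}

void check_bitcount(void) {
    for (unsigned int x = 0; x < 1000000u; ++x) {
        unsigned int y;
        BitCount(x, y);
        assert((int)y == naive_count(x));
    }
}
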
  • Don Gillies

#19


7  

If you're using C++, another option is to use template metaprogramming:

// recursive template to sum bits in an int
template <int BITS>
int countBits(int val) {
        // return the least significant bit plus the result of calling ourselves with
        // .. the shifted value
        return (val & 0x1) + countBits<BITS-1>(val >> 1);
}

// template specialisation to terminate the recursion when there's only one bit left
template<>
int countBits<1>(int val) {
        return val & 0x1;
}

usage would be:

// to count bits in a byte/char (this returns 8)
countBits<8>( 255 )

// another byte (this returns 7)
countBits<8>( 254 )

// counting bits in a word/short (this returns 1)
countBits<16>( 256 )

You could of course further expand this template to use different types (even auto-detecting the bit size), but I've kept it simple for clarity.

Edit: I forgot to mention that this is good because it should work in any C++ compiler, and it basically just unrolls the loop for you if a constant value is used for the bit count (in other words, I'm pretty sure it's the fastest general method you'll find).

#20


7  

I think Brian Kernighan's method will be useful too... It goes through as many iterations as there are set bits. So if we have a 32-bit word with only the high bit set, it will only go through the loop once.

int countSetBits(unsigned int n) {
    unsigned int c; // c accumulates the total bits set in n
    for (c = 0; n > 0; n = n & (n - 1))
        c++;
    return c;
}

Published in 1988, the C Programming Language 2nd Ed. (by Brian W. Kernighan and Dennis M. Ritchie) mentions this in exercise 2-9. On April 19, 2006 Don Knuth pointed out to me that this method "was first published by Peter Wegner in CACM 3 (1960), 322. (Also discovered independently by Derrick Lehmer and published in 1964 in a book edited by Beckenbach.)"

#21


7  

I use the code below, which is more intuitive.

int countSetBits(int n) {
    return !n ? 0 : 1 + countSetBits(n & (n-1));
}

Logic: n & (n-1) clears the last set bit of n.

P.S.: I know this is not an O(1) solution, albeit an interesting one.

#22


6  

I'm particularly fond of this example from the fortune file:

#define BITCOUNT(x)    (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x)         ((x) - (((x)>>1)&0x77777777) \
                             - (((x)>>2)&0x33333333) \
                             - (((x)>>3)&0x11111111))

I like it best because it's so pretty!

#23


6  

Java JDK1.5

Integer.bitCount(n);

where n is the number whose 1's are to be counted.

Check also:

Integer.highestOneBit(n);
Integer.lowestOneBit(n);
Integer.numberOfLeadingZeros(n);
Integer.numberOfTrailingZeros(n);

//Beginning with the value 1, rotate left 16 times
     n = 1;
         for (int i = 0; i < 16; i++) {
            n = Integer.rotateLeft(n, 1);
            System.out.println(n);
         }

#24


6  

I found an implementation of bit counting over an array using SIMD instructions (SSSE3 and AVX2). It has 2-2.5 times better performance than using the __popcnt64 intrinsic function.

SSSE3 version:

#include <smmintrin.h>
#include <stdint.h>

const __m128i Z = _mm_set1_epi8(0x0);
const __m128i F = _mm_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m128i T = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m128i _sum = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 16)
    {
        //load 16-byte vector
        __m128i _src = _mm_loadu_si128((__m128i*)(src + i));
        //get low 4 bit for every byte in vector
        __m128i lo = _mm_and_si128(_src, F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m128i hi = _mm_and_si128(_mm_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm_add_epi64(_sum, _mm_sad_epu8(Z, _mm_shuffle_epi8(T, hi)));
    }
    uint64_t sum[2];
    _mm_storeu_si128((__m128i*)sum, _sum);
    return sum[0] + sum[1];
}

AVX2 version:

#include <immintrin.h>
#include <stdint.h>

const __m256i Z = _mm256_set1_epi8(0x0);
const __m256i F = _mm256_set1_epi8(0xF);
//Vector with pre-calculated bit count:
const __m256i T = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
                                   0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);

uint64_t BitCount(const uint8_t * src, size_t size)
{
    __m256i _sum =  _mm256_setzero_si256();
    for (size_t i = 0; i < size; i += 32)
    {
        //load 32-byte vector
        __m256i _src = _mm256_loadu_si256((__m256i*)(src + i));
        //get low 4 bit for every byte in vector
        __m256i lo = _mm256_and_si256(_src, F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, lo)));
        //get high 4 bit for every byte in vector
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(_src, 4), F);
        //sum precalculated value from T
        _sum = _mm256_add_epi64(_sum, _mm256_sad_epu8(Z, _mm256_shuffle_epi8(T, hi)));
    }
    uint64_t sum[4];
    _mm256_storeu_si256((__m256i*)sum, _sum);
    return sum[0] + sum[1] + sum[2] + sum[3];
}

#25


5  

There are many algorithms to count the set bits, but I think the best one is the fastest one! You can see the details on this page:

Bit Twiddling Hacks

I suggest this one:

Counting bits set in 14, 24, or 32-bit words using 64-bit instructions

unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 14-bit values in v:
c = (v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf;

// option 2, for at most 24-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) 
     % 0x1f;

// option 3, for at most 32-bit values in v:
c =  ((v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
c += (((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 
     0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 3 operations; the second option takes 10; and the third option takes 15.

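For instance, option 1 wrapped as a self-contained function (my sketch; per the constraint above it is only valid for values below 2^14):

#include <stdint.h>

/* Counts the bits set in v; v must be less than (1 << 14). */
unsigned int popcount14(uint32_t v) {
    return (unsigned int)((v * 0x200040008001ULL & 0x111111111111111ULL) % 0xf);
}
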
#26


5  

Here is a portable module (ANSI C) which can benchmark each of your algorithms on any architecture.

Your CPU has 9-bit bytes? No problem :-) At the moment it implements 2 algorithms, the K&R algorithm and a byte-wise lookup table. The lookup table is on average 3 times faster than the K&R algorithm. If someone can figure out a way to make the "Hacker's Delight" algorithm portable, feel free to add it in.

#ifndef _BITCOUNT_H_
#define _BITCOUNT_H_

/* Return the Hamming Weight of val, i.e. the number of 'on' bits. */
int bitcount( unsigned int );

/* List of available bitcount algorithms.  
 * onTheFly:    Calculate the bitcount on demand.
 *
 * lookupTable: Uses a small lookup table to determine the bitcount.  This
 * method is on average 3 times as fast as onTheFly, but incurs a small
 * upfront cost to initialize the lookup table on the first call.
 *
 * strategyCount is just a placeholder. 
 */
enum strategy { onTheFly, lookupTable, strategyCount };

/* String representations of the algorithm names */
extern const char *strategyNames[];

/* Choose which bitcount algorithm to use. */
void setStrategy( enum strategy );

#endif

.

#include <limits.h>

#include "bitcount.h"

/* The number of entries needed in the table is equal to the number of unique
 * values a char can represent which is always UCHAR_MAX + 1*/
static unsigned char _bitCountTable[UCHAR_MAX + 1];
static unsigned int _lookupTableInitialized = 0;

static int _defaultBitCount( unsigned int val ) {
    int count;

    /* Starting with:
     * 1100 - 1 == 1011,  1100 & 1011 == 1000
     * 1000 - 1 == 0111,  1000 & 0111 == 0000
     */
    for ( count = 0; val; ++count )
        val &= val - 1;

    return count;
}

/* Looks up each byte of the integer in a lookup table.
 *
 * The first time the function is called it initializes the lookup table.
 */
static int _tableBitCount( unsigned int val ) {
    int bCount = 0;

    if ( !_lookupTableInitialized ) {
        unsigned int i;
        for ( i = 0; i != UCHAR_MAX + 1; ++i )
            _bitCountTable[i] =
                ( unsigned char )_defaultBitCount( i );

        _lookupTableInitialized = 1;
    }

    for ( ; val; val >>= CHAR_BIT )
        bCount += _bitCountTable[val & UCHAR_MAX];

    return bCount;
}

static int ( *_bitcount ) ( unsigned int ) = _defaultBitCount;

const char *strategyNames[] = { "onTheFly", "lookupTable" };

void setStrategy( enum strategy s ) {
    switch ( s ) {
    case onTheFly:
        _bitcount = _defaultBitCount;
        break;
    case lookupTable:
        _bitcount = _tableBitCount;
        break;
    case strategyCount:
        break;
    }
}

/* Just a forwarding function which will call whichever version of the
 * algorithm has been selected by the client 
 */
int bitcount( unsigned int val ) {
    return _bitcount( val );
}

#ifdef _BITCOUNT_EXE_

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Use the same sequence of pseudo-random numbers to benchmark each Hamming
 * Weight algorithm.
 */
void benchmark( int reps ) {
    clock_t start, stop;
    int i, j;
    static const int iterations = 1000000;

    for ( j = 0; j != strategyCount; ++j ) {
        setStrategy( j );

        srand( 257 );

        start = clock(  );

        for ( i = 0; i != reps * iterations; ++i )
            bitcount( rand(  ) );

        stop = clock(  );

        printf
            ( "\n\t%d pseudo-random integers using %s: %f seconds\n\n",
              reps * iterations, strategyNames[j],
              ( double )( stop - start ) / CLOCKS_PER_SEC );
    }
}

int main( void ) {
    int option;

    while ( 1 ) {
        printf( "Menu Options\n"
            "\t1.\tPrint the Hamming Weight of an Integer\n"
            "\t2.\tBenchmark Hamming Weight implementations\n"
            "\t3.\tExit ( or cntl-d )\n\n\t" );

        if ( scanf( "%d", &option ) == EOF )
            break;

        switch ( option ) {
        case 1:
            printf( "Please enter the integer: " );
            if ( scanf( "%d", &option ) != EOF )
                printf
                    ( "The Hamming Weight of %d ( 0x%X ) is %d\n\n",
                      option, option, bitcount( option ) );
            break;
        case 2:
            printf
                ( "Please select number of reps ( in millions ): " );
            if ( scanf( "%d", &option ) != EOF )
                benchmark( option );
            break;
        case 3:
            goto EXIT;
            break;
        default:
            printf( "Invalid option\n" );
        }

    }

 EXIT:
    printf( "\n" );

    return 0;
}

#endif

#27


5  

private int get_bits_set(int v)
{
    int c; // c accumulates the total bits set in v
    for (c = 0; v != 0; c++) // v != 0 (not v > 0) so negative inputs are handled too
    {
        v &= v - 1; // clear the least significant bit set
    }
    return c;
}

#28


4  

32-bit or not? I came up with this method in Java after reading "Cracking the Coding Interview", 4th edition, exercise 5.5 (chapter 5: Bit Manipulation). If the least significant bit is 1, increment count, then right-shift the integer.

public static int bitCount( int n){
    int count = 0;
    for (int i=n; i!=0; i = i >>> 1){ // >>> (unsigned shift), so negative values terminate
        count += i & 1;
    }
    return count;
}

I think this one is more intuitive than the solutions with constant 0x33333333 no matter how fast they are. It depends on your definition of "best algorithm" .

#29


4  

A fast C# solution using a pre-calculated table of byte bit counts, with branching on input size.

public static class BitCount
{
    public static uint GetSetBitsCount(uint n)
    {
        var counts = BYTE_BIT_COUNTS;
        return n <= 0xff ? counts[n]
             : n <= 0xffff ? counts[n & 0xff] + counts[n >> 8]
             : n <= 0xffffff ? counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff]
             : counts[n & 0xff] + counts[(n >> 8) & 0xff] + counts[(n >> 16) & 0xff] + counts[(n >> 24) & 0xff];
    }

    public static readonly uint[] BYTE_BIT_COUNTS = 
    {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
}

#30


4  

I always use this in competitive programming; it's easy to write and efficient:

#include <bits/stdc++.h>

using namespace std;

int countOnes(int n) {
    bitset<32> b(n);
    return b.count();
}