在C中找到整数中最高集位(msb)的最快/最有效的方法是什么?

时间:2022-09-12 03:09:55

If I have some integer n, and I want to know the position of the most significant bit (that is, if the least significant bit is on the right, I want to know the position of the furthest left bit that is a 1), what is the quickest/most efficient method of finding out?

如果我有一些整数n,我想知道这个职位最重要的一点(也就是说,如果最低有效位是在右边,我想知道最左边的位置,是一个1),最快的找到/最有效的方法是什么?

I know that POSIX supports a ffs() method in strings.h to find the first set bit, but there doesn't seem to be a corresponding fls() method.

我知道POSIX支持字符串中的ffs()方法。h查找第一个集合位,但是似乎没有相应的fls()方法。

Is there some really obvious way of doing this that I'm missing?

有没有什么明显的方法可以让我忽略?

What about in cases where you can't use POSIX functions for portability?

如果您不能使用POSIX函数来实现可移植性,又该如何处理呢?

Edit: What about a solution that works on both 32 and 64 bit architectures (many of the code listings seem like they'd only work on 32 bit ints).

编辑:对于既适用于32位体系结构又适用于64位体系结构的解决方案(许多代码清单似乎只适用于32位体系结构)。

26 个解决方案

#1


51  

GCC has:

GCC有:

 -- Built-in Function: int __builtin_clz (unsigned int x)     Returns the number of leading 0-bits in X, starting at the most     significant bit position.  If X is 0, the result is undefined. -- Built-in Function: int __builtin_clzl (unsigned long)     Similar to `__builtin_clz', except the argument type is `unsigned     long'. -- Built-in Function: int __builtin_clzll (unsigned long long)     Similar to `__builtin_clz', except the argument type is `unsigned     long long'.

I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.

我希望它们能被翻译成一种对你目前的平台来说相当有效的东西,无论它是一种花哨的小算法,还是一种指令。

#2


38  

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

假设您正在使用x86和game进行一些内联汇编,Intel提供了一个BSR指令(“位扫描反向”)。它在某些x86s上运行得很快(在其他一些上进行微编码)。从手册:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.

在源操作数中搜索最重要的设置位(1位)。如果找到一个最重要的1位,它的位索引将存储在目标操作数中。源操作数可以是寄存器或内存位置;目标操作数是一个寄存器。位索引是源操作数的位0的无符号偏移量。如果内容源操作数为0,则目标操作数的内容未定义。

(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)

(如果你在PowerPC上,也有类似的cntlz(“计数前导零”)指令。)

Example code for gcc:

gcc的示例代码:

#include <iostream>int main (int,char**){  int n=1;  for (;;++n) {    int msb;    asm("bsrl %1,%0" : "=r"(msb) : "r"(n));    std::cout << n << " : " << msb << std::endl;  }  return 0;}

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

请参阅这个内联汇编器教程,它显示(第9.4节)它比循环代码快得多。

#3


34  

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.

因为2 ^ N是一个整数只有第N个点集(1 < < N),找到最高的位置(N)设置是整数log 2的整数。

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

http://graphics.stanford.edu/ ~ seander / bithacks.html # IntegerLogObvious

unsigned int v;unsigned r = 0;while (v >>= 1) {    r++;}

This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.

这种“明显的”算法可能对每个人都不透明,但是当您意识到代码反复地向右移动一点,直到最左边的位被移走(注意C将任何非零值视为true)并返回移位数时,这是完全合理的。它还意味着,即使设置了多个位,它仍然可以工作——结果总是为最重要的位。

If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.

如果你向下滚动页面,会有更快、更复杂的变化。但是,如果您知道您正在处理的数字有很多前导0,那么这种简单的方法可能提供可接受的速度,因为在C语言中位移速度相当快,而且简单的算法不需要对数组进行索引。

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

注意:使用64位值时,要特别小心使用超智能算法;它们中的许多只适用于32位的值。

#4


15  

This should be lightning fast:

这应该是闪电速度:

int msb(unsigned int v) {  static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,    30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,    16, 7, 26, 12, 18, 6, 11, 5, 10, 9};  v |= v >> 1;  v |= v >> 2;  v |= v >> 4;  v |= v >> 8;  v |= v >> 16;  v = (v >> 1) + 1;  return pos[(v * 0x077CB531UL) >> 27];}

#5


12  

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.

这有点像找到一种整数对数。这里有一些小技巧,但我自己做了一个工具。目标当然是速度。

My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.

我的意识是CPU已经有一个自动位检测器,用于整型浮点转换!所以使用它。

double ff=(double)(v|1);return ((*(1+(uint32_t *)&ff))>>20)-1023;  // assumes x86 endianness

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.

这个版本将值转换为double,然后读取指数,它告诉您位在哪里。巧妙的变换和减法是从IEEE值中提取适当的部分。

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.

使用浮点数稍微快一点,但是由于浮点数的精度较小,它只能提供最初的24位位置。


To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.

为了安全地实现这一点,在c++或C中没有未定义的行为,使用memcpy而不是指针强制类型转换。编译器知道如何高效地内联它。

// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");// and also static_assert something about FLT_ENDIAN?double ff=(double)(v|1);uint32_t tmp;memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));return (tmp>>20)-1023;

Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.

或在C99或以后,使用联合{双d;uint32_t u[2];}。但是请注意,在c++中,联合类型punning只支持一些编译器作为扩展,而不是在ISO c++中。


This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).

这通常要比特定于平台的、用于开始- 0计数指令的固有程序慢,但是可移植的ISO C没有这样的功能。有些cpu也缺少一个开始为零的计数指令,但是有些cpu可以有效地将整数转换为双精度。但是,将FP位模式类型转换为integer可能很慢(例如,在PowerPC上,它需要一个存储/重载,通常会导致加载-hit-store失速)。

This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD

该算法可能对SIMD实现有用,因为较少的cpu有SIMD lzcnt。x86只能用AVX512CD得到这样的指令

#6


9  

Kaz Kylheku here

Kaz Kylheku这里

I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

我为超过63位的数字(gcc x86_64上的长类型)设置了两种方法,远离符号位。

(I happen to need this "find highest bit" for something, you see.)

(你知道,我碰巧需要这个“找到最高位”。)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

我实现了数据驱动的二进制搜索(紧密基于上面的一个答案)。我还手工实现了一个完全未滚动的决策树,它只是带有即时操作数的代码。没有循环,没有表。

The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.

决策树(highest_bit_unroll)基准测试的速度要快69%,除了在n = 0的情况下,二进制搜索有一个显式测试。

The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.

二值搜索对于0的特殊测试只比没有特殊测试的决策树快48%。

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).

编译器,机器(GCC 4.5.2, -O3, x86-64, 2867mhz Intel Core i5)。

int highest_bit_unrolled(long long n){  if (n & 0x7FFFFFFF00000000) {    if (n & 0x7FFF000000000000) {      if (n & 0x7F00000000000000) {        if (n & 0x7000000000000000) {          if (n & 0x4000000000000000)            return 63;          else            return (n & 0x2000000000000000) ? 62 : 61;        } else {          if (n & 0x0C00000000000000)            return (n & 0x0800000000000000) ? 60 : 59;          else            return (n & 0x0200000000000000) ? 58 : 57;        }      } else {        if (n & 0x00F0000000000000) {          if (n & 0x00C0000000000000)            return (n & 0x0080000000000000) ? 56 : 55;          else            return (n & 0x0020000000000000) ? 54 : 53;        } else {          if (n & 0x000C000000000000)            return (n & 0x0008000000000000) ? 52 : 51;          else            return (n & 0x0002000000000000) ? 50 : 49;        }      }    } else {      if (n & 0x0000FF0000000000) {        if (n & 0x0000F00000000000) {          if (n & 0x0000C00000000000)            return (n & 0x0000800000000000) ? 48 : 47;          else            return (n & 0x0000200000000000) ? 46 : 45;        } else {          if (n & 0x00000C0000000000)            return (n & 0x0000080000000000) ? 44 : 43;          else            return (n & 0x0000020000000000) ? 42 : 41;        }      } else {        if (n & 0x000000F000000000) {          if (n & 0x000000C000000000)            return (n & 0x0000008000000000) ? 40 : 39;          else            return (n & 0x0000002000000000) ? 38 : 37;        } else {          if (n & 0x0000000C00000000)            return (n & 0x0000000800000000) ? 36 : 35;          else            return (n & 0x0000000200000000) ? 34 : 33;        }      }    }  } else {    if (n & 0x00000000FFFF0000) {      if (n & 0x00000000FF000000) {        if (n & 0x00000000F0000000) {          if (n & 0x00000000C0000000)            return (n & 0x0000000080000000) ? 32 : 31;          else            return (n & 0x0000000020000000) ? 30 : 29;        } else {          if (n & 0x000000000C000000)            return (n & 0x0000000008000000) ? 28 : 27;          else            return (n & 0x0000000002000000) ? 26 : 25;        }      } else {        if (n & 0x0000000000F00000) {          if (n & 0x0000000000C00000)            return (n & 0x0000000000800000) ? 24 : 23;          else            return (n & 0x0000000000200000) ? 22 : 21;        } else {          if (n & 0x00000000000C0000)            return (n & 0x0000000000080000) ? 20 : 19;          else            return (n & 0x0000000000020000) ? 18 : 17;        }      }    } else {      if (n & 0x000000000000FF00) {        if (n & 0x000000000000F000) {          if (n & 0x000000000000C000)            return (n & 0x0000000000008000) ? 16 : 15;          else            return (n & 0x0000000000002000) ? 14 : 13;        } else {          if (n & 0x0000000000000C00)            return (n & 0x0000000000000800) ? 12 : 11;          else            return (n & 0x0000000000000200) ? 10 : 9;        }      } else {        if (n & 0x00000000000000F0) {          if (n & 0x00000000000000C0)            return (n & 0x0000000000000080) ? 8 : 7;          else            return (n & 0x0000000000000020) ? 6 : 5;        } else {          if (n & 0x000000000000000C)            return (n & 0x0000000000000008) ? 4 : 3;          else            return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);        }      }    }  }}int highest_bit(long long n){  const long long mask[] = {    0x000000007FFFFFFF,    0x000000000000FFFF,    0x00000000000000FF,    0x000000000000000F,    0x0000000000000003,    0x0000000000000001  };  int hi = 64;  int lo = 0;  int i = 0;  if (n == 0)    return 0;  for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {    int mi = lo + (hi - lo) / 2;    if ((n >> mi) != 0)      lo = mi;    else if ((n & (mask[i] << lo)) != 0)      hi = mi;  }  return lo + 1;}

Quick and dirty test program:

快速检测程序:

#include <stdio.h>#include <time.h>#include <stdlib.h>int highest_bit_unrolled(long long n);int highest_bit(long long n);main(int argc, char **argv){  long long n = strtoull(argv[1], NULL, 0);  int b1, b2;  long i;  clock_t start = clock(), mid, end;  for (i = 0; i < 1000000000; i++)    b1 = highest_bit_unrolled(n);  mid = clock();  for (i = 0; i < 1000000000; i++)    b2 = highest_bit(n);  end = clock();  printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);  printf("time1 = %d\n", (int) (mid - start));  printf("time2 = %d\n", (int) (end - mid));  return 0;}

Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

只用-O2,差别就更大了。决策树几乎快了四倍。

I also benchmarked against the naive bit shifting code:

我还以幼稚的比特转换代码为基准:

int highest_bit_shift(long long n){  int i = 0;  for (; n; n >>= 1, i++)    ; /* empty */  return i;}

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

正如人们所预期的那样,这对于小数字来说是非常快的。在确定n = 1的最高位为1时,它的基准测试速度要快80%以上。然而,在63位空间中随机选择的数字中,有一半是第63位!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

在输入0x3fffffffffffffffffffffff上,决策树版本比在1上快得多,显示比位移位器快1120%(12.2倍)。

I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

我还将基准决策树与GCC构建相比较,并尝试混合输入,而不是重复相同的数字。可能会有一些关于分支的预测,可能会有一些不切实际的缓存场景,使其在重复的过程中人为地更快。

#7


7  

What about

是什么

int highest_bit(unsigned int a) {    int count;    std::frexp(a, &count);    return count - 1;}

?

吗?

#8


6  

unsigned intmsb32(register unsigned int x){        x |= (x >> 1);        x |= (x >> 2);        x |= (x >> 4);        x |= (x >> 8);        x |= (x >> 16);        return(x & ~(x >> 1));}

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.

1注册,13指令。信不信由你,这通常比上面提到的BSR指令要快,它在线性时间内运行。这是对数时间。

From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

从http://aggregate.org/MAGIC/ # % 20大多数重要% 20位201%

#9


5  

Here are some (simple) benchmarks, of algorithms currently given on this page...

以下是一些(简单的)基准,是本页当前给出的算法……

The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)

算法尚未对所有无符号整型输入进行测试;所以,在盲目使用某样东西之前,先检查一下。

On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...

在我的机器上,clz (__builtin_clz)和asm工作得最好。asm似乎比clz还要快……但这可能是由于简单的基准……

//////// go.c ///////////////////////////////// compile with:  gcc go.c -o go -lm#include <math.h>#include <stdio.h>#include <stdlib.h>#include <time.h>/***************** math ********************/#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */    \  ((unsigned) log2(a))         /* thus: do not use if a <= 0 */  #define NUM_OF_HIGHESTBITmath(a) ((a)               \                  ? (1U << POS_OF_HIGHESTBITmath(a))    \                  : 0)/***************** clz ********************/unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */#define NUM_OF_HIGHESTBITclz(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITclz(a))  \                 : 0)/***************** i2f ********************/double FF;#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)#define NUM_OF_HIGHESTBITi2f(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITi2f(a))  \                 : 0)/***************** asm ********************/unsigned OUT;#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)#define NUM_OF_HIGHESTBITasm(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITasm(a))  \                 : 0)/***************** bitshift1 ********************/#define NUM_OF_HIGHESTBITbitshift1(a) (({   \  OUT = a;                  \  OUT |= (OUT >> 1);                \  OUT |= (OUT >> 2);                \  OUT |= (OUT >> 4);                \  OUT |= (OUT >> 8);                \  OUT |= (OUT >> 16);               \      }), (OUT & ~(OUT >> 1)))          \/***************** bitshift2 ********************/int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,             30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,             16, 7, 26, 12, 18, 6, 11, 5, 10, 9};#define POS_OF_HIGHESTBITbitshift2(a) (({   \  OUT = a;                  \  OUT |= OUT >> 1;              \  OUT |= OUT >> 2;              \  OUT |= OUT >> 4;              \  OUT |= OUT >> 8;              \  OUT |= OUT >> 16;             \  OUT = (OUT >> 1) + 1;             \      }), POS[(OUT * 0x077CB531UL) >> 27])#define NUM_OF_HIGHESTBITbitshift2(a) ((a)              \                       ? (1U << POS_OF_HIGHESTBITbitshift2(a)) \                       : 0)#define LOOPS 100000000Uint main(){  time_t start, end;  unsigned ui;  unsigned n;  /********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/  printf("math\n");  for (ui = 0U; ui < 18; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));  printf("\n\n");  printf("clz\n");  for (ui = 0U; ui < 18U; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));  printf("\n\n");  printf("i2f\n");  for (ui = 0U; ui < 18U; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));  printf("\n\n");  printf("asm\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));  }  printf("\n\n");  printf("bitshift1\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));  }  printf("\n\n");  printf("bitshift2\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));  }  printf("\n\nPlease wait...\n\n");  /************************* Simple clock() benchmark ******************/  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITmath(ui);  end = clock();  printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITclz(ui);  end = clock();  printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITi2f(ui);  end = clock();  printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITasm(ui);  end = clock();  printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITbitshift1(ui);  end = clock();  printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITbitshift2(ui);  end = clock();  printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");  return EXIT_SUCCESS;}

#10


5  

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

尽管我可能只会使用这种方法,如果我绝对需要最好的性能(例如,编写一些包含位板的棋盘游戏AI),最有效的解决方案是使用内联ASM。请参阅本博客文章的Optimisations部分以获得解释。

[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:

[…, bsrl汇编指令计算最重要位的位置。因此,我们可以使用这个asm语句:

asm ("bsrl %1, %0"      : "=r" (position)      : "r" (number));

#11


4  

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

我需要一个例行程序来完成这个任务,在搜索web(以及找到这个页面)之前,我找到了我自己的解决方案basedon,一个二分查找。虽然我肯定有人曾经做过!它运行的时间是恒定的,而且可能比发布的“显而易见的”解决方案要快,尽管我并没有发表任何伟大的声明,只是出于兴趣发布它。

int highest_bit(unsigned int a) {  static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };  const unsigned int *mask = maskv;  int l, h;  if (a == 0) return -1;  l = 0;  h = 32;  do {    int m = l + (h - l) / 2;    if ((a >> m) != 0) l = m;    else if ((a & (*mask << l)) != 0) h = m;    mask++;  } while (l < h - 1);  return l;}

#12


3  

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:

正如上面的答案所指出的,有很多方法可以确定最重要的部分。然而,正如所指出的,这些方法可能对32位或64位寄存器都是唯一的。stanford。edu bithacks页面提供适用于32位和64位计算的解决方案。只需做一点工作,它们就可以组合在一起,为获得MSB提供可靠的跨体系结构方法。我在64位和32位计算机上得到的解决方案是:

#if defined(__LP64__) || defined(_LP64)# define BUILD_64   1#endif#include <stdio.h>#include <stdint.h>  /* for uint32_t *//* CHAR_BIT  (or include limits.h) */#ifndef CHAR_BIT#define CHAR_BIT  8#endif  /* CHAR_BIT *//*  * Find the log base 2 of an integer with the MSB N set in O(N) * operations. (on 64bit & 32bit architectures) */intgetmsb (uint32_t word){    int r = 0;    if (word < 1)        return 0;#ifdef BUILD_64    union { uint32_t u[2]; double d; } t;  // temp    t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;    t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;    t.d -= 4503599627370496.0;    r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;#else    while (word >>= 1)    {        r++;    }#endif  /* BUILD_64 */    return r;}

#13


3  

A version in C using successive approximation:

使用逐次逼近的C版本:

unsigned int getMsb(unsigned int n){  unsigned int msb  = sizeof(n) * 4;  unsigned int step = msb;  while (step > 1) {    step /=2;    if (n>>msb)     msb += step;   else     msb -= step; }  if (n>>msb)    msb++;  return (msb - 1);}

Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.( 4 loops when using "unsigned int")

优点:运行时间是常量,不管所提供的数量是多少,因为循环的数量总是相同的。(使用“无符号int”时循环4次)

#14


3  

thats some kind of binary search, it works with all kinds of (unsigned!) integer types

这是一种二进制搜索,它适用于所有类型的(无符号!)整数类型

#include <climits>#define UINT (unsigned int)#define UINT_BIT (CHAR_BIT*sizeof(UINT))int msb(UINT x){    if(0 == x)        return -1;    int c = 0;    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)    if(static_cast<UINT>(x >> i))    {        x >>= i;        c |= i;    }    return c;}

to make complete:

完成:

#include <climits>#define UINT unsigned int#define UINT_BIT (CHAR_BIT*sizeof(UINT))int lsb(UINT x){    if(0 == x)        return -1;    int c = UINT_BIT-1;    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)    if(static_cast<UINT>(x << i))    {        x <<= i;        c ^= i;    }    return c;}

#15


3  

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).

这里有一些过于复杂的答案。Debruin技术应该只在输入已经是2的幂时使用,否则会有更好的方法。对于2输入的能力,Debruin是绝对最快的,甚至比我测试过的任何处理器上的_BitScanReverse都要快。然而,在一般情况下,_BitScanReverse(或者无论在编译器中调用什么本质)是最快的(在某些CPU上它可以被微编码)。

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

如果内部函数不是一个选项,这里有一个处理一般输入的最佳软件解决方案。

u8  inline log2 (u32 val)  {    u8  k = 0;    if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }    if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }    if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }    if (val > 0x00000003u) { val >>= 2;  k |= 2;  }    k |= (val & 2) >> 1;    return k;}

Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.

注意,这个版本不需要在末尾进行Debruin查找,这与其他大多数答案不同。它计算位置。

Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.

但是,如果重复调用表的次数足够多,则表的加速比缓存丢失的风险要大。

u8 kTableLog2[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};u8 log2_table(u32 val)  {    u8  k = 0;    if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }    if (val > 0x000000FFuL) { val >>=  8; k |=  8; }    k |= kTableLog2[val]; // precompute the Log2 of the low byte    return k;}

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

这应该会产生最高的吞吐量,在这里给出的任何软件答案中,但是如果您只是偶尔调用它,那么更喜欢像我的第一个代码片段那样的无表解决方案。

#16


3  

has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:

c99给出了log2。这消除了您在此页上看到的所有特殊的酱汁log2实现的需要。您可以使用标准的log2实现,如下所示:

const auto n = 13UL;const auto Index = (unsigned long)log2(n);printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

An n of 0UL needs to be guarded against as well, because:

一个n的0UL也需要防备,因为:

-∞ is returned and FE_DIVBYZERO is raised

——∞返回和FE_DIVBYZERO

I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi

我已经编写了一个示例,其中的检查将索引任意设置为ULONG_MAX: https://ideone.com/u26vsi


The corollary to ephemient's gcc only answer is:

视觉工作室推论ephemient的gcc唯一答案是:

const auto n = 13UL;unsigned long Index;_BitScanReverse(&Index, n);printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

The documentation for _BitScanReverse states that Index is:

_BitScanReverse文件表明索引是:

Loaded with the bit position of the first set bit (1) found

加载找到的第一个set位(1)的位位置

In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:

在实践中,我发现如果n是0UL,那么该索引将被设置为0UL,就像对于n (1UL)一样。但在文档中,对于n (0UL)的情况,唯一保证的是返回值是:

0 if no set bits were found

如果没有找到集合点。

Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409

因此,在这种情况下,同样地,在返回值之上的log2实现应该检查设置索引到标记值。我再次为这里的标志值使用ULONG_MAX编写了一个示例:http://rextester.com/GCU61409

#17


2  

Think bitwise operators.

认为逐位运算符。

I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:

我第一次误解了这个问题。您应该使用最左边的位集(其余的为0)生成一个int。假设cmp设为该值:

position = sizeof(int)*8while(!(n & cmp)){    n <<=1;   position--;}

#18


2  

Expanding on Josh's benchmark...one can improve the clz as follows

扩大Josh基准…可以改进clz,如下所示

/***************** clz2 ********************/#define NUM_OF_HIGHESTBITclz2(a) ((a)                              \                  ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \                  : 0)

Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

关于asm:注意有bsr和bsrl(这是“long”版本)。正常的可能要快一点。

#19


1  

Note that what you are trying to do is calculate the integer log2 of an integer,

注意,你要做的是计算一个整数的log2,

#include <stdio.h>#include <stdlib.h>unsigned intLog2(unsigned long x){    unsigned long n = x;    int bits = sizeof(x)*8;    int step = 1; int k=0;    for( step = 1; step < bits; ) {        n |= (n >> step);        step *= 2; ++k;    }    //printf("%ld %ld\n",x, (x - (n >> 1)) );    return(x - (n >> 1));}

Observe that you can attempt to search more than 1 bit at a time.

注意,您可以尝试每次搜索超过1位。

unsigned intLog2_a(unsigned long x){    unsigned long n = x;    int bits = sizeof(x)*8;    int step = 1;    int step2 = 0;    //observe that you can move 8 bits at a time, and there is a pattern...    //if( x>1<<step2+8 ) { step2+=8;        //if( x>1<<step2+8 ) { step2+=8;            //if( x>1<<step2+8 ) { step2+=8;            //}        //}    //}    for( step2=0; x>1L<<step2+8; ) {        step2+=8;    }    //printf("step2 %d\n",step2);    for( step = 0; x>1L<<(step+step2); ) {        step+=1;        //printf("step %d\n",step+step2);    }    printf("log2(%ld) %d\n",x,step+step2);    return(step+step2);}

This approach uses a binary search

这种方法使用二进制搜索

unsigned intLog2_b(unsigned long x){    unsigned long n = x;    unsigned int bits = sizeof(x)*8;    unsigned int hbit = bits-1;    unsigned int lbit = 0;    unsigned long guess = bits/2;    int found = 0;    while ( hbit-lbit>1 ) {        //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        //when value between guess..lbit        if( (x<=(1L<<guess)) ) {           //printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);            hbit=guess;            guess=(hbit+lbit)/2;            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        }        //when value between hbit..guess        //else        if( (x>(1L<<guess)) ) {            //printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);            lbit=guess;            guess=(hbit+lbit)/2;            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        }    }    if( (x>(1L<<guess)) ) ++guess;    printf("log2(x%ld)=r%d\n",x,guess);    return(guess);}

Another binary search method, perhaps more readable,

另一个二分查找方法,也许更容易读,

unsigned intLog2_c(unsigned long x){    unsigned long v = x;    unsigned int bits = sizeof(x)*8;    unsigned int step = bits;    unsigned int res = 0;    for( step = bits/2; step>0; )    {        //printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);        while ( v>>step ) {            v>>=step;            res+=step;            //printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);        }        step /= 2;    }    if( (x>(1L<<res)) ) ++res;    printf("log2(x%ld)=r%ld\n",x,res);    return(res);}

And because you will want to test these,

因为你想测试这些,

int main(){    unsigned long int x = 3;    for( x=2; x<1000000000; x*=2 ) {        //printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));        printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));        printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));        printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));    }    return(0);}

#20


1  

Putting this in since it's 'yet another' approach, seems to be different from others already given.

既然这是“另一种”方法,那么把它放在这里似乎与已经给出的其他方法不同。

returns -1 if x==0, otherwise floor( log2(x)) (max result 31)

如果x= 0,则返回-1,否则返回层(log2(x))(最大结果31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

从32位问题减少到4位问题,然后使用表。也许不雅,但务实。

This is what I use when I don't want to use __builtin_clz because of portability issues.

由于可移植性问题,当我不想使用__builtin_clz时,我就使用这种方法。

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

为了使它更紧凑,可以使用循环来减少,每次增加4到r,最多7次迭代。或者有些混合,如(64位):循环减少到8,测试减少到4。

int log2floor( unsigned x ){   static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};   int r = 0;   unsigned xk = x >> 16;   if( xk != 0 ){       r = 16;       x = xk;   }   // x is 0 .. 0xFFFF   xk = x >> 8;   if( xk != 0){       r += 8;       x = xk;   }   // x is 0 .. 0xFF   xk = x >> 4;   if( xk != 0){       r += 4;       x = xk;   }   // now x is 0..15; x=0 only if originally zero.   return r + wtab[x];}

#21


1  

I know this question is very old, but just having implemented an msb() function myself,I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

我知道这个问题很老了,但是仅仅是我自己实现了msb()函数,我就发现这里和其他网站上提供的大多数解决方案不一定是最有效的——至少对于我个人对效率的定义是这样的(参见下面的更新)。原因如下:

Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.

大多数解决方案(特别是那些采用某种二进制搜索方案或从右到左进行线性扫描的简单方法)似乎忽略了一个事实,即对于任意的二进制数,没有多少从一个非常长的零序列开始。事实上,对于任何位宽,所有整数的一半以1开头,四分之一以01开头。明白我的意思了吗?我的论点是,从最重要的位位置到最不重要的(左到右)的线性扫描不是那么“线性”的,就像乍看上去的那样。

It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

可以说明一点,对于任何位宽,需要测试的位的平均数目最多为2。这转化为O(1)相对于比特数(!)的平摊时间复杂度。

Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

当然,最坏的情况仍然是O(n),比用类二进制搜索的方法得到的O(log(n)更糟糕,但是由于最坏的情况非常少,所以对大多数应用程序来说它们是可以忽略的(Update:不完全是这样:可能很少,但是很有可能会发生,请参阅下面的Update)。

Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:

这是我提出了“幼稚”方法,至少在我的机器上比大多数其他方法(32位整数二进制搜索方案总是需要log2(32)= 5步骤,而这愚蠢的算法需要平均小于2)——对不起,这是c++和不纯C:

template <typename T>auto msb(T n) -> int{    static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,        "msb<T>(): T must be an unsigned integral type.");    for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)    {        if ((n & mask) != 0)            return i;    }    return 0;}

Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.

更新:虽然我在这里写的是完全适用于任意整数,每一位同样可能的组合(速度测试仅仅用了多长时间来确定测量MSB所有32位整数),真实的整数,而这样一个函数将被调用,通常遵循一个不同的模式:在我的代码,例如,这个函数是用来确定一个对象大小是2的幂,或找到下一个2的力量大于或等于物体的大小。我的猜测是,大多数使用MSB的应用程序都包含比整数所代表的最大数字小得多的数字(对象大小很少使用size_t中的所有位)。在这种情况下,我的解决方案实际上会比二进制搜索方法执行得更差——因此应该选择后者,尽管我的解决方案将在所有整数中循环得更快。现实生活中的整数可能会倾向于这个简单算法最坏的情况,这最终会使它表现得更差——尽管对于真正任意的整数,它是被平摊的O(1)。

1The argument goes like this (rough draft):Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.

论点是这样的(粗略的草稿):让n是位(位宽)的个数。总共有2n个整数可以用n位来表示。从1开始有2n - 1个整数(第一个1是固定的,剩下的n - 1位可以是任何数)。这些整数只需要循环的一次交互就可以确定MSB。此外,有2n - 2个整数从01开始,需要2次迭代,2n - 3个整数从001开始,需要3次迭代,以此类推。

If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:

如果我们对所有可能的整数进行必要的迭代,并将它们除以2n,即整数的总数,我们就得到了确定n位整数的MSB所需的平均迭代次数:

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 +…)n + n)/ 2

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

这一系列的平均迭代实际上是收敛的,n趋于无穷时的极限是2

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

因此,从左到右的朴素算法实际上对任意数量的比特都有O(1)的平摊常数时间复杂度。

#22


0  

The code:

代码:

    // x>=1;    unsigned func(unsigned x) {    double d = x ;    int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;    printf( "The left-most non zero bit of %d is bit %d\n", x, p);    }

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

或者通过设置Y=1获得FPU指令FYL2X (Y*Log2 X)的整数部分

#23


0  

Woaw, that was many answers. I am not sorry for answering on an old question.

啊,有很多答案。我不为回答一个老问题而感到遗憾。

int result = 0;//could be a char or int8_t insteadif(value){//this assumes the value is 64bit    if(0xFFFFFFFF00000000&value){  value>>=(1<<5); result|=(1<<5);  }//if it is 32bit then remove this line    if(0x00000000FFFF0000&value){  value>>=(1<<4); result|=(1<<4);  }//and remove the 32msb    if(0x000000000000FF00&value){  value>>=(1<<3); result|=(1<<3);  }    if(0x00000000000000F0&value){  value>>=(1<<2); result|=(1<<2);  }    if(0x000000000000000C&value){  value>>=(1<<1); result|=(1<<1);  }    if(0x0000000000000002&value){  result|=(1<<0);  }}else{  result=-1;}

This answer is pretty similar to another answer... oh well.

这个答案和另一个答案很相似……哦。

#24


0  

I assume your question is for an integer (called v below) and not an unsigned integer.

我假设您的问题是一个整数(以下称为v),而不是无符号整数。

int v = 612635685; // whatever value you wishunsigned int get_msb(int v){    int r = 31;                         // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.    while (!(v & 0x8000000) && r--) {   // mask of the highest bit        v <<= 1;                        // multiply integer by 2.    }    return r;                           // will even return -1 if no bit was set, allowing error catch}

If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).Please let me know if I forgot anything. I haven't tested it but it should work just fine.

如果您想让它在不考虑符号的情况下工作,您可以在循环之前添加一个额外的'v <<= 1;'(并相应地将r值更改为30)。如果我忘了什么,请告诉我。我还没有对它进行测试,但它应该可以正常工作。

#25


-1  

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

另一张海报提供了一个使用字节范围查找的查找表。如果您希望获得更多的性能(以32K内存而不是256个查找项为代价),这里有一个使用15位查找表的解决方案,在c# 7中的。net中。

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:

有趣的部分是初始化表。由于它是一个相对较小的块,我们希望在整个过程中使用它,所以我使用Marshal.AllocHGlobal来分配非托管内存。正如您所看到的,为了获得最大的性能,整个示例是作为本机编写的:

readonly static byte[] msb_tab_15;// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.// The table is compressed by half, so use (value >> 1) for indexing.static MyStaticInit(){    var p = new byte[0x8000];    for (byte n = 0; n < 16; n++)        for (int c = (1 << n) >> 1, i = 0; i < c; i++)            p[c + i] = n;    msb_tab_15 = p;}

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

表需要通过上面的代码进行一次初始化。它是只读的,因此可以为并发访问共享一个全局副本。使用这个表,您可以快速查找整数log2,这就是我们在这里寻找的,对于所有的整数宽度(8、16、32和64位)。

Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

注意,0的表条目(“最高集位”的概念未定义的唯一整数)的值为-1。这种区别对于正确处理下面代码中0值的大写字母是必要的。下面是各种整数原语的代码。

ulong (64-bit) Version

ulong(64位)的版本

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>public static int HighestOne(this ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63    int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;    j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;    return j + msb_tab_15[v >> (j + 1)];}

uint (32-bit) Version

单位(32位)版本

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>public static int HighestOne(uint v){    if ((int)v <= 0)        return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31    int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;    return j + msb_tab_15[v >> (j + 1)];}

Various overloads for the above

上面的各种重载。

public static int HighestOne(long v) => HighestOne((ulong)v);public static int HighestOne(int v) => HighestOne((uint)v);public static int HighestOne(ushort v) => msb_tab_15[v >> 1];public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];public static int HighestOne(char ch) => msb_tab_15[ch >> 1];public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];public static int HighestOne(byte v) => msb_tab_15[v >> 1];

This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.

这是一个完整的、可工作的解决方案,它代表了。net 4.7.2的最佳性能,我将其与专门的性能测试工具进行了比较。下面提到了其中的一些。测试参数为所有65位位置的均匀密度,即,0…31/63 + 0(产生结果-1)目标索引位置下方的位随机填充。测试仅为x64,释放模式,启用了jit优化。




That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.

这就是我正式回答的结尾;下面是与我运行的测试相关的一些临时注释和源代码链接,以验证上述代码的性能和正确性。


The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

上面提供的版本编码为Tab16A,在许多次运行中都是一致的赢家。这些不同的候选人,以积极的工作/抓拍形式,可以在这里,这里和这里找到。

 1  candidates.HighestOne_Tab16A               622,496 2  candidates.HighestOne_Tab16C               628,234 3  candidates.HighestOne_Tab8A                649,146 4  candidates.HighestOne_Tab8B                656,847 5  candidates.HighestOne_Tab16B               657,147 6  candidates.HighestOne_Tab16D               659,650 7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900 8  de_Bruijn.IndexOfMSB                       709,672 9  _old_2.HighestOne_Old2                     715,81010  _test_A.HighestOne8                        757,18811  _old_1.HighestOne_Old1                     757,92512  _test_A.HighestOne5  (unsafe)              760,38713  _test_B.HighestOne8  (unsafe)              763,90414  _test_A.HighestOne3  (unsafe)              766,43315  _test_A.HighestOne1  (unsafe)              767,32116  _test_A.HighestOne4  (unsafe)              771,70217  _test_B.HighestOne2  (unsafe)              772,13618  _test_B.HighestOne1  (unsafe)              772,52719  _test_B.HighestOne3  (unsafe)              774,14020  _test_A.HighestOne7  (unsafe)              774,58121  _test_B.HighestOne7  (unsafe)              775,46322  _test_A.HighestOne2  (unsafe)              776,86523  candidates.HighestOne_NoTab                777,69824  _test_B.HighestOne6  (unsafe)              779,48125  _test_A.HighestOne6  (unsafe)              781,55326  _test_B.HighestOne4  (unsafe)              785,50427  _test_B.HighestOne5  (unsafe)              789,79728  _test_A.HighestOne0  (unsafe)              809,56629  _test_B.HighestOne0  (unsafe)              814,99030  _highest_one_bit.HighestOne                824,34530  _bitarray_ext.RtlFindMostSignificantBit    894,06931  candidates.HighestOne_Naive                898,865

Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

值得注意的是ntdll.dll的糟糕性能!通过P / Invoke RtlFindMostSignificantBit:

[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]public static extern int RtlFindMostSignificantBit(ulong ul);

It's really too bad, because here's the entire actual function:

这太糟糕了,因为这是整个实际的函数:

    RtlFindMostSignificantBit:        bsr rdx, rcx          mov eax,0FFFFFFFFh          movzx ecx, dl          cmovne      eax,ecx          ret

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

我无法想象这5条线带来的糟糕表现,所以管理/原生过渡的惩罚应该受到责备。我还感到惊讶的是,测试实际上更喜欢32KB(和64KB)短(16位)的直接查找表,而不是128字节(和256字节)的8位查找表。我认为下面的16位查找会更有竞争力,但后者的表现一直比这更好:

public static int HighestOne_Tab8A(ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 64) - 1;    int j;    j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;    j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;    j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;    return j + msb_tab_8[v >> j];}

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

最后我要指出的是,我对我的deBruijn法没有取得更好的效果感到非常震惊。这是我以前广泛使用的方法:

const ulong N_bsf64 = 0x07EDD5E59A4E28C2,            N_bsr64 = 0x03F79D71B4CB0A89;readonly public static sbyte[]bsf64 ={    63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,    61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,    62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,    56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,},bsr64 ={     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,};public static int IndexOfLSB(ulong v) =>    v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;public static int IndexOfMSB(ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 64) - 1;    v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better    v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?    return bsr64[(v * N_bsr64) >> 58];}

There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

关于德布鲁因在这个问题上的卓越和伟大的方法有很多讨论,我倾向于同意。我的推测是,尽管deBruijn和直接查找表方法(我发现最快)都必须进行表查找,而且它们的分支都非常少,只有deBruijn有一个64位的乘法操作。我只在这里测试了IndexOfMSB函数——而不是deBruijn IndexOfLSB——但是我希望后者能有更好的机会,因为它的操作更少(见上面),我可能会继续在LSB上使用它。

#26


-5  

One approach could be to keep shifting left till the number becomes negative.

一种方法是向左移动直到数字变成负数。

Here is the code:

这是代码:

Funct() {   int number; int count;  while(number > 0) {    number = number << 1;    count++;  }  printf("It is the no "%d" bit from the left", (count+1));    }

#1


51  

GCC has:

GCC有:

 -- Built-in Function: int __builtin_clz (unsigned int x)     Returns the number of leading 0-bits in X, starting at the most     significant bit position.  If X is 0, the result is undefined. -- Built-in Function: int __builtin_clzl (unsigned long)     Similar to `__builtin_clz', except the argument type is `unsigned     long'. -- Built-in Function: int __builtin_clzll (unsigned long long)     Similar to `__builtin_clz', except the argument type is `unsigned     long long'.

I'd expect them to be translated into something reasonably efficient for your current platform, whether it be one of those fancy bit-twiddling algorithms, or a single instruction.

我希望它们能被翻译成一种对你目前的平台来说相当有效的东西,无论它是一种花哨的小算法,还是一种指令。

#2


38  

Assuming you're on x86 and game for a bit of inline assembler, Intel provides a BSR instruction ("bit scan reverse"). It's fast on some x86s (microcoded on others). From the manual:

假设您正在使用x86和game进行一些内联汇编,Intel提供了一个BSR指令(“位扫描反向”)。它在某些x86s上运行得很快(在其他一些上进行微编码)。从手册:

Searches the source operand for the most significant set bit (1 bit). If a most significant 1 bit is found, its bit index is stored in the destination operand. The source operand can be a register or a memory location; the destination operand is a register. The bit index is an unsigned offset from bit 0 of the source operand. If the content source operand is 0, the content of the destination operand is undefined.

在源操作数中搜索最重要的设置位(1位)。如果找到一个最重要的1位,它的位索引将存储在目标操作数中。源操作数可以是寄存器或内存位置;目标操作数是一个寄存器。位索引是源操作数的位0的无符号偏移量。如果内容源操作数为0,则目标操作数的内容未定义。

(If you're on PowerPC there's a similar cntlz ("count leading zeros") instruction.)

(如果你在PowerPC上,也有类似的cntlz(“计数前导零”)指令。)

Example code for gcc:

gcc的示例代码:

#include <iostream>int main (int,char**){  int n=1;  for (;;++n) {    int msb;    asm("bsrl %1,%0" : "=r"(msb) : "r"(n));    std::cout << n << " : " << msb << std::endl;  }  return 0;}

See also this inline assembler tutorial, which shows (section 9.4) it being considerably faster than looping code.

请参阅这个内联汇编器教程,它显示(第9.4节)它比循环代码快得多。

#3


34  

Since 2^N is an integer with only the Nth bit set (1 << N), finding the position (N) of the highest set bit is the integer log base 2 of that integer.

因为2 ^ N是一个整数只有第N个点集(1 < < N),找到最高的位置(N)设置是整数log 2的整数。

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

http://graphics.stanford.edu/ ~ seander / bithacks.html # IntegerLogObvious

unsigned int v;unsigned r = 0;while (v >>= 1) {    r++;}

This "obvious" algorithm may not be transparent to everyone, but when you realize that the code shifts right by one bit repeatedly until the leftmost bit has been shifted off (note that C treats any non-zero value as true) and returns the number of shifts, it makes perfect sense. It also means that it works even when more than one bit is set — the result is always for the most significant bit.

这种“明显的”算法可能对每个人都不透明,但是当您意识到代码反复地向右移动一点,直到最左边的位被移走(注意C将任何非零值视为true)并返回移位数时,这是完全合理的。它还意味着,即使设置了多个位,它仍然可以工作——结果总是为最重要的位。

If you scroll down on that page, there are faster, more complex variations. However, if you know you're dealing with numbers with a lot of leading zeroes, the naive approach may provide acceptable speed, since bit shifting is rather fast in C, and the simple algorithm doesn't require indexing an array.

如果你向下滚动页面,会有更快、更复杂的变化。但是,如果您知道您正在处理的数字有很多前导0,那么这种简单的方法可能提供可接受的速度,因为在C语言中位移速度相当快,而且简单的算法不需要对数组进行索引。

NOTE: When using 64-bit values, be extremely cautious about using extra-clever algorithms; many of them only work correctly for 32-bit values.

注意:使用64位值时,要特别小心使用超智能算法;它们中的许多只适用于32位的值。

#4


15  

This should be lightning fast:

这应该是闪电速度:

int msb(unsigned int v) {  static const int pos[32] = {0, 1, 28, 2, 29, 14, 24, 3,    30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,    16, 7, 26, 12, 18, 6, 11, 5, 10, 9};  v |= v >> 1;  v |= v >> 2;  v |= v >> 4;  v |= v >> 8;  v |= v >> 16;  v = (v >> 1) + 1;  return pos[(v * 0x077CB531UL) >> 27];}

#5


12  

This is sort of like finding a kind of integer log. There are bit-twiddling tricks, but I've made my own tool for this. The goal of course is for speed.

这有点像找到一种整数对数。这里有一些小技巧,但我自己做了一个工具。目标当然是速度。

My realization is that the CPU has an automatic bit-detector already, used for integer to float conversion! So use that.

我的意识是CPU已经有一个自动位检测器,用于整型浮点转换!所以使用它。

double ff=(double)(v|1);return ((*(1+(uint32_t *)&ff))>>20)-1023;  // assumes x86 endianness

This version casts the value to a double, then reads off the exponent, which tells you where the bit was. The fancy shift and subtract is to extract the proper parts from the IEEE value.

这个版本将值转换为double,然后读取指数,它告诉您位在哪里。巧妙的变换和减法是从IEEE值中提取适当的部分。

It's slightly faster to use floats, but a float can only give you the first 24 bit positions because of its smaller precision.

使用浮点数稍微快一点,但是由于浮点数的精度较小,它只能提供最初的24位位置。


To do this safely, without undefined behaviour in C++ or C, use memcpy instead of pointer casting for type-punning. Compilers know how to inline it efficiently.

为了安全地实现这一点,在c++或C中没有未定义的行为,使用memcpy而不是指针强制类型转换。编译器知道如何高效地内联它。

// static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double isn't 8-byte IEEE binary64");// and also static_assert something about FLT_ENDIAN?double ff=(double)(v|1);uint32_t tmp;memcpy(&tmp, ((const char*)&ff)+sizeof(uint32_t), sizeof(uint32_t));return (tmp>>20)-1023;

Or in C99 and later, use a union {double d; uint32_t u[2];};. But note that in C++, union type punning is only supported on some compilers as an extension, not in ISO C++.

或在C99或以后,使用联合{双d;uint32_t u[2];}。但是请注意,在c++中,联合类型punning只支持一些编译器作为扩展,而不是在ISO c++中。


This will usually be slower than a platform-specific intrinsic for a leading-zeros counting instruction, but portable ISO C has no such function. Some CPUs also lack a leading-zero counting instruction, but some of those can efficiently convert integers to double. Type-punning an FP bit pattern back to integer can be slow, though (e.g. on PowerPC it requires a store/reload and usually causes a load-hit-store stall).

这通常要比特定于平台的、用于开始- 0计数指令的固有程序慢,但是可移植的ISO C没有这样的功能。有些cpu也缺少一个开始为零的计数指令,但是有些cpu可以有效地将整数转换为双精度。但是,将FP位模式类型转换为integer可能很慢(例如,在PowerPC上,它需要一个存储/重载,通常会导致加载-hit-store失速)。

This algorithm could potentially be useful for SIMD implementations, because fewer CPUs have SIMD lzcnt. x86 only got such an instruction with AVX512CD

该算法可能对SIMD实现有用,因为较少的cpu有SIMD lzcnt。x86只能用AVX512CD得到这样的指令

#6


9  

Kaz Kylheku here

Kaz Kylheku这里

I benchmarked two approaches for this over 63 bit numbers (the long long type on gcc x86_64), staying away from the sign bit.

我为超过63位的数字(gcc x86_64上的长类型)设置了两种方法,远离符号位。

(I happen to need this "find highest bit" for something, you see.)

(你知道,我碰巧需要这个“找到最高位”。)

I implemented the data-driven binary search (closely based on one of the above answers). I also implemented a completely unrolled decision tree by hand, which is just code with immediate operands. No loops, no tables.

我实现了数据驱动的二进制搜索(紧密基于上面的一个答案)。我还手工实现了一个完全未滚动的决策树,它只是带有即时操作数的代码。没有循环,没有表。

The decision tree (highest_bit_unrolled) benchmarked to be 69% faster, except for the n = 0 case for which the binary search has an explicit test.

决策树(highest_bit_unroll)基准测试的速度要快69%,除了在n = 0的情况下,二进制搜索有一个显式测试。

The binary-search's special test for 0 case is only 48% faster than the decision tree, which does not have a special test.

二值搜索对于0的特殊测试只比没有特殊测试的决策树快48%。

Compiler, machine: (GCC 4.5.2, -O3, x86-64, 2867 Mhz Intel Core i5).

编译器,机器(GCC 4.5.2, -O3, x86-64, 2867mhz Intel Core i5)。

int highest_bit_unrolled(long long n){  if (n & 0x7FFFFFFF00000000) {    if (n & 0x7FFF000000000000) {      if (n & 0x7F00000000000000) {        if (n & 0x7000000000000000) {          if (n & 0x4000000000000000)            return 63;          else            return (n & 0x2000000000000000) ? 62 : 61;        } else {          if (n & 0x0C00000000000000)            return (n & 0x0800000000000000) ? 60 : 59;          else            return (n & 0x0200000000000000) ? 58 : 57;        }      } else {        if (n & 0x00F0000000000000) {          if (n & 0x00C0000000000000)            return (n & 0x0080000000000000) ? 56 : 55;          else            return (n & 0x0020000000000000) ? 54 : 53;        } else {          if (n & 0x000C000000000000)            return (n & 0x0008000000000000) ? 52 : 51;          else            return (n & 0x0002000000000000) ? 50 : 49;        }      }    } else {      if (n & 0x0000FF0000000000) {        if (n & 0x0000F00000000000) {          if (n & 0x0000C00000000000)            return (n & 0x0000800000000000) ? 48 : 47;          else            return (n & 0x0000200000000000) ? 46 : 45;        } else {          if (n & 0x00000C0000000000)            return (n & 0x0000080000000000) ? 44 : 43;          else            return (n & 0x0000020000000000) ? 42 : 41;        }      } else {        if (n & 0x000000F000000000) {          if (n & 0x000000C000000000)            return (n & 0x0000008000000000) ? 40 : 39;          else            return (n & 0x0000002000000000) ? 38 : 37;        } else {          if (n & 0x0000000C00000000)            return (n & 0x0000000800000000) ? 36 : 35;          else            return (n & 0x0000000200000000) ? 34 : 33;        }      }    }  } else {    if (n & 0x00000000FFFF0000) {      if (n & 0x00000000FF000000) {        if (n & 0x00000000F0000000) {          if (n & 0x00000000C0000000)            return (n & 0x0000000080000000) ? 32 : 31;          else            return (n & 0x0000000020000000) ? 30 : 29;        } else {          if (n & 0x000000000C000000)            return (n & 0x0000000008000000) ? 28 : 27;          else            return (n & 0x0000000002000000) ? 26 : 25;        }      } else {        if (n & 0x0000000000F00000) {          if (n & 0x0000000000C00000)            return (n & 0x0000000000800000) ? 24 : 23;          else            return (n & 0x0000000000200000) ? 22 : 21;        } else {          if (n & 0x00000000000C0000)            return (n & 0x0000000000080000) ? 20 : 19;          else            return (n & 0x0000000000020000) ? 18 : 17;        }      }    } else {      if (n & 0x000000000000FF00) {        if (n & 0x000000000000F000) {          if (n & 0x000000000000C000)            return (n & 0x0000000000008000) ? 16 : 15;          else            return (n & 0x0000000000002000) ? 14 : 13;        } else {          if (n & 0x0000000000000C00)            return (n & 0x0000000000000800) ? 12 : 11;          else            return (n & 0x0000000000000200) ? 10 : 9;        }      } else {        if (n & 0x00000000000000F0) {          if (n & 0x00000000000000C0)            return (n & 0x0000000000000080) ? 8 : 7;          else            return (n & 0x0000000000000020) ? 6 : 5;        } else {          if (n & 0x000000000000000C)            return (n & 0x0000000000000008) ? 4 : 3;          else            return (n & 0x0000000000000002) ? 2 : (n ? 1 : 0);        }      }    }  }}int highest_bit(long long n){  const long long mask[] = {    0x000000007FFFFFFF,    0x000000000000FFFF,    0x00000000000000FF,    0x000000000000000F,    0x0000000000000003,    0x0000000000000001  };  int hi = 64;  int lo = 0;  int i = 0;  if (n == 0)    return 0;  for (i = 0; i < sizeof mask / sizeof mask[0]; i++) {    int mi = lo + (hi - lo) / 2;    if ((n >> mi) != 0)      lo = mi;    else if ((n & (mask[i] << lo)) != 0)      hi = mi;  }  return lo + 1;}

Quick and dirty test program:

快速检测程序:

#include <stdio.h>#include <time.h>#include <stdlib.h>int highest_bit_unrolled(long long n);int highest_bit(long long n);main(int argc, char **argv){  long long n = strtoull(argv[1], NULL, 0);  int b1, b2;  long i;  clock_t start = clock(), mid, end;  for (i = 0; i < 1000000000; i++)    b1 = highest_bit_unrolled(n);  mid = clock();  for (i = 0; i < 1000000000; i++)    b2 = highest_bit(n);  end = clock();  printf("highest bit of 0x%llx/%lld = %d, %d\n", n, n, b1, b2);  printf("time1 = %d\n", (int) (mid - start));  printf("time2 = %d\n", (int) (end - mid));  return 0;}

Using only -O2, the difference becomes greater. The decision tree is almost four times faster.

只用-O2,差别就更大了。决策树几乎快了四倍。

I also benchmarked against the naive bit shifting code:

我还以幼稚的比特转换代码为基准:

int highest_bit_shift(long long n){  int i = 0;  for (; n; n >>= 1, i++)    ; /* empty */  return i;}

This is only fast for small numbers, as one would expect. In determining that the highest bit is 1 for n == 1, it benchmarked more than 80% faster. However, half of randomly chosen numbers in the 63 bit space have the 63rd bit set!

正如人们所预期的那样,这对于小数字来说是非常快的。在确定n = 1的最高位为1时,它的基准测试速度要快80%以上。然而,在63位空间中随机选择的数字中,有一半是第63位!

On the input 0x3FFFFFFFFFFFFFFF, the decision tree version is quite a bit faster than it is on 1, and shows to be 1120% faster (12.2 times) than the bit shifter.

在输入0x3fffffffffffffffffffffff上,决策树版本比在1上快得多,显示比位移位器快1120%(12.2倍)。

I will also benchmark the decision tree against the GCC builtins, and also try a mixture of inputs rather than repeating against the same number. There may be some sticking branch prediction going on and perhaps some unrealistic caching scenarios which makes it artificially faster on repetitions.

我还将基准决策树与GCC构建相比较,并尝试混合输入,而不是重复相同的数字。可能会有一些关于分支的预测,可能会有一些不切实际的缓存场景,使其在重复的过程中人为地更快。

#7


7  

What about

是什么

int highest_bit(unsigned int a) {    int count;    std::frexp(a, &count);    return count - 1;}

?

吗?

#8


6  

unsigned intmsb32(register unsigned int x){        x |= (x >> 1);        x |= (x >> 2);        x |= (x >> 4);        x |= (x >> 8);        x |= (x >> 16);        return(x & ~(x >> 1));}

1 register, 13 instructions. Believe it or not, this is usually faster than the BSR instruction mentioned above, which operates in linear time. This is logarithmic time.

1注册,13指令。信不信由你,这通常比上面提到的BSR指令要快,它在线性时间内运行。这是对数时间。

From http://aggregate.org/MAGIC/#Most%20Significant%201%20Bit

从http://aggregate.org/MAGIC/ # % 20大多数重要% 20位201%

#9


5  

Here are some (simple) benchmarks, of algorithms currently given on this page...

以下是一些(简单的)基准,是本页当前给出的算法……

The algorithms have not been tested over all inputs of unsigned int; so check that first, before blindly using something ;)

算法尚未对所有无符号整型输入进行测试;所以,在盲目使用某样东西之前,先检查一下。

On my machine clz (__builtin_clz) and asm work best. asm seems even faster then clz... but it might be due to the simple benchmark...

在我的机器上,clz (__builtin_clz)和asm工作得最好。asm似乎比clz还要快……但这可能是由于简单的基准……

//////// go.c ///////////////////////////////// compile with:  gcc go.c -o go -lm#include <math.h>#include <stdio.h>#include <stdlib.h>#include <time.h>/***************** math ********************/#define POS_OF_HIGHESTBITmath(a) /* 0th position is the Least-Signif-Bit */    \  ((unsigned) log2(a))         /* thus: do not use if a <= 0 */  #define NUM_OF_HIGHESTBITmath(a) ((a)               \                  ? (1U << POS_OF_HIGHESTBITmath(a))    \                  : 0)/***************** clz ********************/unsigned NUM_BITS_U = ((sizeof(unsigned) << 3) - 1);#define POS_OF_HIGHESTBITclz(a) (NUM_BITS_U - __builtin_clz(a)) /* only works for a != 0 */#define NUM_OF_HIGHESTBITclz(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITclz(a))  \                 : 0)/***************** i2f ********************/double FF;#define POS_OF_HIGHESTBITi2f(a) (FF = (double)(ui|1), ((*(1+(unsigned*)&FF))>>20)-1023)#define NUM_OF_HIGHESTBITi2f(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITi2f(a))  \                 : 0)/***************** asm ********************/unsigned OUT;#define POS_OF_HIGHESTBITasm(a) (({asm("bsrl %1,%0" : "=r"(OUT) : "r"(a));}), OUT)#define NUM_OF_HIGHESTBITasm(a) ((a)                    \                 ? (1U << POS_OF_HIGHESTBITasm(a))  \                 : 0)/***************** bitshift1 ********************/#define NUM_OF_HIGHESTBITbitshift1(a) (({   \  OUT = a;                  \  OUT |= (OUT >> 1);                \  OUT |= (OUT >> 2);                \  OUT |= (OUT >> 4);                \  OUT |= (OUT >> 8);                \  OUT |= (OUT >> 16);               \      }), (OUT & ~(OUT >> 1)))          \/***************** bitshift2 ********************/int POS[32] = {0, 1, 28, 2, 29, 14, 24, 3,             30, 22, 20, 15, 25, 17, 4, 8, 31, 27, 13, 23, 21, 19,             16, 7, 26, 12, 18, 6, 11, 5, 10, 9};#define POS_OF_HIGHESTBITbitshift2(a) (({   \  OUT = a;                  \  OUT |= OUT >> 1;              \  OUT |= OUT >> 2;              \  OUT |= OUT >> 4;              \  OUT |= OUT >> 8;              \  OUT |= OUT >> 16;             \  OUT = (OUT >> 1) + 1;             \      }), POS[(OUT * 0x077CB531UL) >> 27])#define NUM_OF_HIGHESTBITbitshift2(a) ((a)              \                       ? (1U << POS_OF_HIGHESTBITbitshift2(a)) \                       : 0)#define LOOPS 100000000Uint main(){  time_t start, end;  unsigned ui;  unsigned n;  /********* Checking the first few unsigned values (you'll need to check all if you want to use an algorithm here) **************/  printf("math\n");  for (ui = 0U; ui < 18; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITmath(ui));  printf("\n\n");  printf("clz\n");  for (ui = 0U; ui < 18U; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITclz(ui));  printf("\n\n");  printf("i2f\n");  for (ui = 0U; ui < 18U; ++ui)    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITi2f(ui));  printf("\n\n");  printf("asm\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITasm(ui));  }  printf("\n\n");  printf("bitshift1\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift1(ui));  }  printf("\n\n");  printf("bitshift2\n");  for (ui = 0U; ui < 18U; ++ui) {    printf("%i\t%i\n", ui, NUM_OF_HIGHESTBITbitshift2(ui));  }  printf("\n\nPlease wait...\n\n");  /************************* Simple clock() benchmark ******************/  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITmath(ui);  end = clock();  printf("math:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITclz(ui);  end = clock();  printf("clz:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITi2f(ui);  end = clock();  printf("i2f:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITasm(ui);  end = clock();  printf("asm:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITbitshift1(ui);  end = clock();  printf("bitshift1:\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  start = clock();  for (ui = 0; ui < LOOPS; ++ui)    n = NUM_OF_HIGHESTBITbitshift2(ui);  end = clock();  printf("bitshift2\t%e\n", (double)(end-start)/CLOCKS_PER_SEC);  printf("\nThe lower, the better. Take note that a negative exponent is good! ;)\n");  return EXIT_SUCCESS;}

#10


5  

Although I would probably only use this method if I absolutely required the best possible performance (e.g. for writing some sort of board game AI involving bitboards), the most efficient solution is to use inline ASM. See the Optimisations section of this blog post for code with an explanation.

尽管我可能只会使用这种方法,如果我绝对需要最好的性能(例如,编写一些包含位板的棋盘游戏AI),最有效的解决方案是使用内联ASM。请参阅本博客文章的Optimisations部分以获得解释。

[...], the bsrl assembly instruction computes the position of the most significant bit. Thus, we could use this asm statement:

[…, bsrl汇编指令计算最重要位的位置。因此,我们可以使用这个asm语句:

asm ("bsrl %1, %0"      : "=r" (position)      : "r" (number));

#11


4  

I had a need for a routine to do this and before searching the web (and finding this page) I came up with my own solution basedon a binary search. Although I'm sure someone has done this before! It runs in constant time and can be faster than the "obvious" solution posted, although I'm not making any great claims, just posting it for interest.

我需要一个例行程序来完成这个任务,在搜索web(以及找到这个页面)之前,我找到了我自己的解决方案basedon,一个二分查找。虽然我肯定有人曾经做过!它运行的时间是恒定的,而且可能比发布的“显而易见的”解决方案要快,尽管我并没有发表任何伟大的声明,只是出于兴趣发布它。

int highest_bit(unsigned int a) {  static const unsigned int maskv[] = { 0xffff, 0xff, 0xf, 0x3, 0x1 };  const unsigned int *mask = maskv;  int l, h;  if (a == 0) return -1;  l = 0;  h = 32;  do {    int m = l + (h - l) / 2;    if ((a >> m) != 0) l = m;    else if ((a & (*mask << l)) != 0) h = m;    mask++;  } while (l < h - 1);  return l;}

#12


3  

As the answers above point out, there are a number of ways to determine the most significant bit. However, as was also pointed out, the methods are likely to be unique to either 32bit or 64bit registers. The stanford.edu bithacks page provides solutions that work for both 32bit and 64bit computing. With a little work, they can be combined to provide a solid cross-architecture approach to obtaining the MSB. The solution I arrived at that compiled/worked across 64 & 32 bit computers was:

正如上面的答案所指出的,有很多方法可以确定最重要的部分。然而,正如所指出的,这些方法可能对32位或64位寄存器都是唯一的。stanford。edu bithacks页面提供适用于32位和64位计算的解决方案。只需做一点工作,它们就可以组合在一起,为获得MSB提供可靠的跨体系结构方法。我在64位和32位计算机上得到的解决方案是:

#if defined(__LP64__) || defined(_LP64)# define BUILD_64   1#endif#include <stdio.h>#include <stdint.h>  /* for uint32_t *//* CHAR_BIT  (or include limits.h) */#ifndef CHAR_BIT#define CHAR_BIT  8#endif  /* CHAR_BIT *//*  * Find the log base 2 of an integer with the MSB N set in O(N) * operations. (on 64bit & 32bit architectures) */intgetmsb (uint32_t word){    int r = 0;    if (word < 1)        return 0;#ifdef BUILD_64    union { uint32_t u[2]; double d; } t;  // temp    t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] = 0x43300000;    t.u[__FLOAT_WORD_ORDER!=LITTLE_ENDIAN] = word;    t.d -= 4503599627370496.0;    r = (t.u[__FLOAT_WORD_ORDER==LITTLE_ENDIAN] >> 20) - 0x3FF;#else    while (word >>= 1)    {        r++;    }#endif  /* BUILD_64 */    return r;}

#13


3  

A version in C using successive approximation:

使用逐次逼近的C版本:

unsigned int getMsb(unsigned int n){  unsigned int msb  = sizeof(n) * 4;  unsigned int step = msb;  while (step > 1) {    step /=2;    if (n>>msb)     msb += step;   else     msb -= step; }  if (n>>msb)    msb++;  return (msb - 1);}

Advantage: the running time is constant regardless of the provided number, as the number of loops are always the same.( 4 loops when using "unsigned int")

优点:运行时间是常量,不管所提供的数量是多少,因为循环的数量总是相同的。(使用“无符号int”时循环4次)

#14


3  

thats some kind of binary search, it works with all kinds of (unsigned!) integer types

这是一种二进制搜索,它适用于所有类型的(无符号!)整数类型

#include <climits>#define UINT (unsigned int)#define UINT_BIT (CHAR_BIT*sizeof(UINT))int msb(UINT x){    if(0 == x)        return -1;    int c = 0;    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)    if(static_cast<UINT>(x >> i))    {        x >>= i;        c |= i;    }    return c;}

to make complete:

完成:

#include <climits>#define UINT unsigned int#define UINT_BIT (CHAR_BIT*sizeof(UINT))int lsb(UINT x){    if(0 == x)        return -1;    int c = UINT_BIT-1;    for(UINT i=UINT_BIT>>1; 0<i; i>>=1)    if(static_cast<UINT>(x << i))    {        x <<= i;        c ^= i;    }    return c;}

#15


3  

Some overly complex answers here. The Debruin technique should only be used when the input is already a power of two, otherwise there's a better way. For a power of 2 input, Debruin is the absolute fastest, even faster than _BitScanReverse on any processor I've tested. However, in the general case, _BitScanReverse (or whatever the intrinsic is called in your compiler) is the fastest (on certain CPU's it can be microcoded though).

这里有一些过于复杂的答案。Debruin技术应该只在输入已经是2的幂时使用,否则会有更好的方法。对于2输入的能力,Debruin是绝对最快的,甚至比我测试过的任何处理器上的_BitScanReverse都要快。然而,在一般情况下,_BitScanReverse(或者无论在编译器中调用什么本质)是最快的(在某些CPU上它可以被微编码)。

If the intrinsic function is not an option, here is an optimal software solution for processing general inputs.

如果内部函数不是一个选项,这里有一个处理一般输入的最佳软件解决方案。

u8  inline log2 (u32 val)  {    u8  k = 0;    if (val > 0x0000FFFFu) { val >>= 16; k  = 16; }    if (val > 0x000000FFu) { val >>= 8;  k |= 8;  }    if (val > 0x0000000Fu) { val >>= 4;  k |= 4;  }    if (val > 0x00000003u) { val >>= 2;  k |= 2;  }    k |= (val & 2) >> 1;    return k;}

Note that this version does not require a Debruin lookup at the end, unlike most of the other answers. It computes the position in place.

注意,这个版本不需要在末尾进行Debruin查找,这与其他大多数答案不同。它计算位置。

Tables can be preferable though, if you call it repeatedly enough times, the risk of a cache miss becomes eclipsed by the speedup of a table.

但是,如果重复调用表的次数足够多,则表的加速比缓存丢失的风险要大。

u8 kTableLog2[256] = {0,0,1,1,2,2,2,2,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7};u8 log2_table(u32 val)  {    u8  k = 0;    if (val > 0x0000FFFFuL) { val >>= 16; k  = 16; }    if (val > 0x000000FFuL) { val >>=  8; k |=  8; }    k |= kTableLog2[val]; // precompute the Log2 of the low byte    return k;}

This should produce the highest throughput of any of the software answers given here, but if you only call it occasionally, prefer a table-free solution like my first snippet.

这应该会产生最高的吞吐量,在这里给出的任何软件答案中,但是如果您只是偶尔调用它,那么更喜欢像我的第一个代码片段那样的无表解决方案。

#16


3  

has given us log2. This removes the need for all the special sauce log2 implementations you see on this page. You can use the standard's log2 implementation like this:

c99给出了log2。这消除了您在此页上看到的所有特殊的酱汁log2实现的需要。您可以使用标准的log2实现,如下所示:

const auto n = 13UL;const auto Index = (unsigned long)log2(n);printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

An n of 0UL needs to be guarded against as well, because:

一个n的0UL也需要防备,因为:

-∞ is returned and FE_DIVBYZERO is raised

——∞返回和FE_DIVBYZERO

I have written an example with that check that arbitrarily sets Index to ULONG_MAX here: https://ideone.com/u26vsi

我已经编写了一个示例,其中的检查将索引任意设置为ULONG_MAX: https://ideone.com/u26vsi


The corollary to ephemient's gcc only answer is:

视觉工作室推论ephemient的gcc唯一答案是:

const auto n = 13UL;unsigned long Index;_BitScanReverse(&Index, n);printf("MSB is: %u\n", Index); // Prints 3 (zero offset)

The documentation for _BitScanReverse states that Index is:

_BitScanReverse文件表明索引是:

Loaded with the bit position of the first set bit (1) found

加载找到的第一个set位(1)的位位置

In practice I've found that if n is 0UL that Index is set to 0UL, just as it would be for an n of 1UL. But the only thing guaranteed in the documentation in the case of an n of 0UL is that the return is:

在实践中,我发现如果n是0UL,那么该索引将被设置为0UL,就像对于n (1UL)一样。但在文档中,对于n (0UL)的情况,唯一保证的是返回值是:

0 if no set bits were found

如果没有找到集合点。

Thus, similarly to the preferable log2 implementation above the return should be checked setting Index to a flagged value in this case. I've again written an example of using ULONG_MAX for this flag value here: http://rextester.com/GCU61409

因此,在这种情况下,同样地,在返回值之上的log2实现应该检查设置索引到标记值。我再次为这里的标志值使用ULONG_MAX编写了一个示例:http://rextester.com/GCU61409

#17


2  

Think bitwise operators.

认为逐位运算符。

I missunderstood the question the first time. You should produce an int with the leftmost bit set (the others zero). Assuming cmp is set to that value:

我第一次误解了这个问题。您应该使用最左边的位集(其余的为0)生成一个int。假设cmp设为该值:

position = sizeof(int)*8while(!(n & cmp)){    n <<=1;   position--;}

#18


2  

Expanding on Josh's benchmark...one can improve the clz as follows

扩大Josh基准…可以改进clz,如下所示

/***************** clz2 ********************/#define NUM_OF_HIGHESTBITclz2(a) ((a)                              \                  ? (((1U) << (sizeof(unsigned)*8-1)) >> __builtin_clz(a)) \                  : 0)

Regarding the asm: note that there are bsr and bsrl (this is the "long" version). the normal one might be a bit faster.

关于asm:注意有bsr和bsrl(这是“long”版本)。正常的可能要快一点。

#19


1  

Note that what you are trying to do is calculate the integer log2 of an integer,

注意,你要做的是计算一个整数的log2,

#include <stdio.h>#include <stdlib.h>unsigned intLog2(unsigned long x){    unsigned long n = x;    int bits = sizeof(x)*8;    int step = 1; int k=0;    for( step = 1; step < bits; ) {        n |= (n >> step);        step *= 2; ++k;    }    //printf("%ld %ld\n",x, (x - (n >> 1)) );    return(x - (n >> 1));}

Observe that you can attempt to search more than 1 bit at a time.

注意,您可以尝试每次搜索超过1位。

unsigned intLog2_a(unsigned long x){    unsigned long n = x;    int bits = sizeof(x)*8;    int step = 1;    int step2 = 0;    //observe that you can move 8 bits at a time, and there is a pattern...    //if( x>1<<step2+8 ) { step2+=8;        //if( x>1<<step2+8 ) { step2+=8;            //if( x>1<<step2+8 ) { step2+=8;            //}        //}    //}    for( step2=0; x>1L<<step2+8; ) {        step2+=8;    }    //printf("step2 %d\n",step2);    for( step = 0; x>1L<<(step+step2); ) {        step+=1;        //printf("step %d\n",step+step2);    }    printf("log2(%ld) %d\n",x,step+step2);    return(step+step2);}

This approach uses a binary search

这种方法使用二进制搜索

unsigned intLog2_b(unsigned long x){    unsigned long n = x;    unsigned int bits = sizeof(x)*8;    unsigned int hbit = bits-1;    unsigned int lbit = 0;    unsigned long guess = bits/2;    int found = 0;    while ( hbit-lbit>1 ) {        //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        //when value between guess..lbit        if( (x<=(1L<<guess)) ) {           //printf("%ld < 1<<%d %ld\n",x,guess,1L<<guess);            hbit=guess;            guess=(hbit+lbit)/2;            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        }        //when value between hbit..guess        //else        if( (x>(1L<<guess)) ) {            //printf("%ld > 1<<%d %ld\n",x,guess,1L<<guess);            lbit=guess;            guess=(hbit+lbit)/2;            //printf("log2(%ld) %d<%d<%d\n",x,lbit,guess,hbit);        }    }    if( (x>(1L<<guess)) ) ++guess;    printf("log2(x%ld)=r%d\n",x,guess);    return(guess);}

Another binary search method, perhaps more readable,

另一个二分查找方法,也许更容易读,

unsigned intLog2_c(unsigned long x){    unsigned long v = x;    unsigned int bits = sizeof(x)*8;    unsigned int step = bits;    unsigned int res = 0;    for( step = bits/2; step>0; )    {        //printf("log2(%ld) v %d >> step %d = %ld\n",x,v,step,v>>step);        while ( v>>step ) {            v>>=step;            res+=step;            //printf("log2(%ld) step %d res %d v>>step %ld\n",x,step,res,v);        }        step /= 2;    }    if( (x>(1L<<res)) ) ++res;    printf("log2(x%ld)=r%ld\n",x,res);    return(res);}

And because you will want to test these,

因为你想测试这些,

int main(){    unsigned long int x = 3;    for( x=2; x<1000000000; x*=2 ) {        //printf("x %ld, x+1 %ld, log2(x+1) %d\n",x,x+1,Log2(x+1));        printf("x %ld, x+1 %ld, log2_a(x+1) %d\n",x,x+1,Log2_a(x+1));        printf("x %ld, x+1 %ld, log2_b(x+1) %d\n",x,x+1,Log2_b(x+1));        printf("x %ld, x+1 %ld, log2_c(x+1) %d\n",x,x+1,Log2_c(x+1));    }    return(0);}

#20


1  

Putting this in since it's 'yet another' approach, seems to be different from others already given.

既然这是“另一种”方法,那么把它放在这里似乎与已经给出的其他方法不同。

returns -1 if x==0, otherwise floor( log2(x)) (max result 31)

如果x= 0,则返回-1,否则返回层(log2(x))(最大结果31)

Reduce from 32 to 4 bit problem, then use a table. Perhaps inelegant, but pragmatic.

从32位问题减少到4位问题,然后使用表。也许不雅,但务实。

This is what I use when I don't want to use __builtin_clz because of portability issues.

由于可移植性问题,当我不想使用__builtin_clz时,我就使用这种方法。

To make it more compact, one could instead use a loop to reduce, adding 4 to r each time, max 7 iterations. Or some hybrid, such as (for 64 bits): loop to reduce to 8, test to reduce to 4.

为了使它更紧凑,可以使用循环来减少,每次增加4到r,最多7次迭代。或者有些混合,如(64位):循环减少到8,测试减少到4。

int log2floor( unsigned x ){   static const signed char wtab[16] = {-1,0,1,1, 2,2,2,2, 3,3,3,3,3,3,3,3};   int r = 0;   unsigned xk = x >> 16;   if( xk != 0 ){       r = 16;       x = xk;   }   // x is 0 .. 0xFFFF   xk = x >> 8;   if( xk != 0){       r += 8;       x = xk;   }   // x is 0 .. 0xFF   xk = x >> 4;   if( xk != 0){       r += 4;       x = xk;   }   // now x is 0..15; x=0 only if originally zero.   return r + wtab[x];}

#21


1  

I know this question is very old, but just having implemented an msb() function myself,I found that most solutions presented here and on other websites are not necessarily the most efficient - at least for my personal definition of efficiency (see also Update below). Here's why:

我知道这个问题很老了,但是仅仅是我自己实现了msb()函数,我就发现这里和其他网站上提供的大多数解决方案不一定是最有效的——至少对于我个人对效率的定义是这样的(参见下面的更新)。原因如下:

Most solutions (especially those which employ some sort of binary search scheme or the naïve approach which does a linear scan from right to left) seem to neglect the fact that for arbitrary binary numbers, there are not many which start with a very long sequence of zeros. In fact, for any bit-width, half of all integers start with a 1 and a quarter of them start with 01.See where i'm getting at? My argument is that a linear scan starting from the most significant bit position to the least significant (left to right) is not so "linear" as it might look like at first glance.

大多数解决方案(特别是那些采用某种二进制搜索方案或从右到左进行线性扫描的简单方法)似乎忽略了一个事实,即对于任意的二进制数,没有多少从一个非常长的零序列开始。事实上,对于任何位宽,所有整数的一半以1开头,四分之一以01开头。明白我的意思了吗?我的论点是,从最重要的位位置到最不重要的(左到右)的线性扫描不是那么“线性”的,就像乍看上去的那样。

It can be shown1, that for any bit-width, the average number of bits that need to be tested is at most 2. This translates to an amortized time complexity of O(1) with respect to the number of bits (!).

可以说明一点,对于任何位宽,需要测试的位的平均数目最多为2。这转化为O(1)相对于比特数(!)的平摊时间复杂度。

Of course, the worst case is still O(n), worse than the O(log(n)) you get with binary-search-like approaches, but since there are so few worst cases, they are negligible for most applications (Update: not quite: There may be few, but they might occur with high probability - see Update below).

当然,最坏的情况仍然是O(n),比用类二进制搜索的方法得到的O(log(n)更糟糕,但是由于最坏的情况非常少,所以对大多数应用程序来说它们是可以忽略的(Update:不完全是这样:可能很少,但是很有可能会发生,请参阅下面的Update)。

Here is the "naïve" approach i've come up with, which at least on my machine beats most other approaches (binary search schemes for 32-bit ints always require log2(32) = 5 steps, whereas this silly algorithm requires less than 2 on average) - sorry for this being C++ and not pure C:

这是我提出了“幼稚”方法,至少在我的机器上比大多数其他方法(32位整数二进制搜索方案总是需要log2(32)= 5步骤,而这愚蠢的算法需要平均小于2)——对不起,这是c++和不纯C:

template <typename T>auto msb(T n) -> int{    static_assert(std::is_integral<T>::value && !std::is_signed<T>::value,        "msb<T>(): T must be an unsigned integral type.");    for (T i = std::numeric_limits<T>::digits - 1, mask = 1 << i; i >= 0; --i, mask >>= 1)    {        if ((n & mask) != 0)            return i;    }    return 0;}

Update: While what i wrote here is perfectly true for arbitrary integers, where every combination of bits is equally probable (my speed test simply measured how long it took to determine the MSB for all 32-bit integers), real-life integers, for which such a function will be called, usually follow a different pattern: In my code, for example, this function is used to determine whether an object size is a power of 2, or to find the next power of 2 greater or equal than an object size.My guess is that most applications using the MSB involve numbers which are much smaller than the maximum number an integer can represent (object sizes rarely utilize all the bits in a size_t). In this case, my solution will actually perform worse than a binary search approach - so the latter should probably be preferred, even though my solution will be faster looping through all integers.
TL;DR: Real-life integers will probably have a bias towards the worst case of this simple algorithm, which will make it perform worse in the end - despite the fact that it's amortized O(1) for truly arbitrary integers.

更新:虽然我在这里写的是完全适用于任意整数,每一位同样可能的组合(速度测试仅仅用了多长时间来确定测量MSB所有32位整数),真实的整数,而这样一个函数将被调用,通常遵循一个不同的模式:在我的代码,例如,这个函数是用来确定一个对象大小是2的幂,或找到下一个2的力量大于或等于物体的大小。我的猜测是,大多数使用MSB的应用程序都包含比整数所代表的最大数字小得多的数字(对象大小很少使用size_t中的所有位)。在这种情况下,我的解决方案实际上会比二进制搜索方法执行得更差——因此应该选择后者,尽管我的解决方案将在所有整数中循环得更快。现实生活中的整数可能会倾向于这个简单算法最坏的情况,这最终会使它表现得更差——尽管对于真正任意的整数,它是被平摊的O(1)。

1The argument goes like this (rough draft):Let n be the number of bits (bit-width). There are a total of 2n integers wich can be represented with n bits. There are 2n - 1 integers starting with a 1 (first 1 is fixed, remaining n - 1 bits can be anything). Those integers require only one interation of the loop to determine the MSB. Further, There are 2n - 2 integers starting with 01, requiring 2 iterations, 2n - 3 integers starting with 001, requiring 3 iterations, and so on.

论点是这样的(粗略的草稿):让n是位(位宽)的个数。总共有2n个整数可以用n位来表示。从1开始有2n - 1个整数(第一个1是固定的,剩下的n - 1位可以是任何数)。这些整数只需要循环的一次交互就可以确定MSB。此外,有2n - 2个整数从01开始,需要2次迭代,2n - 3个整数从001开始,需要3次迭代,以此类推。

If we sum up all the required iterations for all possible integers and divide them by 2n, the total number of integers, we get the average number of iterations needed for determining the MSB for n-bit integers:

如果我们对所有可能的整数进行必要的迭代,并将它们除以2n,即整数的总数,我们就得到了确定n位整数的MSB所需的平均迭代次数:

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 + ... + n) / 2n

(1 * 2n - 1 + 2 * 2n - 2 + 3 * 2n - 3 +…)n + n)/ 2

This series of average iterations is actually convergent and has a limit of 2 for n towards infinity

这一系列的平均迭代实际上是收敛的,n趋于无穷时的极限是2

Thus, the naïve left-to-right algorithm has actually an amortized constant time complexity of O(1) for any number of bits.

因此,从左到右的朴素算法实际上对任意数量的比特都有O(1)的平摊常数时间复杂度。

#22


0  

The code:

代码:

    // x>=1;    unsigned func(unsigned x) {    double d = x ;    int p= (*reinterpret_cast<long long*>(&d) >> 52) - 1023;    printf( "The left-most non zero bit of %d is bit %d\n", x, p);    }

Or get the integer part of FPU instruction FYL2X (Y*Log2 X) by setting Y=1

或者通过设置Y=1获得FPU指令FYL2X (Y*Log2 X)的整数部分

#23


0  

Woaw, that was many answers. I am not sorry for answering on an old question.

啊,有很多答案。我不为回答一个老问题而感到遗憾。

int result = 0;//could be a char or int8_t insteadif(value){//this assumes the value is 64bit    if(0xFFFFFFFF00000000&value){  value>>=(1<<5); result|=(1<<5);  }//if it is 32bit then remove this line    if(0x00000000FFFF0000&value){  value>>=(1<<4); result|=(1<<4);  }//and remove the 32msb    if(0x000000000000FF00&value){  value>>=(1<<3); result|=(1<<3);  }    if(0x00000000000000F0&value){  value>>=(1<<2); result|=(1<<2);  }    if(0x000000000000000C&value){  value>>=(1<<1); result|=(1<<1);  }    if(0x0000000000000002&value){  result|=(1<<0);  }}else{  result=-1;}

This answer is pretty similar to another answer... oh well.

这个答案和另一个答案很相似……哦。

#24


0  

I assume your question is for an integer (called v below) and not an unsigned integer.

我假设您的问题是一个整数(以下称为v),而不是无符号整数。

int v = 612635685; // whatever value you wishunsigned int get_msb(int v){    int r = 31;                         // maximum number of iteration until integer has been totally left shifted out, considering that first bit is index 0. Also we could use (sizeof(int)) << 3 - 1 instead of 31 to make it work on any platform.    while (!(v & 0x8000000) && r--) {   // mask of the highest bit        v <<= 1;                        // multiply integer by 2.    }    return r;                           // will even return -1 if no bit was set, allowing error catch}

If you want to make it work without taking into account the sign you can add an extra 'v <<= 1;' before the loop (and change r value to 30 accordingly).Please let me know if I forgot anything. I haven't tested it but it should work just fine.

如果您想让它在不考虑符号的情况下工作,您可以在循环之前添加一个额外的'v <<= 1;'(并相应地将r值更改为30)。如果我忘了什么,请告诉我。我还没有对它进行测试,但它应该可以正常工作。

#25


-1  

Another poster provided a lookup-table using a byte-wide lookup. In case you want to eke out a bit more performance (at the cost of 32K of memory instead of just 256 lookup entries) here is a solution using a 15-bit lookup table, in C# 7 for .NET.

另一张海报提供了一个使用字节范围查找的查找表。如果您希望获得更多的性能(以32K内存而不是256个查找项为代价),这里有一个使用15位查找表的解决方案,在c# 7中的。net中。

The interesting part is initializing the table. Since it's a relatively small block that we want for the lifetime of the process, I allocate unmanaged memory for this by using Marshal.AllocHGlobal. As you can see, for maximum performance, the whole example is written as native:

有趣的部分是初始化表。由于它是一个相对较小的块,我们希望在整个过程中使用它,所以我使用Marshal.AllocHGlobal来分配非托管内存。正如您所看到的,为了获得最大的性能,整个示例是作为本机编写的:

readonly static byte[] msb_tab_15;// Initialize a table of 32768 bytes with the bit position (counting from LSB=0)// of the highest 'set' (non-zero) bit of its corresponding 16-bit index value.// The table is compressed by half, so use (value >> 1) for indexing.static MyStaticInit(){    var p = new byte[0x8000];    for (byte n = 0; n < 16; n++)        for (int c = (1 << n) >> 1, i = 0; i < c; i++)            p[c + i] = n;    msb_tab_15 = p;}

The table requires one-time initialization via the code above. It is read-only so a single global copy can be shared for concurrent access. With this table you can quickly look up the integer log2, which is what we're looking for here, for all the various integer widths (8, 16, 32, and 64 bits).

表需要通过上面的代码进行一次初始化。它是只读的,因此可以为并发访问共享一个全局副本。使用这个表,您可以快速查找整数log2,这就是我们在这里寻找的,对于所有的整数宽度(8、16、32和64位)。

Notice that the table entry for 0, the sole integer for which the notion of 'highest set bit' is undefined, is given the value -1. This distinction is necessary for proper handling of 0-valued upper words in the code below. Without further ado, here is the code for each of the various integer primitives:

注意,0的表条目(“最高集位”的概念未定义的唯一整数)的值为-1。这种区别对于正确处理下面代码中0值的大写字母是必要的。下面是各种整数原语的代码。

ulong (64-bit) Version

ulong(64位)的版本

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>public static int HighestOne(this ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 0x40) - 1;      // handles cases v==0 and MSB==63    int j = /**/ (int)((0xFFFFFFFFU - v /****/) >> 58) & 0x20;    j |= /*****/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 0x10;    return j + msb_tab_15[v >> (j + 1)];}

uint (32-bit) Version

单位(32位)版本

/// <summary> Index of the highest set bit in 'v', or -1 for value '0' </summary>public static int HighestOne(uint v){    if ((int)v <= 0)        return (int)((v >> 26) & 0x20) - 1;     // handles cases v==0 and MSB==31    int j = (int)((0x0000FFFFU - v) >> 27) & 0x10;    return j + msb_tab_15[v >> (j + 1)];}

Various overloads for the above

上面的各种重载。

public static int HighestOne(long v) => HighestOne((ulong)v);public static int HighestOne(int v) => HighestOne((uint)v);public static int HighestOne(ushort v) => msb_tab_15[v >> 1];public static int HighestOne(short v) => msb_tab_15[(ushort)v >> 1];public static int HighestOne(char ch) => msb_tab_15[ch >> 1];public static int HighestOne(sbyte v) => msb_tab_15[(byte)v >> 1];public static int HighestOne(byte v) => msb_tab_15[v >> 1];

This is a complete, working solution which represents the best performance on .NET 4.7.2 for numerous alternatives that I compared with a specialized performance test harness. Some of these are mentioned below. The test parameters were a uniform density of all 65 bit positions, i.e., 0 ... 31/63 plus value 0 (which produces result -1). The bits below the target index position were filled randomly. The tests were x64 only, release mode, with JIT-optimizations enabled.

这是一个完整的、可工作的解决方案,它代表了。net 4.7.2的最佳性能,我将其与专门的性能测试工具进行了比较。下面提到了其中的一些。测试参数为所有65位位置的均匀密度,即,0…31/63 + 0(产生结果-1)目标索引位置下方的位随机填充。测试仅为x64,释放模式,启用了jit优化。




That's the end of my formal answer here; what follows are some casual notes and links to source code for alternative test candidates associated with the testing I ran to validate the performance and correctness of the above code.

这就是我正式回答的结尾;下面是与我运行的测试相关的一些临时注释和源代码链接,以验证上述代码的性能和正确性。


The version provided above above, coded as Tab16A was a consistent winner over many runs. These various candidates, in active working/scratch form, can be found here, here, and here.

上面提供的版本编码为Tab16A,在许多次运行中都是一致的赢家。这些不同的候选人,以积极的工作/抓拍形式,可以在这里,这里和这里找到。

 1  candidates.HighestOne_Tab16A               622,496 2  candidates.HighestOne_Tab16C               628,234 3  candidates.HighestOne_Tab8A                649,146 4  candidates.HighestOne_Tab8B                656,847 5  candidates.HighestOne_Tab16B               657,147 6  candidates.HighestOne_Tab16D               659,650 7  _highest_one_bit_UNMANAGED.HighestOne_U    702,900 8  de_Bruijn.IndexOfMSB                       709,672 9  _old_2.HighestOne_Old2                     715,81010  _test_A.HighestOne8                        757,18811  _old_1.HighestOne_Old1                     757,92512  _test_A.HighestOne5  (unsafe)              760,38713  _test_B.HighestOne8  (unsafe)              763,90414  _test_A.HighestOne3  (unsafe)              766,43315  _test_A.HighestOne1  (unsafe)              767,32116  _test_A.HighestOne4  (unsafe)              771,70217  _test_B.HighestOne2  (unsafe)              772,13618  _test_B.HighestOne1  (unsafe)              772,52719  _test_B.HighestOne3  (unsafe)              774,14020  _test_A.HighestOne7  (unsafe)              774,58121  _test_B.HighestOne7  (unsafe)              775,46322  _test_A.HighestOne2  (unsafe)              776,86523  candidates.HighestOne_NoTab                777,69824  _test_B.HighestOne6  (unsafe)              779,48125  _test_A.HighestOne6  (unsafe)              781,55326  _test_B.HighestOne4  (unsafe)              785,50427  _test_B.HighestOne5  (unsafe)              789,79728  _test_A.HighestOne0  (unsafe)              809,56629  _test_B.HighestOne0  (unsafe)              814,99030  _highest_one_bit.HighestOne                824,34530  _bitarray_ext.RtlFindMostSignificantBit    894,06931  candidates.HighestOne_Naive                898,865

Notable is that the terrible performance of ntdll.dll!RtlFindMostSignificantBit via P/Invoke:

值得注意的是ntdll.dll的糟糕性能!通过P / Invoke RtlFindMostSignificantBit:

[DllImport("ntdll.dll"), SuppressUnmanagedCodeSecurity, SecuritySafeCritical]public static extern int RtlFindMostSignificantBit(ulong ul);

It's really too bad, because here's the entire actual function:

这太糟糕了,因为这是整个实际的函数:

    RtlFindMostSignificantBit:        bsr rdx, rcx          mov eax,0FFFFFFFFh          movzx ecx, dl          cmovne      eax,ecx          ret

I can't imagine the poor performance originating with these five lines, so the managed/native transition penalties must be to blame. I was also surprised that the testing really favored the 32KB (and 64KB) short (16-bit) direct-lookup tables over the 128-byte (and 256-byte) byte (8-bit) lookup tables. I thought the following would be more competitive with the 16-bit lookups, but the latter consistently outperformed this:

我无法想象这5条线带来的糟糕表现,所以管理/原生过渡的惩罚应该受到责备。我还感到惊讶的是,测试实际上更喜欢32KB(和64KB)短(16位)的直接查找表,而不是128字节(和256字节)的8位查找表。我认为下面的16位查找会更有竞争力,但后者的表现一直比这更好:

public static int HighestOne_Tab8A(ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 64) - 1;    int j;    j =  /**/ (int)((0xFFFFFFFFU - v) >> 58) & 32;    j += /**/ (int)((0x0000FFFFU - (v >> j)) >> 59) & 16;    j += /**/ (int)((0x000000FFU - (v >> j)) >> 60) & 8;    return j + msb_tab_8[v >> j];}

The last thing I'll point out is that I was quite shocked that my deBruijn method didn't fare better. This is the method that I had previously been using pervasively:

最后我要指出的是,我对我的deBruijn法没有取得更好的效果感到非常震惊。这是我以前广泛使用的方法:

const ulong N_bsf64 = 0x07EDD5E59A4E28C2,            N_bsr64 = 0x03F79D71B4CB0A89;readonly public static sbyte[]bsf64 ={    63,  0, 58,  1, 59, 47, 53,  2, 60, 39, 48, 27, 54, 33, 42,  3,    61, 51, 37, 40, 49, 18, 28, 20, 55, 30, 34, 11, 43, 14, 22,  4,    62, 57, 46, 52, 38, 26, 32, 41, 50, 36, 17, 19, 29, 10, 13, 21,    56, 45, 25, 31, 35, 16,  9, 12, 44, 24, 15,  8, 23,  7,  6,  5,},bsr64 ={     0, 47,  1, 56, 48, 27,  2, 60, 57, 49, 41, 37, 28, 16,  3, 61,    54, 58, 35, 52, 50, 42, 21, 44, 38, 32, 29, 23, 17, 11,  4, 62,    46, 55, 26, 59, 40, 36, 15, 53, 34, 51, 20, 43, 31, 22, 10, 45,    25, 39, 14, 33, 19, 30,  9, 24, 13, 18,  8, 12,  7,  6,  5, 63,};public static int IndexOfLSB(ulong v) =>    v != 0 ? bsf64[((v & (ulong)-(long)v) * N_bsf64) >> 58] : -1;public static int IndexOfMSB(ulong v){    if ((long)v <= 0)        return (int)((v >> 57) & 64) - 1;    v |= v >> 1; v |= v >> 2;  v |= v >> 4;   // does anybody know a better    v |= v >> 8; v |= v >> 16; v |= v >> 32;  // way than these 12 ops?    return bsr64[(v * N_bsr64) >> 58];}

There's much discussion of how superior and great deBruijn methods at this SO question, and I had tended to agree. My speculation is that, while both the deBruijn and direct lookup table methods (that I found to be fastest) both have to do a table lookup, and both have very minimal branching, only the deBruijn has a 64-bit multiply operation. I only tested the IndexOfMSB functions here--not the deBruijn IndexOfLSB--but I expect the latter to fare much better chance since it has so many fewer operations (see above), and I'll likely continue to use it for LSB.

关于德布鲁因在这个问题上的卓越和伟大的方法有很多讨论,我倾向于同意。我的推测是,尽管deBruijn和直接查找表方法(我发现最快)都必须进行表查找,而且它们的分支都非常少,只有deBruijn有一个64位的乘法操作。我只在这里测试了IndexOfMSB函数——而不是deBruijn IndexOfLSB——但是我希望后者能有更好的机会,因为它的操作更少(见上面),我可能会继续在LSB上使用它。

#26


-5  

One approach could be to keep shifting left till the number becomes negative.

一种方法是向左移动直到数字变成负数。

Here is the code:

这是代码:

Funct() {   int number; int count;  while(number > 0) {    number = number << 1;    count++;  }  printf("It is the no "%d" bit from the left", (count+1));    }