将特定的行/列/对角线与64位数字“隔离”

时间:2022-09-01 10:04:56

OK, let's consider a 64-bit number, with its bits forming a 8x8 table.

好的,让我们考虑一个64位的数字,它的位构成一个8x8表。

E.g.

如。

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0

written as

写成

a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1 
0 1 1 1 1 0 1 0 
0 1 1 0 1 0 1 0 
1 1 1 0 1 0 1 0 
0 1 1 0 1 0 1 0 
0 1 1 0 1 1 1 0 
0 1 1 0 1 0 1 0

Now, what if we want to isolate JUST e.g. column d (00100000) (or any row/diagonal for that matter) ?

现在,如果我们想要分离,例如d(00100000)列(或者任何行/对角线)?

Can this be done? And if so, how?

这个可以做吗?如果是这样,如何?


HINTS :

提示:

  • (a) My main objective here - though not initially mentioned - is raw speed. I'm searching for the fastest algorithm around, since the "retrieval" function is being performed some millions of times per second.

    (a)我在这里的主要目标——尽管最初没有提到——是原始速度。我正在寻找最快的算法,因为“检索”函数每秒被执行数百万次。

  • (b) This is what comes closer to what I mean : http://chessprogramming.wikispaces.com/Kindergarten+Bitboards

    (b)这更接近我的意思:http://chessprogramming.wikispaces.com/Kindergarten+Bitboards

9 个解决方案

#1


61  

Here's a solution with only 4 main steps:

这里有一个只有4个主要步骤的解决方案:

const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;

int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask;
    column *= magic;
    return (column >> 56) & 0xff;
}

It works like this:

是这样的:

  • the board is shifted to align the column with the left side
  • 将板子移动以使柱与左边对齐
  • it's masked to only contain the required column (0..8)
  • 它只包含所需的列(0.. .8)
  • it's multiplied by a magic number which results in all the original bits pushed to the left side
  • 它乘以一个神奇的数字,结果所有原始的比特被推到左边
  • the left-most byte is shifted to the right
  • 最左边的字节被移到右边。

The magic number is chosen to copy only the needed bits and let the rest fall into unused places / overflow over the number. The process looks like this (digits are bit "IDs", rather than the number itself):

魔术数字被选择只复制需要的位,让其余的落在未使用的地方/溢出的数字。过程是这样的(数字是位“id”,而不是数字本身):

original column: ...1.......2.......3.......4.......5.......6.......7.......8....
aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
shifted to right:........................................................12345678

If you add the const keywords, assembly becomes quite nice actually:

如果你添加const关键字,汇编实际上变得相当不错:

get_col:
.LFB7:
        .cfi_startproc
        movl    %esi, %ecx
        movabsq $-9187201950435737472, %rax
        salq    %cl, %rdi
        andq    %rax, %rdi
        movabsq $567382630219905, %rax
        imulq   %rax, %rdi
        shrq    $56, %rdi
        movl    %edi, %eax
        ret

No branching, no external data, around 0.4ns per calculation.

没有分支,没有外部数据,每计算大约0。4ns。

Edit: takes around 6th of the time using NPE's solution as baseline (next fastest one)

编辑:以NPE的解决方案为基线(下一个最快的),大约需要6个时间

#2


6  

Right, so to "settle" the debate as to which is faster/slower/etc, I've put all the code into one program [and I hope I've credited the right person for the right code-snippet].

好的,为了“解决”关于哪个更快/更慢/等等的争论,我把所有的代码都放到了一个程序中[我希望我为正确的代码片段归功于正确的人]。

The code can be found below, for inspection that I've intepreded the code correctly when I've made it into functions. I did run it wout proper output and check that each function gives the same result [bearing in mind that the order is slightly different in some cases - so I made a variation to run the other way of my code, just to see that it gives the "right" result]. So without further ado, here's the results:

代码可以在下面找到,以便检查当我将代码转换成函数时是否正确地插入了代码。我运行它wout适当的输出和检查每个函数给出相同的结果(记住在某些情况下顺序略有不同,所以我做了一个变异的其他方式运行我的代码,看看它给了“正确”的结果)。因此,事不宜迟,结果如下:

mats1 time in clocks per iteration 10.3457
mats2 time in clocks per iteration 10.4785
mats3 time in clocks per iteration 10.5538
viraptor time in clocks per iteration 6.24603
lemees time in clocks per iteration 14.4818
npe time in clocks per iteration 13.1455
alex time in clocks per iteration 24.8272

(viraptor's results from core i5, g++ 4.7)

(viraptor来自core i5, g++ 4.7)

mats1 time in clocks per iteration 7.62338
mats2 time in clocks per iteration 7.36226
mats3 time in clocks per iteration 7.45361
viraptor time in clocks per iteration 2.09582
lemees time in clocks per iteration 9.43744
npe time in clocks per iteration 7.51016
alex time in clocks per iteration 19.3554

(viraptor's results from core i5, clang++ 3.2)

(viraptor核心i5的结果,clang++ 3.2)

mats1 time in clocks per iteration 12.956
mats2 time in clocks per iteration 13.4395
mats3 time in clocks per iteration 13.3178
viraptor time in clocks per iteration 2.12914
lemees time in clocks per iteration 13.9267
npe time in clocks per iteration 16.2102
alex time in clocks per iteration 13.8705

That's clock-cycles on a 3.4GHz AMD Athlon2 - I don't have a modern Intel machine - if someone wishes to run the code on that, I'd be interested to see what it looks like. I'm fairly sure all of it runs well within the cache - perhaps aside from fetching some of the values in to check.

这是一台3.4GHz的AMD Athlon2上的时钟周期——我没有一台现代的英特尔机器——如果有人想在上面运行代码,我很想看看它是什么样子。我很确定所有的数据都在缓存中运行——也许除了提取一些值来检查。

So, the winner is clearly viraptor, by about 40% - "my" code is second. Alex's code doesn't have any jumps/branches, but it appears to run slower than the other alternatives still. Not sure why npe's results are that much slower than mine - it does almost the same thing (and the code looks very similar when looking at the assembler output from g++).

因此,胜出者显然是viraptor(约40%)——“我的”代码位居第二。Alex的代码没有任何跳转/分支,但是它的运行速度似乎仍然比其他选项慢。不知道为什么npe的结果比我的慢得多——它几乎做了同样的事情(当查看g++的汇编程序输出时,代码看起来非常相似)。

#include <iostream>
#include <fstream>
#include <cstdint>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];

ofstream nulloutput;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats1(uint64_t val, int col)
{
    return BITA_TO_B(val, 56+col, 7) |
    BITA_TO_B(val, 48+col, 6) |
    BITA_TO_B(val, 40+col, 5) |
    BITA_TO_B(val, 32+col, 4) |
    BITA_TO_B(val, 24+col, 3) |
    BITA_TO_B(val, 16+col, 2) |
    BITA_TO_B(val, 8+col, 1) |
    BITA_TO_B(val, 0+col, 0);
}

unsigned char get_col_mats2(uint64_t val, int col)
{
    return BITA_TO_B(val, 63-col, 7) |
    BITA_TO_B(val, 55-col, 6) |
    BITA_TO_B(val, 47-col, 5) |
    BITA_TO_B(val, 39-col, 4) |
    BITA_TO_B(val, 31-col, 3) |
    BITA_TO_B(val, 23-col, 2) |
    BITA_TO_B(val, 15-col, 1) |
    BITA_TO_B(val, 7-col, 0);
}


unsigned char get_col_viraptor(uint64_t board, int col) {
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull ;
    uint64_t column = board & (column_mask >> col);
    column <<= col;
    column *= magic;
    return (column >> 56) & 0xff;
}


unsigned char get_col_alex(uint64_t bitboard, int col)
{
    unsigned char result;
    result |= (bitboard & (1ULL << 63-col)) ? 0x80 : 0;
    result |= (bitboard & (1ULL << 55-col)) ? 0x40 : 0;
    result |= (bitboard & (1ULL << 47-col)) ? 0x20 : 0;
    result |= (bitboard & (1ULL << 39-col)) ? 0x10 : 0;
    result |= (bitboard & (1ULL << 31-col)) ? 0x08 : 0;
    result |= (bitboard & (1ULL << 23-col)) ? 0x04 : 0;
    result |= (bitboard & (1ULL << 15-col)) ? 0x02 : 0;
    result |= (bitboard & (1ULL << 7-col))  ? 0x01 : 0;

    return result;
}

unsigned char get_col_lemees(uint64_t val, int column)
{
    int result = 0;
    int source_bitpos = 7 - column; // "point" to last entry in this column
    for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
    {
    bool bit = (val >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
    }
    return result;
}

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col_npe(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}



#define BITA_TO_B2(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats3(uint64_t val, int col)
{
    return BITA_TO_B2(val, 63-col, 7) |
    BITA_TO_B2(val, 55-col, 6) |
    BITA_TO_B2(val, 47-col, 5) |
    BITA_TO_B2(val, 39-col, 4) |
    BITA_TO_B2(val, 31-col, 3) |
    BITA_TO_B2(val, 23-col, 2) |
    BITA_TO_B2(val, 15-col, 1) |
    BITA_TO_B2(val, 7-col, 0);
}

template<unsigned char (*f)(uint64_t val, int col)>
void runbench(const char *name)
{
    unsigned char col[8]  = {0};
    uint64_t long t = rdtsc();
    for(int j = 0; j < SIZE; j++)
    {
    uint64_t val = g_val[j];
    for(int i = 0; i < 8; i++)
    {
        col[i] += f(val, i);
    }
//  __asm__ __volatile__("":::"memory");
    }
    t = rdtsc() - t;
    for(int i = 0; i < 8; i++)
    {
    nulloutput<< "col " << i << " has bits " << hex << (int)col[i] << endl;
    }
    cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
}

#define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }

BM(mats1);
BM(mats2);
BM(mats3);
BM(viraptor);
BM(lemees);
BM(npe);
BM(alex);

struct function
{
    void (*func)(void);
    const char *name;
};


#define FUNC(f) { bench_##f, #f }

function funcs[] = 
{
    FUNC(mats1),
    FUNC(mats2),
    FUNC(mats3),
    FUNC(viraptor),
    FUNC(lemees),
    FUNC(npe),
    FUNC(alex),
}; 


int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;

    nulloutput.open("/dev/nul");
    for(i = 0; i < SIZE; i++)
    {
    g_val[i] = rand() + ((long)rand() << 32L);
    }
    unsigned char col[8];

    for(i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
    funcs[i].func();
    }
}

#3


2  

Code it up with straightforward loops and let the optimizer do the inlining and loop unrolling for you.

用简单的循环进行编码,让优化器为您进行内联和循环展开。

Compiled using 4.7.2 with -O3, on my box the following can perform about 300 million get_col() calls per second.

使用4.7.2和-O3编译,在我的框中,以下代码每秒可以执行大约3亿个get_col()调用。

bitboard.cpp:

bitboard.cpp:

#include <cinttypes>
#include <iostream>

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}

extern uint64_t board;
extern int sum;

extern void f();

int main() {
  for (int i = 0; i < 40000000; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += get_col(board, j);
    }
    f();
  }
  std::cout << sum << std::endl;
}

bitboard_b.cpp:

bitboard_b.cpp:

#include <cinttypes>

uint64_t board = 0x1234567890ABCDEFull;
int sum = 0;

void f() {}

If you look at the assembly code for get_col(), you'll see that it contains zero loops and is probably as efficient as anything you're likely to handcraft:

如果您查看get_col()的汇编代码,您会发现它包含零循环,并且可能与您手工编写的任何代码一样高效:

__Z7get_colyi:
LFB1248:
        movl    %esi, %ecx
        movq    %rdi, %rax
        movq    %rdi, %rdx
        shrq    %cl, %rax
        leal    8(%rsi), %ecx
        andl    $1, %eax
        shrq    %cl, %rdx
        leal    16(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    24(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    32(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    40(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    48(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    56(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %eax
        shrq    %cl, %rdi
        andl    $1, %edi
        leal    (%rdi,%rax,2), %eax
        ret

This is not meant a complete implementation, just an rough illustration of the idea. In particular, the ordering of bits may be the opposite of what you expect, etc.

这并不是指一个完整的实现,只是对这个想法的粗略说明。特别是,位的顺序可能与您期望的相反,等等。

#4


1  

In your case (specialized for 8x8 = 64 bit tables), you need to perform bitshifting to extract the specific bits and re-arrange them in a new integer variable, also using bitshifting:

在您的情况下(专用于8x8 = 64位表),您需要执行位移来提取特定的位,并将它们重新排列到一个新的整型变量中,也使用位移:

uint64_t matrix = ... // input
int column = 3; // "d"-column

int result = 0;
int source_bitpos = 7 - column; // "point" to last entry in this column
for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
{
    bool bit = (matrix >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
}

See here: http://ideone.com/UlWAAO

在这里看到的:http://ideone.com/UlWAAO

#5


1  

#define BITA_TO_B(x, a, b) (((x) >> (a)) & 1) << b;

unsigned long long x = 0x6A6B7A6AEA6E6A;

unsigned char col_d = BITA_TO_B(x, 60, 7) |
                      BITA_TO_B(x, 52, 6) |
                      BITA_TO_B(x, 44, 5) |
                      BITA_TO_B(x, 36, 4) |
                      BITA_TO_B(x, 28, 3) |
                      BITA_TO_B(x, 20, 2) |
                      BITA_TO_B(x, 12, 1) |
                      BITA_TO_B(x,  4, 0);

Perhaps a somewhat more optimized idea:

也许有一个更优化的想法:

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b));

If b is a constant, this will perform slightly better.

如果b是一个常数,它的表现会稍微好一点。

Another way may be:

另一种可能是:

unsigned long xx = x & 0x10101010101010; 
col_d = (xx >> 53) | (xx >> 46) | (xx >> 39) ... (xx >> 4); 

Doing one "and" rather than many helps speed it up.

做一个“和”,而不是很多有助于加速它。

#6


1  

You can transpose the number, and then simply select the relevant column, which is now a row, and therefore contiguous bits, as you wanted.

你可以转置数字,然后简单地选择相关的列,它现在是一个行,因此是连续的位,正如你想要的。

In my tests it wasn't much better than ORing together 8 individually selected bits, but it is much better if you intend to select multiple columns (since the transpose is the limiting factor).

在我的测试中,它并不比将8个单独选择的位混合在一起好多少,但是如果您想要选择多个列(因为转置是限制因素),它会好得多。

#7


1  

Here's a solution that can execute once a cycle (if the value and mask are in registers), if you are willing to use the intrinsic for the PEXT instruction on Intel (and if you are doing bitboard stuff, you likely are):

这里有一个解决方案,可以在一个周期内执行一次(如果值和掩码在寄存器中),如果您愿意使用关于Intel的PEXT指令的内部指令(如果您正在做位板的事情,您很可能是这样):

int get_col(uint64_t board) {
    return _pext_u64(board, 0x8080808080808080ull);
}

That's for the 0th column - if you want another one just shift the mask appropriately. Of course, this is cheating by using a hardware specific instruction, but bitboards are all about cheating.

这是第0列——如果你想要另一个的话,只需适当地移动蒙版。当然,这是通过使用特定于硬件的指令来作弊,但是位板都是关于作弊的。

#8


0  

How about this...

这个怎么样…

uint64_t bitboard = ...;
uint8_t result = 0;

result |= (bitboard & (1ULL << 60)) ? 0x80 : 0;
result |= (bitboard & (1ULL << 52)) ? 0x40 : 0;
result |= (bitboard & (1ULL << 44)) ? 0x20 : 0;
result |= (bitboard & (1ULL << 36)) ? 0x10 : 0;
result |= (bitboard & (1ULL << 28)) ? 0x08 : 0;
result |= (bitboard & (1ULL << 20)) ? 0x04 : 0;
result |= (bitboard & (1ULL << 12)) ? 0x02 : 0;
result |= (bitboard & (1ULL <<  4)) ? 0x01 : 0;

#9


0  

This one is from the Chess Programming Wiki. It transposes the board, after which isolating a single row is trivial. It also lets you go back the other way.

这是国际象棋程序的*。它改变了板子,在此之后隔离单行是很简单的。它也让你回到另一个方向。

/**
 * Flip a bitboard about the antidiagonal a8-h1.
 * Square a1 is mapped to h8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about antidiagonal a8-h1
 */
U64 flipDiagA8H1(U64 x) {
   U64 t;
   const U64 k1 = C64(0xaa00aa00aa00aa00);
   const U64 k2 = C64(0xcccc0000cccc0000);
   const U64 k4 = C64(0xf0f0f0f00f0f0f0f);
   t  =       x ^ (x << 36) ;
   x ^= k4 & (t ^ (x >> 36));
   t  = k2 & (x ^ (x << 18));
   x ^=       t ^ (t >> 18) ;
   t  = k1 & (x ^ (x <<  9));
   x ^=       t ^ (t >>  9) ;
   return x;
}

#1


61  

Here's a solution with only 4 main steps:

这里有一个只有4个主要步骤的解决方案:

const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;

int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask;
    column *= magic;
    return (column >> 56) & 0xff;
}

It works like this:

是这样的:

  • the board is shifted to align the column with the left side
  • 将板子移动以使柱与左边对齐
  • it's masked to only contain the required column (0..8)
  • 它只包含所需的列(0.. .8)
  • it's multiplied by a magic number which results in all the original bits pushed to the left side
  • 它乘以一个神奇的数字,结果所有原始的比特被推到左边
  • the left-most byte is shifted to the right
  • 最左边的字节被移到右边。

The magic number is chosen to copy only the needed bits and let the rest fall into unused places / overflow over the number. The process looks like this (digits are bit "IDs", rather than the number itself):

魔术数字被选择只复制需要的位,让其余的落在未使用的地方/溢出的数字。过程是这样的(数字是位“id”,而不是数字本身):

original column: ...1.......2.......3.......4.......5.......6.......7.......8....
aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
shifted to right:........................................................12345678

If you add the const keywords, assembly becomes quite nice actually:

如果你添加const关键字,汇编实际上变得相当不错:

get_col:
.LFB7:
        .cfi_startproc
        movl    %esi, %ecx
        movabsq $-9187201950435737472, %rax
        salq    %cl, %rdi
        andq    %rax, %rdi
        movabsq $567382630219905, %rax
        imulq   %rax, %rdi
        shrq    $56, %rdi
        movl    %edi, %eax
        ret

No branching, no external data, around 0.4ns per calculation.

没有分支,没有外部数据,每计算大约0。4ns。

Edit: takes around 6th of the time using NPE's solution as baseline (next fastest one)

编辑:以NPE的解决方案为基线(下一个最快的),大约需要6个时间

#2


6  

Right, so to "settle" the debate as to which is faster/slower/etc, I've put all the code into one program [and I hope I've credited the right person for the right code-snippet].

好的,为了“解决”关于哪个更快/更慢/等等的争论,我把所有的代码都放到了一个程序中[我希望我为正确的代码片段归功于正确的人]。

The code can be found below, for inspection that I've intepreded the code correctly when I've made it into functions. I did run it wout proper output and check that each function gives the same result [bearing in mind that the order is slightly different in some cases - so I made a variation to run the other way of my code, just to see that it gives the "right" result]. So without further ado, here's the results:

代码可以在下面找到,以便检查当我将代码转换成函数时是否正确地插入了代码。我运行它wout适当的输出和检查每个函数给出相同的结果(记住在某些情况下顺序略有不同,所以我做了一个变异的其他方式运行我的代码,看看它给了“正确”的结果)。因此,事不宜迟,结果如下:

mats1 time in clocks per iteration 10.3457
mats2 time in clocks per iteration 10.4785
mats3 time in clocks per iteration 10.5538
viraptor time in clocks per iteration 6.24603
lemees time in clocks per iteration 14.4818
npe time in clocks per iteration 13.1455
alex time in clocks per iteration 24.8272

(viraptor's results from core i5, g++ 4.7)

(viraptor来自core i5, g++ 4.7)

mats1 time in clocks per iteration 7.62338
mats2 time in clocks per iteration 7.36226
mats3 time in clocks per iteration 7.45361
viraptor time in clocks per iteration 2.09582
lemees time in clocks per iteration 9.43744
npe time in clocks per iteration 7.51016
alex time in clocks per iteration 19.3554

(viraptor's results from core i5, clang++ 3.2)

(viraptor核心i5的结果,clang++ 3.2)

mats1 time in clocks per iteration 12.956
mats2 time in clocks per iteration 13.4395
mats3 time in clocks per iteration 13.3178
viraptor time in clocks per iteration 2.12914
lemees time in clocks per iteration 13.9267
npe time in clocks per iteration 16.2102
alex time in clocks per iteration 13.8705

That's clock-cycles on a 3.4GHz AMD Athlon2 - I don't have a modern Intel machine - if someone wishes to run the code on that, I'd be interested to see what it looks like. I'm fairly sure all of it runs well within the cache - perhaps aside from fetching some of the values in to check.

这是一台3.4GHz的AMD Athlon2上的时钟周期——我没有一台现代的英特尔机器——如果有人想在上面运行代码,我很想看看它是什么样子。我很确定所有的数据都在缓存中运行——也许除了提取一些值来检查。

So, the winner is clearly viraptor, by about 40% - "my" code is second. Alex's code doesn't have any jumps/branches, but it appears to run slower than the other alternatives still. Not sure why npe's results are that much slower than mine - it does almost the same thing (and the code looks very similar when looking at the assembler output from g++).

因此,胜出者显然是viraptor(约40%)——“我的”代码位居第二。Alex的代码没有任何跳转/分支,但是它的运行速度似乎仍然比其他选项慢。不知道为什么npe的结果比我的慢得多——它几乎做了同样的事情(当查看g++的汇编程序输出时,代码看起来非常相似)。

#include <iostream>
#include <fstream>
#include <cstdint>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];

ofstream nulloutput;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats1(uint64_t val, int col)
{
    return BITA_TO_B(val, 56+col, 7) |
    BITA_TO_B(val, 48+col, 6) |
    BITA_TO_B(val, 40+col, 5) |
    BITA_TO_B(val, 32+col, 4) |
    BITA_TO_B(val, 24+col, 3) |
    BITA_TO_B(val, 16+col, 2) |
    BITA_TO_B(val, 8+col, 1) |
    BITA_TO_B(val, 0+col, 0);
}

unsigned char get_col_mats2(uint64_t val, int col)
{
    return BITA_TO_B(val, 63-col, 7) |
    BITA_TO_B(val, 55-col, 6) |
    BITA_TO_B(val, 47-col, 5) |
    BITA_TO_B(val, 39-col, 4) |
    BITA_TO_B(val, 31-col, 3) |
    BITA_TO_B(val, 23-col, 2) |
    BITA_TO_B(val, 15-col, 1) |
    BITA_TO_B(val, 7-col, 0);
}


unsigned char get_col_viraptor(uint64_t board, int col) {
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull ;
    uint64_t column = board & (column_mask >> col);
    column <<= col;
    column *= magic;
    return (column >> 56) & 0xff;
}


unsigned char get_col_alex(uint64_t bitboard, int col)
{
    unsigned char result;
    result |= (bitboard & (1ULL << 63-col)) ? 0x80 : 0;
    result |= (bitboard & (1ULL << 55-col)) ? 0x40 : 0;
    result |= (bitboard & (1ULL << 47-col)) ? 0x20 : 0;
    result |= (bitboard & (1ULL << 39-col)) ? 0x10 : 0;
    result |= (bitboard & (1ULL << 31-col)) ? 0x08 : 0;
    result |= (bitboard & (1ULL << 23-col)) ? 0x04 : 0;
    result |= (bitboard & (1ULL << 15-col)) ? 0x02 : 0;
    result |= (bitboard & (1ULL << 7-col))  ? 0x01 : 0;

    return result;
}

unsigned char get_col_lemees(uint64_t val, int column)
{
    int result = 0;
    int source_bitpos = 7 - column; // "point" to last entry in this column
    for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
    {
    bool bit = (val >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
    }
    return result;
}

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col_npe(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}



#define BITA_TO_B2(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats3(uint64_t val, int col)
{
    return BITA_TO_B2(val, 63-col, 7) |
    BITA_TO_B2(val, 55-col, 6) |
    BITA_TO_B2(val, 47-col, 5) |
    BITA_TO_B2(val, 39-col, 4) |
    BITA_TO_B2(val, 31-col, 3) |
    BITA_TO_B2(val, 23-col, 2) |
    BITA_TO_B2(val, 15-col, 1) |
    BITA_TO_B2(val, 7-col, 0);
}

template<unsigned char (*f)(uint64_t val, int col)>
void runbench(const char *name)
{
    unsigned char col[8]  = {0};
    uint64_t long t = rdtsc();
    for(int j = 0; j < SIZE; j++)
    {
    uint64_t val = g_val[j];
    for(int i = 0; i < 8; i++)
    {
        col[i] += f(val, i);
    }
//  __asm__ __volatile__("":::"memory");
    }
    t = rdtsc() - t;
    for(int i = 0; i < 8; i++)
    {
    nulloutput<< "col " << i << " has bits " << hex << (int)col[i] << endl;
    }
    cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
}

#define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }

BM(mats1);
BM(mats2);
BM(mats3);
BM(viraptor);
BM(lemees);
BM(npe);
BM(alex);

struct function
{
    void (*func)(void);
    const char *name;
};


#define FUNC(f) { bench_##f, #f }

function funcs[] = 
{
    FUNC(mats1),
    FUNC(mats2),
    FUNC(mats3),
    FUNC(viraptor),
    FUNC(lemees),
    FUNC(npe),
    FUNC(alex),
}; 


int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;

    nulloutput.open("/dev/nul");
    for(i = 0; i < SIZE; i++)
    {
    g_val[i] = rand() + ((long)rand() << 32L);
    }
    unsigned char col[8];

    for(i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
    funcs[i].func();
    }
}

#3


2  

Code it up with straightforward loops and let the optimizer do the inlining and loop unrolling for you.

用简单的循环进行编码,让优化器为您进行内联和循环展开。

Compiled using 4.7.2 with -O3, on my box the following can perform about 300 million get_col() calls per second.

使用4.7.2和-O3编译,在我的框中,以下代码每秒可以执行大约3亿个get_col()调用。

bitboard.cpp:

bitboard.cpp:

#include <cinttypes>
#include <iostream>

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}

extern uint64_t board;
extern int sum;

extern void f();

int main() {
  for (int i = 0; i < 40000000; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += get_col(board, j);
    }
    f();
  }
  std::cout << sum << std::endl;
}

bitboard_b.cpp:

bitboard_b.cpp:

#include <cinttypes>

uint64_t board = 0x1234567890ABCDEFull;
int sum = 0;

void f() {}

If you look at the assembly code for get_col(), you'll see that it contains zero loops and is probably as efficient as anything you're likely to handcraft:

如果您查看get_col()的汇编代码,您会发现它包含零循环,并且可能与您手工编写的任何代码一样高效:

__Z7get_colyi:
LFB1248:
        movl    %esi, %ecx
        movq    %rdi, %rax
        movq    %rdi, %rdx
        shrq    %cl, %rax
        leal    8(%rsi), %ecx
        andl    $1, %eax
        shrq    %cl, %rdx
        leal    16(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    24(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    32(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    40(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    48(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    56(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %eax
        shrq    %cl, %rdi
        andl    $1, %edi
        leal    (%rdi,%rax,2), %eax
        ret

This is not meant a complete implementation, just an rough illustration of the idea. In particular, the ordering of bits may be the opposite of what you expect, etc.

这并不是指一个完整的实现,只是对这个想法的粗略说明。特别是,位的顺序可能与您期望的相反,等等。

#4


1  

In your case (specialized for 8x8 = 64 bit tables), you need to perform bitshifting to extract the specific bits and re-arrange them in a new integer variable, also using bitshifting:

在您的情况下(专用于8x8 = 64位表),您需要执行位移来提取特定的位,并将它们重新排列到一个新的整型变量中,也使用位移:

uint64_t matrix = ... // input
int column = 3; // "d"-column

int result = 0;
int source_bitpos = 7 - column; // "point" to last entry in this column
for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
{
    bool bit = (matrix >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
}

See here: http://ideone.com/UlWAAO

在这里看到的:http://ideone.com/UlWAAO

#5


1  

#define BITA_TO_B(x, a, b) (((x) >> (a)) & 1) << b;

unsigned long long x = 0x6A6B7A6AEA6E6A;

unsigned char col_d = BITA_TO_B(x, 60, 7) |
                      BITA_TO_B(x, 52, 6) |
                      BITA_TO_B(x, 44, 5) |
                      BITA_TO_B(x, 36, 4) |
                      BITA_TO_B(x, 28, 3) |
                      BITA_TO_B(x, 20, 2) |
                      BITA_TO_B(x, 12, 1) |
                      BITA_TO_B(x,  4, 0);

Perhaps a somewhat more optimized idea:

也许有一个更优化的想法:

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b));

If b is a constant, this will perform slightly better.

如果b是一个常数,它的表现会稍微好一点。

Another way may be:

另一种可能是:

unsigned long xx = x & 0x10101010101010; 
col_d = (xx >> 53) | (xx >> 46) | (xx >> 39) ... (xx >> 4); 

Doing one "and" rather than many helps speed it up.

做一个“和”,而不是很多有助于加速它。

#6


1  

You can transpose the number, and then simply select the relevant column, which is now a row, and therefore contiguous bits, as you wanted.

你可以转置数字,然后简单地选择相关的列,它现在是一个行,因此是连续的位,正如你想要的。

In my tests it wasn't much better than ORing together 8 individually selected bits, but it is much better if you intend to select multiple columns (since the transpose is the limiting factor).

在我的测试中,它并不比将8个单独选择的位混合在一起好多少,但是如果您想要选择多个列(因为转置是限制因素),它会好得多。

#7


1  

Here's a solution that can execute once a cycle (if the value and mask are in registers), if you are willing to use the intrinsic for the PEXT instruction on Intel (and if you are doing bitboard stuff, you likely are):

这里有一个解决方案,可以在一个周期内执行一次(如果值和掩码在寄存器中),如果您愿意使用关于Intel的PEXT指令的内部指令(如果您正在做位板的事情,您很可能是这样):

int get_col(uint64_t board) {
    return _pext_u64(board, 0x8080808080808080ull);
}

That's for the 0th column - if you want another one just shift the mask appropriately. Of course, this is cheating by using a hardware specific instruction, but bitboards are all about cheating.

这是第0列——如果你想要另一个的话,只需适当地移动蒙版。当然,这是通过使用特定于硬件的指令来作弊,但是位板都是关于作弊的。

#8


0  

How about this...

这个怎么样…

uint64_t bitboard = ...;
uint8_t result = 0;

result |= (bitboard & (1ULL << 60)) ? 0x80 : 0;
result |= (bitboard & (1ULL << 52)) ? 0x40 : 0;
result |= (bitboard & (1ULL << 44)) ? 0x20 : 0;
result |= (bitboard & (1ULL << 36)) ? 0x10 : 0;
result |= (bitboard & (1ULL << 28)) ? 0x08 : 0;
result |= (bitboard & (1ULL << 20)) ? 0x04 : 0;
result |= (bitboard & (1ULL << 12)) ? 0x02 : 0;
result |= (bitboard & (1ULL <<  4)) ? 0x01 : 0;

#9


0  

This one is from the Chess Programming Wiki. It transposes the board, after which isolating a single row is trivial. It also lets you go back the other way.

这是国际象棋程序的*。它改变了板子,在此之后隔离单行是很简单的。它也让你回到另一个方向。

/**
 * Flip a bitboard about the antidiagonal a8-h1.
 * Square a1 is mapped to h8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about antidiagonal a8-h1
 */
U64 flipDiagA8H1(U64 x) {
   U64 t;
   const U64 k1 = C64(0xaa00aa00aa00aa00);
   const U64 k2 = C64(0xcccc0000cccc0000);
   const U64 k4 = C64(0xf0f0f0f00f0f0f0f);
   t  =       x ^ (x << 36) ;
   x ^= k4 & (t ^ (x >> 36));
   t  = k2 & (x ^ (x << 18));
   x ^=       t ^ (t >> 18) ;
   t  = k1 & (x ^ (x <<  9));
   x ^=       t ^ (t >>  9) ;
   return x;
}