Is n negative, positive, or zero? Return 1, 2, or 4.

I'm building a PowerPC interpreter, and it works quite well. In the Power architecture the condition register CR0 (EFLAGS on x86) is updated on almost any instruction. It is set like this. The value of CR0 is 1, if the last result was negative, 2 if the last result was positive, 4 otherwise.

My first naive method to interpret this is:

if (n < 0)
    cr0 = 1;
else if (n > 0)
    cr0 = 2;
else
    cr0 = 4;

However, I understand that all those branches won't be optimal when run millions of times per second. I've seen some bit hacking on SO, but none of it seemed adequate. For example, I found many examples that convert a number to -1, 0, or 1 according to its sign. But how do I map -1 to 1, 1 to 2, and 0 to 4? I'm asking for the help of the Bit Hackers...

Thanks in advance

Update: First of all, thanks guys, you've been great. I'll test all of your code carefully for speed, and you'll be the first to know who the winner is.

@jalf: About your first piece of advice, I wasn't actually calculating CR0 on every instruction. Rather, I kept a lastResult variable and did the comparison when (and if) a following instruction asked for a flag. Three main motivations took me back to updating every time:

On PPC you're not forced to update CR0 as on x86 (where ADD always changes EFLAGS, even if not needed); there are two flavours of ADD, and only one of them updates CR0. If the compiler chooses the updating one, it means that it's going to use CR0 at some point, so there's no point in delaying...
There's a particularly painful instruction called mtcrf that lets you change CR0 arbitrarily. You can even set it to 7, which has no arithmetic meaning... This just destroys the possibility of keeping a "lastResult" variable.

8 Solutions

#1

First, if this variable is to be updated after (nearly) every instruction, the obvious piece of advice is this:

don't

Only update it when the subsequent instructions need its value. At any other time, there's no point in updating it.
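
One way to read this advice is lazy evaluation: remember the last result and derive CR0 only when an instruction reads it. Here is a minimal sketch, assuming a hypothetical `LazyCr0` struct (none of these names come from the question). A direct write, such as the one mtcrf performs, is handled by storing the raw value and marking the cache as current.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration; not taken from the question's interpreter.
struct LazyCr0 {
    std::int32_t last_result = 0;  // result of the last CR0-updating instruction
    std::uint32_t value = 4;       // cached CR0 value (4 corresponds to a zero result)
    bool stale = false;            // true when value must be recomputed

    // Called by instructions that update CR0: just remember the result.
    void record(std::int32_t result) {
        last_result = result;
        stale = true;
    }

    // Called by instructions that write CR0 directly (mtcrf, for example).
    void write(std::uint32_t raw) {
        assert(raw <= 0xFu);  // assumption: the field fits in 4 bits
        value = raw;
        stale = false;
    }

    // Called by instructions that read CR0: compute only if needed.
    std::uint32_t read() {
        if (stale) {
            value = last_result < 0 ? 1u : (last_result > 0 ? 2u : 4u);
            stale = false;
        }
        return value;
    }
};
```

Whether the extra flag check on every read pays off depends on how often CR0 is read compared to how often it is written, so this too needs measuring.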

But anyway, when we update it, what we want is this behavior:

R < 0  => CR0 == 0b001 
R > 0  => CR0 == 0b010
R == 0 => CR0 == 0b100

Ideally, we won't need to branch at all. Here's one possible approach:

1. Set CR0 to the value 1. (If you really want speed, investigate whether this can be done without fetching the constant from memory. Even if you have to spend a couple of instructions on it, it may well be worth it.)
2. If R >= 0, left shift by one bit.
3. If R == 0, left shift by one bit.

Steps 2 and 3 can be transformed to eliminate the "if" part:

CR0 <<= (R >= 0);
CR0 <<= (R == 0);

Is this faster? I don't know. As always, when you are concerned about performance, you need to measure, measure, measure.

However, I can see a couple of advantages of this approach:

we avoid branches completely
we avoid memory loads/stores.
the instructions we rely on (bit shifting and comparison) should have low latency, which isn't always the case for multiplication, for example.

The downside is that we have a dependency chain between all three lines: Each modifies CR0, which is then used in the next line. This limits instruction-level parallelism somewhat.

To minimize this dependency chain, we could do something like this instead:

CR0 <<= ((R >= 0) + (R == 0));

so we only have to modify CR0 once, after its initialization.

Or, doing everything in a single line:

CR0 = 1 << ((R >= 0) + (R == 0));

Of course, there are a lot of possible variations of this theme, so go ahead and experiment.
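
As a starting point for such experiments, here is a small sketch that wraps the single-line version in a function and checks it against the branching version from the question. The function names are made up for illustration, and a fixed 32-bit result type is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Reference version with branches, as in the question.
constexpr int cr0_branching(std::int32_t r) {
    return r < 0 ? 1 : (r > 0 ? 2 : 4);
}

// Shift version: start from 1, shift left once if r >= 0,
// and once more if r == 0. No branches at the source level;
// what the compiler emits still has to be inspected.
constexpr int cr0_shift(std::int32_t r) {
    return 1 << ((r >= 0) + (r == 0));
}

// Spot checks over the three cases plus the extremes.
inline void check_cr0_shift() {
    const std::int32_t samples[] = {INT32_MIN, -1, 0, 1, INT32_MAX};
    for (std::int32_t r : samples) {
        assert(cr0_shift(r) == cr0_branching(r));
    }
}
```

Calling `check_cr0_shift()` once at startup is enough to catch a typo in a new variant before benchmarking it.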

#2

Lots of answers that are approximately "don't" already, as usual :) You want the bit hack? You will get it. Then feel free to use it or not as you see fit.

You could use that mapping to -1, 0 and 1 (sign), and then do this:

return 7 & (0x241 >> ((sign(x) + 1) * 4));

Which is essentially using a tiny lookup table.
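
To spell out how the table is packed, here is a sketch with the constant decoded nibble by nibble. The `sign` helper is an assumption (the answer does not define it); any function returning -1, 0, or 1 would do.

```cpp
#include <cassert>
#include <cstdint>

// Assumed helper: returns -1, 0 or 1 according to the sign of x.
constexpr int sign(std::int32_t x) {
    return (x > 0) - (x < 0);
}

// 0x241 packs the three answers into nibbles:
//   bits 0..3  = 1  (sign == -1, shift by 0)
//   bits 4..7  = 4  (sign ==  0, shift by 4)
//   bits 8..11 = 2  (sign == +1, shift by 8)
constexpr int cr0_lookup(std::int32_t x) {
    return 7 & (0x241 >> ((sign(x) + 1) * 4));
}

// Spot checks for the three cases.
inline void check_cr0_lookup() {
    assert(cr0_lookup(-42) == 1);
    assert(cr0_lookup(0) == 4);
    assert(cr0_lookup(42) == 2);
}
```

The multiplication by 4 is a shift by two in practice, so the cost is dominated by however `sign` ends up being computed.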

Or the "naive bithack":

int y = ((x >> 31) & 1) | ((-x >> 31) & 2);
return (~(-y >> 31) & 4) | y;

The first line maps x < 0 to 1, x > 0 to 2 and x == 0 to 0. The second line then maps y == 0 to 4 and y != 0 to y.

And of course it has a sneaky edge case for x = 0x80000000 which is mapped to 3. Oops. Well let's fix that:

int y = ((x >> 31) & 1) | ((-x >> 31) & 2);
y &= 1 | ~(y << 1);  // remove the 2 if odd
return (~(-y >> 31) & 4) | y;
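
The edge case comes from negating 0x80000000, which overflows a signed 32-bit int (formally undefined behavior in C and C++), and right-shifting a negative signed value was implementation-defined before C++20. As a possible alternative, here is a sketch of the same idea done entirely in unsigned arithmetic. This is an adaptation for illustration, not the answer's code, and the function name is made up.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned variant of the bit hack: no signed negation, no signed shifts.
inline int cr0_unsigned(std::int32_t x) {
    const std::uint32_t ux = static_cast<std::uint32_t>(x);
    const std::uint32_t neg = ux >> 31;                  // 1 if x < 0
    const std::uint32_t pos = ((0u - ux) >> 31) & ~neg;  // 1 if x > 0
    const std::uint32_t y = neg | (pos << 1);            // 1, 2, or 0 when x == 0
    const std::uint32_t zero = (y - 1u) >> 31;           // 1 only when y == 0
    return static_cast<int>(y | (zero << 2));
}

// Spot checks, including the 0x80000000 edge case.
inline void check_cr0_unsigned() {
    assert(cr0_unsigned(INT32_MIN) == 1);
    assert(cr0_unsigned(-1) == 1);
    assert(cr0_unsigned(0) == 4);
    assert(cr0_unsigned(1) == 2);
    assert(cr0_unsigned(INT32_MAX) == 2);
}
```

Masking `pos` with `~neg` plays the same role as the "remove the 2 if odd" line: for 0x80000000 both sign tests fire, and the negative one wins.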

#3

The following expression is a little cryptic, but not excessively so, and it looks to be something the compiler can optimize pretty easily:

cr0 = 4 >> ((2 * (n < 0)) + (n > 0));

Here's what GCC 4.6.1 for an x86 target compiles it to with -O2:

xor ecx, ecx
mov eax, edx
sar eax, 31
and eax, 2
test    edx, edx
setg    cl
add ecx, eax
mov eax, 4
sar eax, cl

And VC 2010 with /Ox looks pretty similar:

xor ecx, ecx
test eax, eax
sets cl
xor edx, edx
test eax, eax
setg dl
mov eax, 4
lea ecx, DWORD PTR [edx+ecx*2]
sar eax, cl

The version using if tests compiles to assembly that uses jumps with either of these compilers. Of course, you'll never really be sure what any particular compiler is going to do with whatever particular bit of code you choose unless you actually examine the output. My expression is cryptic enough that, unless it was really a performance-critical bit of code, I might still go with the if statement version. Since you need to set the CR0 register frequently, I think it might be worth measuring whether this expression helps at all.

#4

I was working on this one when my computer crashed.

int cr0 = (-(n | n-1) >> 31) & 6;
cr0 |= (n >> 31) & 5;
cr0 ^= 4;

Here's the resulting assembly (for Intel x86):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 18
        mov     ecx, DWORD PTR _n$[esp-4]
        lea     eax, DWORD PTR [ecx-1]
        or      eax, ecx
        neg     eax
        sar     eax, 31                                 ; 0000001fH
; Line 19
        sar     ecx, 31                                 ; 0000001fH
        and     eax, 6
        and     ecx, 5
        or      eax, ecx
; Line 20
        xor     eax, 4
; Line 22
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky

And a complete exhaustive test which is also reasonably suitable for benchmarking:

#include <limits.h>

int direct(int n)
{
    int cr0;
    if (n < 0)
        cr0 = 1;
    else if (n > 0)
        cr0 = 2;
    else
        cr0 = 4;
    return cr0;
}

const int shift_count = sizeof(int) * CHAR_BIT - 1;
int tricky(int n)
{
    int cr0 = (-(n | n-1) >> shift_count) & 6;
    cr0 |= (n >> shift_count) & 5;
    cr0 ^= 4;
    return cr0;
}

#include <iostream>
#include <iomanip>
int main(void)
{
    int i = 0;
    do {
        if (direct(i) != tricky(i)) {
            std::cerr << std::hex << i << std::endl;
            return i;
        }
    } while (++i);
    return 0;
}

#5

GCC with no optimization:

        movl    %eax, 24(%esp)  ; eax has result of reading n
        cmpl    $0, 24(%esp)
        jns     .L2
        movl    $1, 28(%esp)
        jmp     .L3
.L2:
        cmpl    $0, 24(%esp)
        jle     .L4
        movl    $2, 28(%esp)
        jmp     .L3
.L4:
        movl    $4, 28(%esp)
.L3:

With -O2:

        movl    $1, %edx       ; edx = 1
        cmpl    $0, %eax
        jl      .L2            ; n < 0
        cmpl    $1, %eax       ; n < 1
        sbbl    %edx, %edx     ; edx = 0 or -1
        andl    $2, %edx       ; now 0 or 2
        addl    $2, %edx       ; now 2 or 4
.L2:
        movl    %edx, 4(%esp)

I don't think you are likely to do much better.

#6

If there is a faster method, the compiler probably already is using it.

Keep your code short and simple; that makes the optimizer most effective.

The simple straightforward solution does surprisingly well speed-wise:

cr0 = n? (n < 0)? 1: 2: 4;

x86 Assembly (produced by VC++ 2010, flags /Ox):

PUBLIC  ?tricky@@YAHH@Z                                 ; tricky
; Function compile flags: /Ogtpy
_TEXT   SEGMENT
_n$ = 8                                                 ; size = 4
?tricky@@YAHH@Z PROC                                    ; tricky
; Line 26
        mov     eax, DWORD PTR _n$[esp-4]
        test    eax, eax
        je      SHORT $LN3@tricky
        xor     ecx, ecx
        test    eax, eax
        setns   cl
        lea     eax, DWORD PTR [ecx+1]
; Line 31
        ret     0
$LN3@tricky:
; Line 26
        mov     eax, 4
; Line 31
        ret     0
?tricky@@YAHH@Z ENDP                                    ; tricky

#7

For a completely unportable approach, I wonder if this might have any speed benefit:

void func(signed n, signed& cr0) {
    cr0 = 1 << (!(unsigned(n)>>31)+(n==0));
}

mov         ecx,eax  ;with MSVC10, all optimizations except inlining on.
shr         ecx,1Fh  
not         ecx  
and         ecx,1  
xor         edx,edx  
test        eax,eax  
sete        dl  
mov         eax,1  
add         ecx,edx  
shl         eax,cl  
mov         ecx,dword ptr [cr0]  
mov         dword ptr [ecx],eax

Compared to your code on my machine:

test        eax,eax            ; if (n < 0)
jns         func+0Bh (401B1Bh)  
mov         dword ptr [ecx],1  ; cr0 = 1;
ret                            ; cr0 = 2; else cr0 = 4; }
xor         edx,edx            ; else if (n > 0)
test        eax,eax  
setle       dl  
lea         edx,[edx+edx+2]  
mov         dword ptr [ecx],edx ; cr0 = 2; else cr0 = 4; }
ret

I don't know much at all about assembly, so I can't say for sure whether this would have any benefit (or even whether mine has any jumps; I see no instructions beginning with j anyway). As always (and as everyone else has said a million times): PROFILE.

I doubt this is faster than say Jalf or Ben's, but I didn't see any that took advantage of the fact that on x86 all negative numbers have a certain bit set, and I figured I'd throw one out.

[EDIT]BenVoigt suggests cr0 = 4 >> ((n != 0) + (unsigned(n) >> 31)); to remove the logical negation, and my tests show that is a vast improvement.
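
For reference, here is that suggestion wrapped in a function with a few spot checks. This is only a sketch: the function name is made up, and `std::int32_t` is used so that the shift by 31 really isolates the sign bit.

```cpp
#include <cassert>
#include <cstdint>

// Start from 4, shift right once if n is nonzero,
// and once more if the sign bit of n is set.
constexpr int cr0_signbit(std::int32_t n) {
    return 4 >> ((n != 0) + (static_cast<std::uint32_t>(n) >> 31));
}

// Spot checks over the three cases plus the extremes.
inline void check_cr0_signbit() {
    assert(cr0_signbit(INT32_MIN) == 1);
    assert(cr0_signbit(-1) == 1);
    assert(cr0_signbit(0) == 4);
    assert(cr0_signbit(1) == 2);
    assert(cr0_signbit(INT32_MAX) == 2);
}
```

Compared to the version with the logical negation, the sign bit is added directly to the shift count, so no `not`/`and` pair is needed to invert it first.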

#8

The following is my attempt.

int cro = 4 >> (((n > 0) - (n < 0)) % 3 + (n < 0)*3);
