Why is processing a sorted array faster than processing an unsorted array?

Date: 2022-04-02 01:15:51

Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt(256);

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

The result was somewhat similar, but less extreme.


My first thought was that sorting brings the data into the cache, but then I thought how silly that is because the array was just generated.

  • What is going on?
  • Why is it faster to process a sorted array than an unsorted array?
  • The code is summing up some independent terms, and the order should not matter.

21 Answers

#1


27770  

You are a victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

[Image of a railroad junction] Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement: At the processor level, it is a branch instruction:

[Image: an if-statement compiled to a branch instruction]

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every 3 times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: "Branch predictor" article on Wikipedia.


As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless because it can't predict random data. Thus there will probably be around 50% misprediction. (no better than random guessing)

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)

So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)


Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - Netbeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops. (such as in this example)


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel Compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

#2


3498  

Branch prediction.

With a sorted array, the condition data[c] >= 128 is first false for a streak of values, then becomes true for all later values. That's easy to predict. With an unsorted array, you pay for the branching cost.

#3


2843  

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

Now, if we look at the code

if (data[c] >= 128)
    sum += data[c];

we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.

In C, and thus C++, the statement that compiles directly (without any optimization) into the conditional move instruction in x86 is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:

sum += data[c] >= 128 ? data[c] : 0;

While maintaining readability, we can check the speedup factor.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

x86

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71

x64

//  Branch - Random
seconds = 11.302

//  Branch - Sorted
seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.

Now let's look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.

max1 uses the conditional branch if... else ...:

int max1(int a, int b) {
    if (a > b)
        return a;
    else
        return b;
}

max2 uses the ternary operator ... ? ... : ...:

int max2(int a, int b) {
    return a > b ? a : b;
}

On a x86-64 machine, GCC -S generates the assembly below.

max1:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    -8(%rbp), %eax
    jle     .L2
    movl    -4(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp     .L4
.L2:
    movl    -8(%rbp), %eax
    movl    %eax, -12(%rbp)
.L4:
    movl    -12(%rbp), %eax
    leave
    ret

max2:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    %eax, -8(%rbp)
    cmovge  -8(%rbp), %eax
    leave
    ret

max2 uses much less code due to the usage of instruction cmovge. But the real gain is that max2 does not involve branch jumps, jmp, which would have a significant performance penalty if the predicted result is not right.

So why does a conditional move perform better?

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.

In a conditional move case, the execution of the conditional move instruction is divided into several stages, but the earlier stages, like Fetch and Decode, do not depend on the result of the previous instruction; only the later stages need the result. Thus, we wait only a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Some modern compilers can optimize our code into assembly with better performance; others cannot (the code in question uses Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when the outcome is unpredictable can help us write better-performing code when the scenario gets so complex that the compiler cannot optimize it automatically.

#4


1943  

If you are curious about even more optimizations that can be done to this code, consider this:

Starting with the original loop:

for (unsigned i = 0; i < 100000; ++i)
{
    for (unsigned j = 0; j < arraySize; ++j)
    {
        if (data[j] >= 128)
            sum += data[j];
    }
}

With loop interchange, we can safely change this loop to:

for (unsigned j = 0; j < arraySize; ++j)
{
    for (unsigned i = 0; i < 100000; ++i)
    {
        if (data[j] >= 128)
            sum += data[j];
    }
}

Then, you can see that the if conditional is constant throughout the execution of the i loop, so you can hoist the if out:

for (unsigned j = 0; j < arraySize; ++j)
{
    if (data[j] >= 128)
    {
        for (unsigned i = 0; i < 100000; ++i)
        {
            sum += data[j];
        }
    }
}

Then, you see that the inner loop can be collapsed into one single expression, assuming the floating point model allows it (for example, if /fp:fast is passed):

for (unsigned j = 0; j < arraySize; ++j)
{
    if (data[j] >= 128)
    {
        sum += data[j] * 100000;
    }
}

That one is 100,000x faster than before.


#5


1618  

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:

Sorted:

==32551== Branches:        656,645,130  (  656,609,208 cond +    35,922 ind)
==32551== Mispredicts:         169,556  (      169,095 cond +       461 ind)
==32551== Mispred rate:            0.0% (          0.0%     +       1.2%   )

Unsorted:

==32555== Branches:        655,996,082  (  655,960,160 cond +  35,922 ind)
==32555== Mispredicts:     164,073,152  (  164,072,692 cond +     460 ind)
==32555== Mispred rate:           25.0% (         25.0%     +     1.2%   )

Drilling down into the line-by-line output produced by cg_annotate we see for the loop in question:

Sorted:

          Bc    Bcm Bi Bim
      10,001      4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .      .  .   .      {
           .      .  .   .          // primary loop
 327,690,000 10,016  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .      .  .   .          {
 327,680,000 10,006  0   0              if (data[c] >= 128)
           0      0  0   0                  sum += data[c];
           .      .  .   .          }
           .      .  .   .      }

Unsorted:

          Bc         Bcm Bi Bim
      10,001           4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .           .  .   .      {
           .           .  .   .          // primary loop
 327,690,000      10,038  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .           .  .   .          {
 327,680,000 164,050,007  0   0              if (data[c] >= 128)
           0           0  0   0                  sum += data[c];
           .           .  .   .          }
           .           .  .   .      }

This lets you easily identify the problematic line - in the unsorted version the if (data[c] >= 128) line is causing 164,050,007 mispredicted conditional branches (Bcm) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.


Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.

perf stat ./sumtest_sorted

Sorted:

 Performance counter stats for './sumtest_sorted':

  11808.095776 task-clock                #    0.998 CPUs utilized          
         1,062 context-switches          #    0.090 K/sec                  
            14 CPU-migrations            #    0.001 K/sec                  
           337 page-faults               #    0.029 K/sec                  
26,487,882,764 cycles                    #    2.243 GHz                    
41,025,654,322 instructions              #    1.55  insns per cycle        
 6,558,871,379 branches                  #  555.455 M/sec                  
       567,204 branch-misses             #    0.01% of all branches        

  11.827228330 seconds time elapsed

Unsorted:

 Performance counter stats for './sumtest_unsorted':

  28877.954344 task-clock                #    0.998 CPUs utilized          
         2,584 context-switches          #    0.089 K/sec                  
            18 CPU-migrations            #    0.001 K/sec                  
           335 page-faults               #    0.012 K/sec                  
65,076,127,595 cycles                    #    2.253 GHz                    
41,032,528,741 instructions              #    0.63  insns per cycle        
 6,560,579,013 branches                  #  227.183 M/sec                  
 1,646,394,749 branch-misses             #   25.10% of all branches        

  28.935500947 seconds time elapsed

It can also do source code annotation with disassembly.

perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
 Percent |      Source code & Disassembly of sumtest_unsorted
------------------------------------------------
...
         :                      sum += data[c];
    0.00 :        400a1a:       mov    -0x14(%rbp),%eax
   39.97 :        400a1d:       mov    %eax,%eax
    5.31 :        400a1f:       mov    -0x20040(%rbp,%rax,4),%eax
    4.60 :        400a26:       cltq   
    0.00 :        400a28:       add    %rax,-0x30(%rbp)
...

See the perf tutorial for more details.

#6


1089  

I just read up on this question and its answers, and I feel an answer is missing.

A common way to eliminate branch mispredictions, which I've found to work particularly well in managed languages, is to use a table lookup instead of a branch (although I haven't tested it in this case).

This approach works in general if:

  1. It's a small table and is likely to be cached in the processor
  2. You are running things in a quite tight loop and/or the processor can pre-load the data

Background and why

Pfew, so what the hell is that supposed to mean?

From a processor perspective, your memory is slow. To compensate for the difference in speed, they build in a couple of caches in your processor (L1/L2 cache) that compensate for that. So imagine that you're doing your nice calculations and figure out that you need a piece of memory. The processor will get its 'load' operation and loads the piece of memory into cache - and then uses the cache to do the rest of the calculations. Because memory is relatively slow, this 'load' will slow down your program.

Like branch prediction, this was optimized in the Pentium processors: the processor predicts that it needs to load a piece of data and attempts to load that into the cache before the operation actually hits the cache. As we've already seen, branch prediction sometimes goes horribly wrong -- in the worst case scenario you need to go back and actually wait for a memory load, which will take forever (in other words: failing branch prediction is bad, a memory load after a branch prediction fail is just horrible!).

Fortunately for us, if the memory access pattern is predictable, the processor will load it in its fast cache and all is well.

The first thing we need to know is: what counts as small? While smaller is generally better, a rule of thumb is to stick to lookup tables that are <= 4096 bytes in size. As an upper limit: if your lookup table is larger than 64K, it's probably worth reconsidering.

Constructing a table

So we've figured out that we can create a small table. Next thing to do is get a lookup function in place. Lookup functions are usually small functions that use a couple of basic integer operations (and, or, xor, shift, add, remove and perhaps a multiply). You want to have your input translated by the lookup function to some kind of 'unique key' in your table, which then simply gives you the answer of all the work you wanted it to do.

In this case: >= 128 means we can keep the value, < 128 means we get rid of it. The easiest way to do that is by using an 'AND': if we keep it, we AND it with 7FFFFFFF; if we want to get rid of it, we AND it with 0. Notice also that 128 is a power of 2 -- so we can go ahead and make a table of 32768/128 integers and fill it with one zero and a lot of 7FFFFFFFF's.


Managed languages

You might wonder why this works well in managed languages. After all, managed languages check the boundaries of the arrays with a branch to ensure you don't mess up...

Well, not exactly... :-)

There has been quite some work on eliminating this branch for managed languages. For example:

for (int i=0; i<array.Length; ++i)
   // Use array[i]

In this case it's obvious to the compiler that the boundary condition will never be hit. At least the Microsoft JIT compiler (but I expect Java does similar things) will notice this and remove the check all together. WOW - that means no branch. Similarly, it will deal with other obvious cases.

If you run into trouble with lookups on managed languages - the key is to add a & 0x[something]FFF to your lookup function to make the boundary check predictable - and watch it going faster.

The result for this case

// Generate data
int arraySize = 32768;
int[] data = new int[arraySize];

Random rnd = new Random(0);
for (int c = 0; c < arraySize; ++c)
    data[c] = rnd.Next(256);

// To keep the spirit of the code intact, I'll make a separate lookup table
// (I assume we cannot modify 'data' or the number of loops)
int[] lookup = new int[256];

for (int c = 0; c < 256; ++c)
    lookup[c] = (c >= 128) ? c : 0;

// Test
DateTime startTime = System.DateTime.Now;
long sum = 0;

for (int i = 0; i < 100000; ++i)
{
    // Primary loop
    for (int j = 0; j < arraySize; ++j)
    {
        // Here you basically want to use simple operations - so no
        // random branches, but things like &, |, *, -, +, etc. are fine.
        sum += lookup[data[j]];
    }
}

DateTime endTime = System.DateTime.Now;
Console.WriteLine(endTime - startTime);
Console.WriteLine("sum = " + sum);

Console.ReadLine();

#7


974  

As the data is distributed between 0 and 255, when the array is sorted, around the first half of the iterations will not enter the if-statement (shown below).

if (data[c] >= 128)
    sum += data[c];

The question is: What makes the above statement not execute in certain cases as in case of sorted data? Here comes the "branch predictor". A branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance!

Let's do some benchmarking to understand it better.

The performance of an if-statement depends on whether its condition has a predictable pattern. If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive.

Let’s measure the performance of this loop with different conditions:

for (int i = 0; i < max; i++)
    if (condition)
        sum++;

Here are the timings of the loop with different true-false patterns:

Condition              Pattern              Time (ms)
(i & 0x80000000) == 0  T repeated           322
(i & 0xffffffff) == 0  F repeated           276
(i & 1) == 0           TF alternating       760
(i & 3) == 0           TFFFTFFF…            513
(i & 2) == 0           TTFFTTFF…            1675
(i & 4) == 0           TTTTFFFFTTTTFFFF…    1275
(i & 8) == 0           8T 8F 8T 8F …        752
(i & 16) == 0          16T 16F 16T 16F …    490

A “bad” true-false pattern can make an if-statement up to six times slower than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.


So there is no doubt about the impact of branch prediction on performance!

#8


914  

One way to avoid branch prediction errors is to build a lookup table, and index it using the data. Stefan de Bruijn discussed that in his answer.

But in this case, we know the values are in the range [0, 255], and we only care about values >= 128. That means we can easily extract a single bit that tells us whether we want a value or not: by shifting the data right by 7 bits, we are left with a 0 bit or a 1 bit, and we only want to add the value when we have a 1 bit. Let's call this bit the "decision bit".

By using the 0/1 value of the decision bit as an index into an array, we can make code that will be equally fast whether the data is sorted or not sorted. Our code will always add a value, but when the decision bit is 0, we will add the value somewhere we don't care about. Here's the code:

// Test
clock_t start = clock();
long long a[] = {0, 0};
long long sum;

for (unsigned i = 0; i < 100000; ++i)
{
    // Primary loop
    for (unsigned c = 0; c < arraySize; ++c)
    {
        int j = (data[c] >> 7);
        a[j] += data[c];
    }
}

double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
sum = a[1];

This code wastes half of the adds, but never has a branch prediction failure. It's tremendously faster on random data than the version with an actual if statement.

But in my testing, an explicit lookup table was slightly faster than this, probably because indexing into a lookup table was slightly faster than bit shifting. This shows how my code sets up and uses the lookup table (unimaginatively called lut for "LookUp Table" in the code). Here's the C++ code:

// declare and then fill in the lookup table
int lut[256];
for (unsigned c = 0; c < 256; ++c)
    lut[c] = (c >= 128) ? c : 0;

// use the lookup table after it is built
for (unsigned i = 0; i < 100000; ++i)
{
    // Primary loop
    for (unsigned c = 0; c < arraySize; ++c)
    {
        sum += lut[data[c]];
    }
}

In this case the lookup table was only 256 bytes, so it fit nicely in cache and all was fast. This technique wouldn't work well if the data was 24-bit values and we only wanted half of them... the lookup table would be far too big to be practical. On the other hand, we can combine the two techniques shown above: first shift the bits over, then index a lookup table. For a 24-bit value that we only want the top half value, we could potentially shift the data right by 12 bits, and be left with a 12-bit value for a table index. A 12-bit table index implies a table of 4096 values, which might be practical.
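
As a sketch of that combination (the names `lutmask`, `build_lut`, and `sum_top_half` and the exact setup are illustrative, not from the original answer): shift a 24-bit value right by 12 bits, then use the remaining 12 bits to index a 4096-entry table that answers "keep or drop" for that whole 4096-value bucket. Note this only works because the predicate here, "value in the top half of the 24-bit range", depends solely on the top bits:

```cpp
// 4096-entry table: entry t is all-ones when the 4096-value bucket starting
// at (t << 12) lies in the top half of the 24-bit range, zero otherwise.
unsigned lutmask[4096];

void build_lut()
{
    for (unsigned t = 0; t < 4096; ++t)
        lutmask[t] = (t >= 2048) ? ~0u : 0u;  // t >= 2048 <=> value >= 1 << 23
}

// Branch-free accumulation: mask each value with its table entry,
// so unwanted values contribute zero without any if statement.
unsigned long long sum_top_half(const unsigned* data, unsigned n)
{
    unsigned long long sum = 0;
    for (unsigned i = 0; i < n; ++i)
        sum += data[i] & lutmask[data[i] >> 12];
    return sum;
}
```

Storing a 0/all-ones mask (rather than 0/1) lets a single bitwise AND replace the conditional add, the same trick as the decision-bit version above.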

EDIT: One thing I forgot to put in.

The technique of indexing into an array, instead of using an if statement, can be used for deciding which pointer to use. I saw a library that implemented binary trees, and instead of having two named pointers (pLeft and pRight or whatever) had a length-2 array of pointers, and used the "decision bit" technique to decide which one to follow. For example, instead of:

if (x < node->value)
    node = node->pLeft;
else
    node = node->pRight;

this library would do something like:

i = (x < node->value);
node = node->link[i];

Here's a link to this code: Red Black Trees, Eternally Confuzzled
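
A minimal sketch of that pointer-array idea (the `Node` layout and `find` helper are illustrative; the Eternally Confuzzled red-black tree code uses the same `link[2]` trick, though which slot means "left" is a free convention):

```cpp
struct Node
{
    int value;
    Node* link[2];  // link[0] = left child, link[1] = right child
};

// Branch-free descent: the comparison result (0 or 1) selects the child,
// replacing "if (x < node->value) node = node->pLeft; else node = node->pRight;"
Node* find(Node* node, int x)
{
    while (node != nullptr && node->value != x)
        node = node->link[x > node->value];  // 1 -> right, 0 -> left
    return node;
}
```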

#9


843  

In the sorted case, you can do better than relying on successful branch prediction or any branchless comparison trick: completely remove the branch.

Indeed, the sorted array is partitioned into a contiguous zone with data < 128 followed by another with data >= 128. So you can find the partition point with a dichotomic (binary) search (using Lg(arraySize) = 15 comparisons), then do a straight accumulation from that point.

Something like (unchecked)

int i= 0, j, k= arraySize;
while (i < k)
{
  j= (i + k) >> 1;
  if (data[j] >= 128)
    k= j;
  else
    i= j + 1;  // j + 1, not j, so the search always makes progress
}
sum= 0;
for (; i < arraySize; i++)
  sum+= data[i];

or, slightly more obfuscated

int i, j, k;
for (i= 0, k= arraySize; i < k; data[j] >= 128 ? (k= j) : (i= j + 1))
  j= (i + k) >> 1;
for (sum= 0; i < arraySize; i++)
  sum+= data[i];

A yet faster approach, which gives an approximate answer whether the data is sorted or not, is simply: sum= 3137536; (assuming a truly uniform distribution: 16384 samples with expected value 191.5) :-)
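
The arithmetic behind that constant (a sketch; the helper name is made up): of the 32768 uniform samples in [0, 255], an expected 16384 are >= 128, and those survivors are uniform over [128, 255] with mean (128 + 255) / 2 = 191.5, so the expected per-pass sum is 16384 × 191.5 = 3137536:

```cpp
// Expected per-pass sum: 16384 expected survivors times the expected
// surviving value 191.5, computed in exact integer arithmetic.
long long expected_sum()
{
    return 16384LL * 383 / 2;  // 383 / 2 == 191.5
}
```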

#10


643  

The above behavior happens because of branch prediction.

To understand branch prediction, one must first understand the instruction pipeline:

Every instruction is broken into a sequence of steps so that the steps of different instructions can be executed concurrently. This technique is known as instruction pipelining, and it is used to increase throughput in modern processors. To understand this better, see this example on Wikipedia.

Generally, modern processors have quite long pipelines, but for simplicity let's consider only these 4 steps:

  1. IF -- Fetch the instruction from memory
  2. ID -- Decode the instruction
  3. EX -- Execute the instruction
  4. WB -- Write back to CPU register

[image: 4-stage pipeline in general for 2 instructions]

Moving back to the question above, let's consider the following instructions:

                        A) if (data[c] >= 128)
                                /\
                               /  \
                              /    \
                        true /      \ false
                            /        \
                           /          \
                          /            \
                         /              \
              B) sum += data[c];          C) for loop or print().

Without branch prediction the following would occur:

To execute instruction B or instruction C, the processor has to wait until instruction A reaches the EX stage in the pipeline, since the decision to go to B or C depends on the result of instruction A. So the pipeline will look like this.

When the if condition returns true: [image]

When the if condition returns false: [image]

As a result of waiting for the result of instruction A, the total CPU cycles spent in the above case (without branch prediction; for both true and false) is 7.

So what is branch prediction?

The branch predictor tries to guess which way a branch (an if-then-else structure) will go before this is known for sure. Instead of waiting for instruction A to reach the EX stage of the pipeline, it guesses the decision and fetches the corresponding instruction (B or C in our example).

In the case of a correct guess, the pipeline looks something like this: [image]

If it is later detected that the guess was wrong then the partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay. The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.

In the OP's code, the first time the conditional is evaluated, the branch predictor has no information to base a prediction on, so it will pick the next instruction arbitrarily. Later in the for loop, it can base the prediction on the history. For an array sorted in ascending order, there are three possibilities:

  1. All the elements are less than 128
  2. All the elements are greater than (or equal to) 128
  3. Some starting elements are less than 128 and the later ones are greater than (or equal to) 128

Let us assume that the predictor will always assume the true branch on the first run.

So in the first case, it will always take the true branch, since historically all its predictions are correct. In the second case, initially it will predict wrong, but after a few iterations it will predict correctly. In the third case, it will initially predict correctly while the elements are less than 128; after the transition point it will fail for a short while, then correct itself once it sees the prediction failures in its history.

In all these cases the number of failures is small, so only rarely does the pipeline need to discard the partially executed instructions and start over with the correct branch, wasting few CPU cycles.

But in the case of a random unsorted array, the prediction is wrong about half the time, so the pipeline must discard the partially executed instructions and start over with the correct branch most of the time, costing many more CPU cycles than with the sorted array.
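
The history-based behavior described above can be illustrated by simulating a classic 2-bit saturating counter (a common dynamic predictor; this is a sketch, not a model of any particular CPU): states 0-1 predict "not taken", states 2-3 predict "taken", and every outcome nudges the counter one step toward itself.

```cpp
#include <vector>

// Count mispredictions of a 2-bit saturating counter over a sequence of
// branch outcomes (true = taken). Starts in state 0, "strongly not taken".
int mispredictions(const std::vector<bool>& outcomes)
{
    int state = 0;   // 0,1 -> predict not taken; 2,3 -> predict taken
    int misses = 0;
    for (bool taken : outcomes)
    {
        if ((state >= 2) != taken)
            ++misses;
        if (taken)  state = (state < 3) ? state + 1 : 3;
        else        state = (state > 0) ? state - 1 : 0;
    }
    return misses;
}
```

A "sorted" outcome sequence (a run of not-taken followed by a run of taken) costs only two misses at the transition, whereas an alternating sequence misses on every taken branch.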

#11


576  

An official answer would be from

  1. Intel - Avoiding the Cost of Branch Misprediction
  2. Intel - Branch and Loop Reorganization to Prevent Mispredicts
  3. Scientific papers - branch prediction computer architecture
  4. Books: J.L. Hennessy, D.A. Patterson: Computer Architecture: A Quantitative Approach
  5. Articles in scientific publications: T.Y. Yeh and Y.N. Patt wrote many of these on branch prediction.

You can also see from this lovely diagram why the branch predictor gets confused.

[image]

Each element in the original code is a random value

data[c] = std::rand() % 256;

so the predictor will change sides randomly with each new std::rand() value.

On the other hand, once the data is sorted, the predictor will first settle into the strongly-not-taken state, and when the values switch to the high range it will, within three runs, move all the way from strongly not taken to strongly taken.


#12


542  

Along the same lines (I don't think any answer has highlighted this yet), it's worth mentioning that sometimes, especially in software where performance matters such as the Linux kernel, you can find if statements like the following:

if (likely( everything_is_ok ))
{
    /* Do something */
}

or similarly:

if (unlikely(very_improbable_condition))
{
    /* Do something */    
}

Both likely() and unlikely() are in fact macros, defined using something like GCC's __builtin_expect, that help the compiler emit code which favours the expected side of the condition, taking into account the information provided by the programmer. GCC supports other builtins that can change the behavior of the running program or emit low-level instructions such as clearing the cache; see the documentation that goes through the available GCC builtins.

Normally this kind of optimization is mainly found in hard real-time applications or embedded systems where execution time matters and is critical. For example, if you are checking for some error condition that only happens 1/10000000 times, why not inform the compiler about it? That way, the branch prediction would assume by default that the condition is false.
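
For reference, these macros are conventionally defined on top of GCC's __builtin_expect roughly as follows (the Linux kernel's definitions are essentially this form; the `!!` normalizes any truthy value to exactly 0 or 1):

```cpp
// Conventional likely/unlikely for GCC/Clang; elsewhere they can simply
// expand to the bare condition, so code using them still compiles.
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif
```

The hint only influences code layout (which side of the branch falls through); it never changes the result of the condition.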

#13


522  

Frequently used Boolean operations in C++ produce many branches in the compiled program. If these branches are inside loops and are hard to predict, they can slow down execution significantly. Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true.

Boolean variables are overdetermined in the sense that all operators that take Boolean variables as input check whether the inputs have any value other than 0 or 1, while operators that produce Booleans as output can produce no value other than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary. Consider this example:

bool a, b, c, d;
c = a && b;
d = a || b;

This is typically implemented by the compiler in the following way:

bool a, b, c, d;
if (a != 0) {
    if (b != 0) {
        c = 1;
    }
    else {
        goto CFALSE;
    }
}
else {
    CFALSE:
    c = 0;
}
if (a == 0) {
    if (b == 0) {
        d = 0;
    }
    else {
        goto DTRUE;
    }
}
else {
    DTRUE:
    d = 1;
}

This code is far from optimal. The branches may take a long time in case of mispredictions. The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler does not make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources. The above code can be optimized if a and b have been initialized to valid values or if they come from operators that produce Boolean output. The optimized code looks like this:

char a = 0, b = 1, c, d;
c = a & b;
d = a | b;

char is used instead of bool in order to make it possible to use the bitwise operators (& and |) instead of the Boolean operators (&& and ||). The bitwise operators are single instructions that take only one clock cycle. The OR operator (|) works even if a and b have other values than 0 or 1. The AND operator (&) and the EXCLUSIVE OR operator (^) may give inconsistent results if the operands have other values than 0 and 1.

~ cannot be used for NOT. Instead, you can make a Boolean NOT on a variable which is known to be 0 or 1 by XOR'ing it with 1:

bool a, b;
b = !a;

can be optimized to:

char a = 0, b;
b = a ^ 1;

a && b cannot be replaced with a & b if b is an expression that should not be evaluated when a is false (&& will not evaluate b in that case; & will). Likewise, a || b cannot be replaced with a | b if b is an expression that should not be evaluated when a is true.
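
A concrete illustration of that caveat (a sketch; `calls` and `expensive()` are made-up names that just count evaluations):

```cpp
int calls = 0;

bool expensive()            // stands in for a side-effecting or costly operand
{
    ++calls;
    return true;
}

// && short-circuits: expensive() is skipped when a is false.
bool and_shortcircuit(bool a) { return a && expensive(); }

// & always evaluates both operands, even when a is false.
bool and_bitwise(bool a)      { return a & expensive(); }
```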

Using bitwise operators is more advantageous if the operands are variables than if the operands are comparisons:

bool a; double x, y, z;
a = x > y && z < 5.0;

is optimal in most cases (unless you expect the && expression to generate many branch mispredictions).

#14


188  

This question has already been answered excellently many times over. Still I'd like to draw the group's attention to yet another interesting analysis.

Recently this example (modified very slightly) was also used as a way to demonstrate how a piece of code can be profiled within the program itself on Windows. Along the way, the author also shows how to use the results to determine where the code is spending most of its time in both the sorted & unsorted case. Finally the piece also shows how to use a little known feature of the HAL (Hardware Abstraction Layer) to determine just how much branch misprediction is happening in the unsorted case.

The link is here: http://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/profile/demo.htm

#15


179  

That's for sure!...

Branch misprediction makes the logic run slower, because of the switching that happens in your code! It's like going down a straight street versus a street with a lot of turns: for sure, the straight one is going to be done quicker!...

If the array is sorted, your condition data[c] >= 128 is false for the first stretch, then becomes true all the way to the end of the street. That's how you get to the end of the logic faster. On the other hand, with an unsorted array you need a lot of turning and processing, which makes your code run slower for sure...

Look at the image I created for you below. Which street is going to be finished faster?

[image]

So programmatically, branch misprediction causes the process to be slower...

Finally, it's good to know there are two kinds of branch prediction, each of which affects your code differently:

1. Static

2. Dynamic

[image]

Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common. Loops do not necessarily require any special ordering of code for static branch prediction, as only the condition of the loop iterator is normally used.
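
As a sketch of that guideline (the case names and frequencies here are hypothetical, just to show the ordering):

```cpp
enum Kind { kCommon, kRare, kVeryRare };  // hypothetical case frequencies below

// Suppose profiling shows kCommon accounts for ~90% of calls: testing it
// first puts the hot case on the first, most predictable comparison.
int handle(Kind k)
{
    if (k == kCommon)        // ~90% of calls
        return 1;
    else if (k == kRare)     // ~9%
        return 2;
    else                     // ~1%
        return 3;
}
```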

#16


94  

Branch-prediction gain!

It is important to understand that branch misprediction doesn't slow down programs. The cost of a missed prediction is just as if branch prediction didn't exist and you waited for the evaluation of the expression to decide what code to run (further explanation in the next paragraph).

if (expression)
{
    // Run 1
} else {
    // Run 2
}

Whenever there's an if-else \ switch statement, the expression has to be evaluated to determine which block should be executed. In the assembly code generated by the compiler, conditional branch instructions are inserted.

Depending on some condition (in our case, the result of evaluating the expression), a branch instruction can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order (e.g. if the expression is false, the program skips the body of the if block).

That being said, the processor tries to predict the outcome before the expression is actually evaluated. It fetches instructions from the if block, and if the expression turns out to be true, then wonderful! We gained the time it took to evaluate it and made progress in the code; if not, we are running the wrong code, the pipeline is flushed, and the correct block is run.

Visualization:

Let's say you need to pick route 1 or route 2. Waiting for your partner to check the map, you would stop at ## and wait; or you could just pick route 1 and, if you were lucky (route 1 is the correct route), great, you didn't have to wait for your partner to check the map (you saved the time it would have taken him to check it); otherwise you just turn back.

While flushing the pipeline is super fast, nowadays taking this gamble is worth it. Predicting sorted data, or data that changes slowly, is always easier and better than predicting fast changes.

 O      Route 1  /-------------------------------
/|\             /
 |  ---------##/
/ \            \
                \
        Route 2  \--------------------------------

#17


76  

As others have already mentioned, behind the mystery is the branch predictor.

I'm not trying to add anything new, just to explain the concept in another way. There is a concise introduction on the wiki containing text and a diagram. I like the explanation below, which uses a diagram to elaborate on the branch predictor intuitively.

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code which follows immediately after the conditional jump, or it can be "taken" and jump to a different place in program memory where the second branch of code is stored. It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline (see fig. 1).

[image: fig. 1]

Based on described scenario, I have written an animation demo to show how instructions are executed in pipeline in different situations.

  1. Without the branch predictor.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline.

The example contains three instructions and the first one is a conditional jump instruction. The latter two instructions cannot go into the pipeline until the conditional jump instruction is executed.

[image]

It will take 9 clock cycles for 3 instructions to be completed.

  2. Use the branch predictor and don't take the conditional jump. Let's assume the prediction is that the conditional jump is not taken.

[image]

It will take 7 clock cycles for 3 instructions to be completed.

  3. Use the branch predictor and take the conditional jump. Let's assume the prediction is again that the conditional jump is not taken, so the taken jump is a misprediction.

[image]

It will take 9 clock cycles for 3 instructions to be completed.

The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.

As you can see, there seems to be no reason not to use a branch predictor.

It's quite a simple demo that clarifies the very basics of the branch predictor. If those gifs are annoying, feel free to remove them from the answer; visitors can also get the demo source from git.

#18


53  

It's about branch prediction. What is it?

  • A branch predictor is one of the ancient performance-improving techniques which still finds relevance in modern architectures. While the simple prediction techniques provide fast lookup and power efficiency, they suffer from a high misprediction rate.

  • On the other hand, complex branch predictors, either neural based or variants of two-level branch prediction, provide better prediction accuracy, but they consume more power and their complexity increases exponentially.

  • In addition to this, in the complex prediction techniques the time taken to predict the branches is itself very high, ranging from 2 to 5 cycles, which is comparable to the execution time of actual branches.

  • Branch prediction is essentially an optimization (minimization) problem where the emphasis is on achieving the lowest possible miss rate, low power consumption, and low complexity with minimum resources.

There really are three different kinds of branches:

Forward conditional branches - based on a run-time condition, the PC (program counter) is changed to point to an address forward in the instruction stream.

Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states the loop should be executed again.

Unconditional branches - this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language simply as "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump, coded for example as "jmpne", redirects the instruction stream only if the result of comparing two values in a previous "compare" instruction shows the values to be not equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

Static/dynamic Branch Prediction: Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

References:

#19


36  

Besides the fact that the branch prediction may slow you down, a sorted array has another advantage:

You can have a stop condition instead of just checking each value; this way you only loop over the relevant data and ignore the rest.
The branch prediction will miss only once.

 // sort backwards (higher values first)
 std::sort(data, data + arraySize, std::greater<int>());

 for (unsigned c = 0; c < arraySize; ++c) {
       if (data[c] < 128) {
              break;
       }
       sum += data[c];               
 }

#20


28  

On ARM, no branch is needed, because every instruction has a 4-bit condition field, which is tested at zero cost. This eliminates the need for short branches. The inner loop would look something like the following, and there would be no branch prediction penalty. Therefore, the sorted version would run slower than the unsorted version on ARM, because of the extra overhead of sorting:

MOV R0, #0     // R0 = sum = 0
MOV R1, #0     // R1 = c = 0
ADR R2, data   // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop    // Inner loop branch label
    LDRB R3, [R2, R1]     // R3 = data[c]
    CMP R3, #128          // compare R3 to 128
    ADDGE R0, R0, R3      // if R3 >= 128, then sum += data[c] -- no branch needed!
    ADD R1, R1, #1        // c++
    CMP R1, #arraySize    // compare c to arraySize
    BLT inner_loop        // Branch to inner_loop if c < arraySize
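
The same predication idea can be written in portable C++; compilers frequently lower this form to a conditional move (or, on 32-bit ARM, a conditionally executed add) instead of a branch. A sketch; whether a branch is actually emitted is up to the compiler:

```cpp
// Branch-free form of "if (data[c] >= 128) sum += data[c];":
// the comparison yields 0 or 1, which scales the value being added.
long long sum_ge_128(const int* data, unsigned n)
{
    long long sum = 0;
    for (unsigned c = 0; c < n; ++c)
        sum += static_cast<long long>(data[c] >= 128) * data[c];
    return sum;
}
```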

#21


13  

In simple words: Summary of everyone's wonderful answers for beginners...

This happens due to something called the branch prediction.

Basically, the computer predicts the next step and executes it. If the prediction is wrong, it goes back one step and executes with the other choice; if it is right, it just continues.

The answer to your question is very simple. If the array is unsorted, the computer makes many wrong predictions, which leads to an increased chance of errors. But if the data is sorted, the computer makes fewer wrong predictions and there is less chance of error.

Sorted Array: Straight Road

____________________________________________________________________________
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

Unsorted Array: Curved Road
______     _______
|           |__|

Branch prediction: Guessing/predicting which road is straight and following it without checking
___________________________________________ Straight road
   |_________________________________________|Longer road

Although both the roads reach the same destination, the straight road is shorter, and the other is longer. If then you choose the other by mistake, there is no turning back, and so you will waste some extra time if you choose the longer road. This is similar to what happens in the computer, and I hope this helped you understand better.

#1


27770  

You are a victim of branch prediction fail.


What is Branch Prediction?

Consider a railroad junction:

Image by Mecanismo, via Wikimedia Commons. Used under the CC-By-SA 3.0 license.

Now for the sake of argument, suppose this is back in the 1800s - before long distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia. So they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement: At the processor level, it is a branch instruction:

You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines. So they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times that the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every 3 times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches. So modern branch predictors will typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: "Branch predictor" article on Wikipedia.


As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they will all enter the if-statement.

This is very friendly to the branch predictor since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless because it can't predict random data. Thus there will probably be around 50% misprediction. (no better than random guessing)

data[] = 226, 185, 125, 158, 198, 144, 217, 79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,  N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)

So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - Netbeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the Branch: There is a huge difference between the sorted and unsorted data.
  • With the Hack: There is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops. (such as in this example)


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move. So there is no difference between the sorted and unsorted data - both are fast.

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. So not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test-loop to defeat the benchmark...

  • If you give the Intel Compiler the branchless code, it just outright vectorizes it... and is just as fast as with the branch (with the loop interchange).

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

#2


3498  

Branch prediction.

With a sorted array, the condition data[c] >= 128 is first false for a streak of values, then becomes true for all later values. That's easy to predict. With an unsorted array, you pay for the branching cost.

#3


2843  

The reason why performance improves drastically when the data is sorted is that the branch prediction penalty is removed, as explained beautifully in Mysticial's answer.

Now, if we look at the code

if (data[c] >= 128)
    sum += data[c];

we can find that the meaning of this particular if... else... branch is to add something when a condition is satisfied. This type of branch can be easily transformed into a conditional move statement, which would be compiled into a conditional move instruction: cmovl, in an x86 system. The branch and thus the potential branch prediction penalty is removed.

In C, and thus C++, the statement that compiles directly (without any optimization) into the conditional move instruction on x86 is the ternary operator ... ? ... : .... So we rewrite the above statement into an equivalent one:

sum += data[c] >= 128 ? data[c] : 0;

While maintaining readability, we can check the speedup factor.

On an Intel Core i7-2600K @ 3.4 GHz and Visual Studio 2010 Release Mode, the benchmark is (format copied from Mysticial):

x86

//  Branch - Random
seconds = 8.885

//  Branch - Sorted
seconds = 1.528

//  Branchless - Random
seconds = 3.716

//  Branchless - Sorted
seconds = 3.71

x64

//  Branch - Random
seconds = 11.302

//  Branch - Sorted
seconds = 1.830

//  Branchless - Random
seconds = 2.736

//  Branchless - Sorted
seconds = 2.737

The result is robust in multiple tests. We get a great speedup when the branch result is unpredictable, but we suffer a little bit when it is predictable. In fact, when using a conditional move, the performance is the same regardless of the data pattern.

Now let's look more closely by investigating the x86 assembly they generate. For simplicity, we use two functions max1 and max2.

max1 uses the conditional branch if... else ...:

int max1(int a, int b) {
    if (a > b)
        return a;
    else
        return b;
}

max2 uses the ternary operator ... ? ... : ...:

int max2(int a, int b) {
    return a > b ? a : b;
}

On an x86-64 machine, GCC -S generates the assembly below.

max1:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    -8(%rbp), %eax
    jle     .L2
    movl    -4(%rbp), %eax
    movl    %eax, -12(%rbp)
    jmp     .L4
.L2:
    movl    -8(%rbp), %eax
    movl    %eax, -12(%rbp)
.L4:
    movl    -12(%rbp), %eax
    leave
    ret

max2:
    movl    %edi, -4(%rbp)
    movl    %esi, -8(%rbp)
    movl    -4(%rbp), %eax
    cmpl    %eax, -8(%rbp)
    cmovge  -8(%rbp), %eax
    leave
    ret

max2 uses much less code due to the use of the cmovge instruction. But the real gain is that max2 does not involve a branch jump, jmp, which would carry a significant performance penalty if the predicted result is not right.

So why does a conditional move perform better?

In a typical x86 processor, the execution of an instruction is divided into several stages. Roughly, we have different hardware to deal with different stages. So we do not have to wait for one instruction to finish to start a new one. This is called pipelining.

In a branch case, the following instruction is determined by the preceding one, so we cannot do pipelining. We have to either wait or predict.

In a conditional move case, the execution of the conditional move instruction is divided into several stages, but the earlier stages like Fetch and Decode do not depend on the result of the previous instruction; only the later stages need the result. Thus, we wait only a fraction of one instruction's execution time. This is why the conditional move version is slower than the branch when prediction is easy.

The book Computer Systems: A Programmer's Perspective, second edition explains this in detail. You can check Section 3.6.6 for Conditional Move Instructions, entire Chapter 4 for Processor Architecture, and Section 5.11.2 for a special treatment for Branch Prediction and Misprediction Penalties.

Sometimes modern compilers can optimize our code into assembly with better performance, and sometimes they can't (the code in question uses Visual Studio's native compiler). Knowing the performance difference between a branch and a conditional move when the outcome is unpredictable can help us write better-performing code when the scenario gets so complex that the compiler cannot optimize it automatically.

#4


1943  

If you are curious about even more optimizations that can be done to this code, consider this:

Starting with the original loop:

for (unsigned i = 0; i < 100000; ++i)
{
    for (unsigned j = 0; j < arraySize; ++j)
    {
        if (data[j] >= 128)
            sum += data[j];
    }
}

With loop interchange, we can safely change this loop to:

for (unsigned j = 0; j < arraySize; ++j)
{
    for (unsigned i = 0; i < 100000; ++i)
    {
        if (data[j] >= 128)
            sum += data[j];
    }
}

Then, you can see that the if conditional is constant throughout the execution of the i loop, so you can hoist the if out:

for (unsigned j = 0; j < arraySize; ++j)
{
    if (data[j] >= 128)
    {
        for (unsigned i = 0; i < 100000; ++i)
        {
            sum += data[j];
        }
    }
}

Then, you see that the inner loop can be collapsed into one single expression, assuming the floating-point model allows it (for example, if /fp:fast is enabled; with the integer data here, the folding is always exact):

for (unsigned j = 0; j < arraySize; ++j)
{
    if (data[j] >= 128)
    {
        sum += data[j] * 100000;
    }
}

That one is 100,000x faster than before

#5


1618  

No doubt some of us would be interested in ways of identifying code that is problematic for the CPU's branch-predictor. The Valgrind tool cachegrind has a branch-predictor simulator, enabled by using the --branch-sim=yes flag. Running it over the examples in this question, with the number of outer loops reduced to 10000 and compiled with g++, gives these results:
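
A concrete invocation might look like the following (the source file name is an assumption, and `<pid>` stands for whatever process id cachegrind appends to its output file):

```shell
# Build with debug info so cg_annotate can map counts back to source lines.
g++ -O2 -g sumtest.cpp -o sumtest_sorted

# Run under cachegrind with the branch-predictor simulation enabled.
valgrind --tool=cachegrind --branch-sim=yes ./sumtest_sorted

# Annotate the per-line branch counts (Bc/Bcm columns) from the output file.
cg_annotate --auto=yes cachegrind.out.<pid>
```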

Sorted:

==32551== Branches:        656,645,130  (  656,609,208 cond +    35,922 ind)
==32551== Mispredicts:         169,556  (      169,095 cond +       461 ind)
==32551== Mispred rate:            0.0% (          0.0%     +       1.2%   )

Unsorted:

==32555== Branches:        655,996,082  (  655,960,160 cond +  35,922 ind)
==32555== Mispredicts:     164,073,152  (  164,072,692 cond +     460 ind)
==32555== Mispred rate:           25.0% (         25.0%     +     1.2%   )

Drilling down into the line-by-line output produced by cg_annotate we see for the loop in question:

Sorted:

          Bc    Bcm Bi Bim
      10,001      4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .      .  .   .      {
           .      .  .   .          // primary loop
 327,690,000 10,016  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .      .  .   .          {
 327,680,000 10,006  0   0              if (data[c] >= 128)
           0      0  0   0                  sum += data[c];
           .      .  .   .          }
           .      .  .   .      }

Unsorted:

          Bc         Bcm Bi Bim
      10,001           4  0   0      for (unsigned i = 0; i < 10000; ++i)
           .           .  .   .      {
           .           .  .   .          // primary loop
 327,690,000      10,038  0   0          for (unsigned c = 0; c < arraySize; ++c)
           .           .  .   .          {
 327,680,000 164,050,007  0   0              if (data[c] >= 128)
           0           0  0   0                  sum += data[c];
           .           .  .   .          }
           .           .  .   .      }

This lets you easily identify the problematic line - in the unsorted version the if (data[c] >= 128) line is causing 164,050,007 mispredicted conditional branches (Bcm) under cachegrind's branch-predictor model, whereas it's only causing 10,006 in the sorted version.


Alternatively, on Linux you can use the performance counters subsystem to accomplish the same task, but with native performance using CPU counters.

perf stat ./sumtest_sorted

Sorted:

 Performance counter stats for './sumtest_sorted':

  11808.095776 task-clock                #    0.998 CPUs utilized          
         1,062 context-switches          #    0.090 K/sec                  
            14 CPU-migrations            #    0.001 K/sec                  
           337 page-faults               #    0.029 K/sec                  
26,487,882,764 cycles                    #    2.243 GHz                    
41,025,654,322 instructions              #    1.55  insns per cycle        
 6,558,871,379 branches                  #  555.455 M/sec                  
       567,204 branch-misses             #    0.01% of all branches        

  11.827228330 seconds time elapsed

Unsorted:

 Performance counter stats for './sumtest_unsorted':

  28877.954344 task-clock                #    0.998 CPUs utilized          
         2,584 context-switches          #    0.089 K/sec                  
            18 CPU-migrations            #    0.001 K/sec                  
           335 page-faults               #    0.012 K/sec                  
65,076,127,595 cycles                    #    2.253 GHz                    
41,032,528,741 instructions              #    0.63  insns per cycle        
 6,560,579,013 branches                  #  227.183 M/sec                  
 1,646,394,749 branch-misses             #   25.10% of all branches        

  28.935500947 seconds time elapsed

It can also do source code annotation with disassembly.

perf record -e branch-misses ./sumtest_unsorted
perf annotate -d sumtest_unsorted
 Percent |      Source code & Disassembly of sumtest_unsorted
------------------------------------------------
...
         :                      sum += data[c];
    0.00 :        400a1a:       mov    -0x14(%rbp),%eax
   39.97 :        400a1d:       mov    %eax,%eax
    5.31 :        400a1f:       mov    -0x20040(%rbp,%rax,4),%eax
    4.60 :        400a26:       cltq   
    0.00 :        400a28:       add    %rax,-0x30(%rbp)
...

See the performance tutorial for more details.

#6


1089  

I just read up on this question and its answers, and I feel an answer is missing.

A common way to eliminate branch misprediction penalties that I've found to work particularly well in managed languages is a table lookup instead of a branch (although I haven't tested it in this case).

This approach works in general if:

  1. It's a small table and is likely to be cached in the processor
  2. You are running things in a quite tight loop and/or the processor can pre-load the data

Background and why

Pfew, so what the hell is that supposed to mean?

From a processor perspective, your memory is slow. To compensate for the difference in speed, they build a couple of caches into your processor (the L1/L2 caches). So imagine that you're doing your nice calculations and figure out that you need a piece of memory. The processor will issue a 'load' operation and load the piece of memory into cache - and then use the cache to do the rest of the calculations. Because memory is relatively slow, this 'load' will slow down your program.

Like branch prediction, this was optimized in the Pentium processors: the processor predicts that it needs to load a piece of data and attempts to load that into the cache before the operation actually hits the cache. As we've already seen, branch prediction sometimes goes horribly wrong -- in the worst case scenario you need to go back and actually wait for a memory load, which will take forever (in other words: failing branch prediction is bad, a memory load after a branch prediction fail is just horrible!).

Fortunately for us, if the memory access pattern is predictable, the processor will load it in its fast cache and all is well.

The first thing we need to know is what is small? While smaller is generally better, a rule of thumb is to stick to lookup tables that are <= 4096 bytes in size. As an upper limit: if your lookup table is larger than 64K it's probably worth reconsidering.

Constructing a table

So we've figured out that we can create a small table. Next thing to do is get a lookup function in place. Lookup functions are usually small functions that use a couple of basic integer operations (and, or, xor, shift, add, subtract and perhaps a multiply). You want to have your input translated by the lookup function to some kind of 'unique key' in your table, which then simply gives you the answer of all the work you wanted it to do.

In this case: >= 128 means we can keep the value, < 128 means we get rid of it. The easiest way to do that is by using an 'AND': if we keep it, we AND it with 7FFFFFFF; if we want to get rid of it, we AND it with 0. Notice also that 128 is a power of 2 -- so we can go ahead and make a table of 32768/128 integers and fill it with one zero and a lot of 7FFFFFFF's.

Managed languages

You might wonder why this works well in managed languages. After all, managed languages check the boundaries of the arrays with a branch to ensure you don't mess up...

Well, not exactly... :-)

There has been quite some work on eliminating this branch for managed languages. For example:

for (int i=0; i<array.Length; ++i)
   // Use array[i]

In this case it's obvious to the compiler that the boundary condition will never be hit. At least the Microsoft JIT compiler (but I expect Java does similar things) will notice this and remove the check altogether. WOW - that means no branch. Similarly, it will deal with other obvious cases.

If you run into trouble with lookups in managed languages, the key is to add a & 0x[something]FFF to your lookup function to make the boundary check predictable - and watch it go faster.

The result for this case

// Generate data
int arraySize = 32768;
int[] data = new int[arraySize];

Random rnd = new Random(0);
for (int c = 0; c < arraySize; ++c)
    data[c] = rnd.Next(256);

// To keep the spirit of the code intact I'll make a separate lookup table
// (I assume we cannot modify 'data' or the number of loops)
int[] lookup = new int[256];

for (int c = 0; c < 256; ++c)
    lookup[c] = (c >= 128) ? c : 0;

// Test
DateTime startTime = System.DateTime.Now;
long sum = 0;

for (int i = 0; i < 100000; ++i)
{
    // Primary loop
    for (int j = 0; j < arraySize; ++j)
    {
        // Here you basically want to use simple operations - so no
        // random branches, but things like &, |, *, -, +, etc. are fine.
        sum += lookup[data[j]];
    }
}

DateTime endTime = System.DateTime.Now;
Console.WriteLine(endTime - startTime);
Console.WriteLine("sum = " + sum);

Console.ReadLine();

#7


974  

As the data is distributed between 0 and 255, when the array is sorted, around the first half of the iterations will not enter the if-statement (the if-statement is shown below).

if (data[c] >= 128)
    sum += data[c];

The question is: What makes the above statement not execute in certain cases as in case of sorted data? Here comes the "branch predictor". A branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance!

Let's do some benchmarking to understand it better

The performance of an if-statement depends on whether its condition has a predictable pattern. If the condition is always true or always false, the branch prediction logic in the processor will pick up the pattern. On the other hand, if the pattern is unpredictable, the if-statement will be much more expensive.

Let’s measure the performance of this loop with different conditions:

for (int i = 0; i < max; i++)
    if (condition)
        sum++;

Here are the timings of the loop with different true-false patterns:

Condition                Pattern              Time (ms)

(i & 0x80000000) == 0    T repeated           322
(i & 0xffffffff) == 0    F repeated           276
(i & 1) == 0             TF alternating       760
(i & 3) == 0             TFFFTFFF…            513
(i & 2) == 0             TTFFTTFF…            1675
(i & 4) == 0             TTTTFFFFTTTTFFFF…    1275
(i & 8) == 0             8T 8F 8T 8F …        752
(i & 16) == 0            16T 16F 16T 16F …    490

A “bad” true-false pattern can make an if-statement up to six times slower than a “good” pattern! Of course, which pattern is good and which is bad depends on the exact instructions generated by the compiler and on the specific processor.

So there is no doubt about the impact of branch prediction on performance!

#8


914  

One way to avoid branch prediction errors is to build a lookup table, and index it using the data. Stefan de Bruijn discussed that in his answer.

But in this case, we know values are in the range [0, 255] and we only care about values >= 128. That means we can easily extract a single bit that will tell us whether we want a value or not: by shifting the data to the right 7 bits, we are left with a 0 bit or a 1 bit, and we only want to add the value when we have a 1 bit. Let's call this bit the "decision bit".

By using the 0/1 value of the decision bit as an index into an array, we can make code that will be equally fast whether the data is sorted or not sorted. Our code will always add a value, but when the decision bit is 0, we will add the value somewhere we don't care about. Here's the code:

// Test
clock_t start = clock();
long long a[] = {0, 0};
long long sum;

for (unsigned i = 0; i < 100000; ++i)
{
    // Primary loop
    for (unsigned c = 0; c < arraySize; ++c)
    {
        int j = (data[c] >> 7);
        a[j] += data[c];
    }
}

double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
sum = a[1];

This code wastes half of the adds, but never has a branch prediction failure. It's tremendously faster on random data than the version with an actual if statement.

But in my testing, an explicit lookup table was slightly faster than this, probably because indexing into a lookup table was slightly faster than bit shifting. This shows how my code sets up and uses the lookup table (unimaginatively called lut for "LookUp Table" in the code). Here's the C++ code:

// declare and then fill in the lookup table
int lut[256];
for (unsigned c = 0; c < 256; ++c)
    lut[c] = (c >= 128) ? c : 0;

// use the lookup table after it is built
for (unsigned i = 0; i < 100000; ++i)
{
    // Primary loop
    for (unsigned c = 0; c < arraySize; ++c)
    {
        sum += lut[data[c]];
    }
}

In this case the lookup table was only 256 bytes, so it fit nicely in cache and all was fast. This technique wouldn't work well if the data was 24-bit values and we only wanted half of them... the lookup table would be far too big to be practical. On the other hand, we can combine the two techniques shown above: first shift the bits over, then index a lookup table. For a 24-bit value that we only want the top half value, we could potentially shift the data right by 12 bits, and be left with a 12-bit value for a table index. A 12-bit table index implies a table of 4096 values, which might be practical.

EDIT: One thing I forgot to put in.

The technique of indexing into an array, instead of using an if statement, can be used for deciding which pointer to use. I saw a library that implemented binary trees, and instead of having two named pointers (pLeft and pRight or whatever) had a length-2 array of pointers, and used the "decision bit" technique to decide which one to follow. For example, instead of:

if (x < node->value)
    node = node->pLeft;
else
    node = node->pRight;

this library would do something like:

i = (x < node->value);
node = node->link[i];

Here's a link to this code: Red Black Trees, Eternally Confuzzled

#9


843  

In the sorted case, you can do better than relying on successful branch prediction or any branchless comparison trick: completely remove the branch.

Indeed, the array is partitioned into a contiguous zone with data < 128 and another with data >= 128. So you can find the partition point with a dichotomic (binary) search, using about log2(arraySize) = 15 comparisons, then do a straight accumulation from that point.

Something like (unchecked)

int i = 0, j, k = arraySize;
while (i < k)
{
  j = (i + k) >> 1;
  if (data[j] >= 128)
    k = j;
  else
    i = j + 1;   // j + 1, otherwise the loop never terminates when k == i + 1
}
sum = 0;
for (; i < arraySize; i++)
  sum += data[i];

or, slightly more obfuscated

int i, j, k;
for (i = 0, k = arraySize; i < k; )
{
  j = (i + k) >> 1;
  data[j] >= 128 ? k = j : i = j + 1;
}
for (sum = 0; i < arraySize; i++)
  sum += data[i];

A yet faster approach, which gives an approximate solution for both the sorted and the unsorted case, is: sum = 3137536; (assuming a truly uniform distribution, 16384 samples with expected value 191.5) :-)

#10


643  

The above behavior happens because of branch prediction.

To understand branch prediction, one must first understand the instruction pipeline:

Any instruction is broken into a sequence of steps so that different steps can be executed concurrently in parallel. This technique is known as instruction pipelining, and it is used to increase throughput in modern processors. To understand this better, please see this example on Wikipedia.

Generally, modern processors have quite long pipelines, but for simplicity let's consider only these 4 steps:

  1. IF -- Fetch the instruction from memory
  2. ID -- Decode the instruction
  3. EX -- Execute the instruction
  4. WB -- Write back to CPU register

[Figure: a 4-stage pipeline in general for 2 instructions]

Moving back to the above question, let's consider the following instructions:

                        A) if (data[c] >= 128)
                                /\
                               /  \
                              /    \
                        true /      \ false
                            /        \
                           /          \
                          /            \
                         /              \
              B) sum += data[c];          C) for loop or print().

Without branch prediction the following would occur:

To execute instruction B or instruction C, the processor has to wait until instruction A reaches the EX stage of the pipeline, since the decision to go to instruction B or instruction C depends on the result of instruction A. So the pipeline will look like this.

When the if condition returns true: [Figure: pipeline stalled waiting for instruction A, true case]

When the if condition returns false: [Figure: pipeline stalled waiting for instruction A, false case]

As a result of waiting for the result of instruction A, the total CPU cycles spent in the above case (without branch prediction; for both true and false) is 7.

So what is branch prediction?

The branch predictor tries to guess which way a branch (an if-then-else structure) will go before this is known for sure. It does not wait for instruction A to reach the EX stage of the pipeline; instead it guesses the decision and fetches the predicted instruction (B or C in our example).

In case of a correct guess, the pipeline looks something like this: [Figure: pipeline with a correct prediction]

If it is later detected that the guess was wrong then the partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay. The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. The longer the pipeline the greater the need for a good branch predictor.

In the OP's code, the first time the conditional is evaluated, the branch predictor has no information to base a prediction on, so it picks the next instruction arbitrarily. Later in the for loop, it can base the prediction on history. For an array sorted in ascending order, there are three possibilities:

  1. All the elements are less than 128
  2. All the elements are greater than or equal to 128
  3. The elements at the start are less than 128 and the later elements are greater than or equal to 128

Let us assume that the predictor will always assume the true branch on the first run.

So in the first case, it will always predict correctly after the first run, since the branch behaves the same way every time. In the second case, it will initially predict wrong, but after a few iterations it will predict correctly. In the third case, it will initially predict correctly while the elements are less than 128; after the transition it will fail for some time, then correct itself once it sees the branch-prediction failures in its history.

In all these cases the number of failures is small; as a result, only a few times will the pipeline need to discard the partially executed instructions and start over with the correct branch, costing fewer CPU cycles.

But in the case of a random, unsorted array, the prediction fails most of the time: the pipeline has to discard the partially executed instructions and start over with the correct branch again and again, resulting in more CPU cycles than with the sorted array.

#11


576  

An official answer would come from:

  1. Intel - Avoiding the Cost of Branch Misprediction
  2. Intel - Branch and Loop Reorganization to Prevent Mispredicts
  3. Scientific papers on branch prediction computer architecture
  4. Books: J.L. Hennessy, D.A. Patterson: Computer Architecture: A Quantitative Approach
  5. Articles in scientific publications: T.Y. Yeh and Y.N. Patt wrote many of these on branch prediction.

You can also see from this lovely diagram why the branch predictor gets confused.

[Figure: branch predictor state diagram]

Each element in the original code is a random value

data[c] = std::rand() % 256;

so the predictor will change sides as each new std::rand() value comes in.

On the other hand, once it's sorted, the predictor first moves into a state of strongly not taken, and when the values change to the high range, the predictor needs only three runs to change all the way from strongly not taken to strongly taken.


#12


542  

Along the same lines (I think this was not highlighted by any answer), it's good to mention that sometimes (especially in software where performance matters, like in the Linux kernel) you can find if statements like the following:

if (likely( everything_is_ok ))
{
    /* Do something */
}

or similarly:

if (unlikely(very_improbable_condition))
{
    /* Do something */    
}

Both likely() and unlikely() are in fact macros, defined using something like GCC's __builtin_expect, that help the compiler insert prediction-friendly code favouring the expected side of the condition, taking into account the information provided by the user. GCC supports other builtins that can change the behavior of the running program or emit low-level instructions such as clearing the cache; see the documentation that goes through the available GCC builtins.

Normally this kind of optimization is found mainly in hard real-time applications or embedded systems where execution time matters and is critical. For example, if you are checking for some error condition that only happens 1 in 10,000,000 times, why not inform the compiler about this? This way, by default, the branch prediction would assume that the condition is false.

#13


522  

Frequently used Boolean operations in C++ produce many branches in the compiled program. If these branches are inside loops and are hard to predict, they can slow down execution significantly. Boolean variables are stored as 8-bit integers with the value 0 for false and 1 for true.

Boolean variables are overdetermined in the sense that all operators that take Boolean inputs check whether the inputs have any value other than 0 or 1, while operators that produce a Boolean output can produce no value other than 0 or 1. This makes operations with Boolean variables as input less efficient than necessary. Consider this example:

bool a, b, c, d;
c = a && b;
d = a || b;

This is typically implemented by the compiler in the following way:

bool a, b, c, d;
if (a != 0) {
    if (b != 0) {
        c = 1;
    }
    else {
        goto CFALSE;
    }
}
else {
    CFALSE:
    c = 0;
}
if (a == 0) {
    if (b == 0) {
        d = 0;
    }
    else {
        goto DTRUE;
    }
}
else {
    DTRUE:
    d = 1;
}

This code is far from optimal. The branches may take a long time in case of mispredictions. The Boolean operations can be made much more efficient if it is known with certainty that the operands have no other values than 0 and 1. The reason why the compiler does not make such an assumption is that the variables might have other values if they are uninitialized or come from unknown sources. The above code can be optimized if a and b have been initialized to valid values or if they come from operators that produce Boolean output. The optimized code looks like this:

char a = 0, b = 1, c, d;
c = a & b;
d = a | b;

char is used instead of bool in order to make it possible to use the bitwise operators (& and |) instead of the Boolean operators (&& and ||). The bitwise operators are single instructions that take only one clock cycle. The OR operator (|) works even if a and b have other values than 0 or 1. The AND operator (&) and the EXCLUSIVE OR operator (^) may give inconsistent results if the operands have other values than 0 and 1.

~ cannot be used for NOT. Instead, you can make a Boolean NOT on a variable which is known to be 0 or 1 by XOR'ing it with 1:

bool a, b;
b = !a;

can be optimized to:

char a = 0, b;
b = a ^ 1;

a && b cannot be replaced with a & b if b is an expression that should not be evaluated when a is false (&& will not evaluate b; & will). Likewise, a || b cannot be replaced with a | b if b is an expression that should not be evaluated when a is true.

Using bitwise operators is more advantageous if the operands are variables than if the operands are comparisons:

bool a; double x, y, z;
a = x > y && z < 5.0;

is optimal in most cases (unless you expect the && expression to generate many branch mispredictions).

#14


188  

This question has already been answered excellently many times over. Still I'd like to draw the group's attention to yet another interesting analysis.

Recently this example (modified very slightly) was also used as a way to demonstrate how a piece of code can be profiled from within the program itself on Windows. Along the way, the author also shows how to use the results to determine where the code spends most of its time in both the sorted and unsorted cases. Finally, the piece shows how to use a little-known feature of the HAL (Hardware Abstraction Layer) to determine just how much branch misprediction is happening in the unsorted case.

The link is here: http://www.geoffchappell.com/studies/windows/km/ntoskrnl/api/ex/profile/demo.htm

#15


179  

That's for sure!...

Branch misprediction makes the logic run slower, because of the switching that happens in the code! It's like driving down a straight street versus a street with a lot of turns; for sure, the straight one is going to be finished quicker!...

If the array is sorted, your condition (data[c] >= 128) is false at the first steps, then becomes true for the whole way to the end of the street. That's how you get to the end of the logic faster. On the other hand, using an unsorted array, you need a lot of turning and processing, which makes your code run slower for sure...

Look at the image I created for you below. Which street is going to be finished faster?

[Figure: straight road vs. winding road]

So programmatically, branch misprediction causes the process to be slower...

Also, at the end, it's good to know there are two kinds of branch prediction, each of which affects your code differently:

1. Static

2. Dynamic

[Figure: static vs. dynamic branch prediction]

Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common. Loops do not necessarily require any special ordering of code for static branch prediction, as only the condition of the loop iterator is normally used.

#16


94  

Branch-prediction gain!

It is important to understand that branch misprediction doesn't slow down programs compared with having no prediction at all. The cost of a missed prediction is just as if branch prediction didn't exist and you had waited for the evaluation of the expression to decide what code to run (further explanation in the next paragraph).

if (expression)
{
    // Run 1
} else {
    // Run 2
}

Whenever there's an if-else / switch statement, the expression has to be evaluated to determine which block should be executed. In the assembly code generated by the compiler, conditional branch instructions are inserted.

A branch instruction can cause a computer to begin executing a different instruction sequence, deviating from its default behavior of executing instructions in order (e.g. if the expression is false, the program skips the code of the if block), depending on some condition, which in our case is the expression evaluation.

That being said, the processor tries to predict the outcome before the expression is actually evaluated. It fetches instructions from the predicted block, and if the guess turns out to be right, then wonderful! We gained the time it would have taken to evaluate the expression and made progress in the code; if not, then we ran the wrong code, the pipeline is flushed, and the correct block is run.

Visualization:

Let's say you need to pick route 1 or route 2. While waiting for your partner to check the map, you could stop at ## and wait, or you could just pick route 1 and, if you were lucky (route 1 is the correct route), great: you didn't have to wait for your partner to check the map (you saved the time it would have taken him to check it); otherwise you just turn back.

While flushing pipelines is super fast, nowadays taking this gamble is worth it. Predicting sorted data or data that changes slowly is always easier and better than predicting fast changes.

 O      Route 1  /-------------------------------
/|\             /
 |  ---------##/
/ \            \
                \
        Route 2  \--------------------------------

#17


76  

As has already been mentioned by others, behind the mystery is the branch predictor.

I'm not trying to add something new, but to explain the concept in another way. There is a concise introduction on the wiki which contains text and a diagram. I like the explanation below, which uses a diagram to elaborate on the branch predictor intuitively.

In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g. an if-then-else structure) will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high effective performance in many modern pipelined microprocessor architectures such as x86.

Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "not taken" and continue execution with the first branch of code which follows immediately after the conditional jump, or it can be "taken" and jump to a different place in program memory where the second branch of code is stored. It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline (see fig. 1).

[Figure 1: a conditional jump in the instruction pipeline]

Based on the described scenario, I have written an animation demo to show how instructions are executed in the pipeline in different situations.

  1. Without the Branch Predictor.

Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline.

The example contains three instructions, and the first one is a conditional jump instruction. The latter two instructions cannot go into the pipeline until the conditional jump instruction is executed.

[Figure: pipeline without branch prediction]

It will take 9 clock cycles for 3 instructions to be completed.

  2. Use the branch predictor and don't take the conditional jump. Let's assume that the prediction is not to take the conditional jump.

[Figure: pipeline with a correct "not taken" prediction]

It will take 7 clock cycles for 3 instructions to be completed.

  3. Use the branch predictor and take the conditional jump. Let's assume that the prediction was not to take the conditional jump, so the guess is wrong.

[Figure: pipeline with a misprediction (jump taken)]

It will take 9 clock cycles for 3 instructions to be completed.

The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.

As you can see, it seems we have no reason not to use a branch predictor.

It's quite a simple demo that clarifies the very basics of the branch predictor. If those GIFs are annoying, please feel free to remove them from the answer, and visitors can also get the demo from git.

#18


53  

It's about branch prediction. What is it?

  • A branch predictor is one of the ancient performance-improving techniques which still finds relevance in modern architectures. While the simple prediction techniques provide fast lookup and power efficiency, they suffer from a high misprediction rate.

  • On the other hand, complex branch predictors, either neural-based or variants of two-level branch prediction, provide better prediction accuracy, but they consume more power and their complexity increases exponentially.

  • In addition to this, in complex prediction techniques the time taken to predict the branches is itself very high, ranging from 2 to 5 cycles, which is comparable to the execution time of actual branches.

  • Branch prediction is essentially an optimization (minimization) problem where the emphasis is on achieving the lowest possible miss rate, low power consumption, and low complexity with minimum resources.

There really are three different kinds of branches:

Forward conditional branches - based on a run-time condition, the PC (program counter) is changed to point to an address forward in the instruction stream.

Backward conditional branches - the PC is changed to point backward in the instruction stream. The branch is based on some condition, such as branching backwards to the beginning of a program loop when a test at the end of the loop states the loop should be executed again.

Unconditional branches - this includes jumps, procedure calls, and returns that have no specific condition. For example, an unconditional jump instruction might be coded in assembly language simply as "jmp", and the instruction stream must immediately be directed to the target location pointed to by the jump instruction, whereas a conditional jump coded as "jmpne" redirects the instruction stream only if the result of a comparison of two values in a previous "compare" instruction shows the values to be not equal. (The segmented addressing scheme used by the x86 architecture adds extra complexity, since jumps can be either "near" (within a segment) or "far" (outside the segment). Each type has different effects on branch prediction algorithms.)

Static/dynamic Branch Prediction: Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.

#19


36  

Besides the fact that branch misprediction may slow you down, a sorted array has another advantage:

You can have a stop condition instead of just checking every value. This way you loop only over the relevant data and ignore the rest.
The branch prediction will miss only once.

// sort backwards (higher values first)
std::sort(data, data + arraySize, std::greater<int>());

for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] < 128)
        break;
    sum += data[c];
}

#20


28  

On ARM, no branch is needed, because every instruction has a 4-bit condition field, which is tested at (essentially) zero cost. This eliminates the need for short branches. The inner loop would look something like the following, and there would be no branch-prediction penalty. Therefore, the sorted version would run slower than the unsorted version on ARM, because of the extra overhead of sorting:

MOV R0, #0     // R0 = sum = 0
MOV R1, #0     // R1 = c = 0
ADR R2, data   // R2 = addr of data array (put this instruction outside outer loop)
.inner_loop    // Inner loop branch label
    LDRB R3, [R2, R1]     // R3 = data[c]
    CMP R3, #128          // compare R3 to 128
    ADDGE R0, R0, R3      // if R3 >= 128, then sum += data[c] -- no branch needed!
    ADD R1, R1, #1        // c++
    CMP R1, #arraySize    // compare c to arraySize
    BLT inner_loop        // Branch to inner_loop if c < arraySize

#21


13  

In simple words: Summary of everyone's wonderful answers for beginners...

This happens due to something called the branch prediction.

Basically, the computer predicts the next step and executes it. If the prediction is wrong, it goes one step back and executes another prediction; if it's right, it just continues.

The answer to your question is very simple. If the array is unsorted, the computer's many predictions will often be wrong, leading to an increased chance of errors. But if the data is sorted, the predictions are mostly right and there is less chance of error.

Sorted Array: Straight Road

____________________________________________________________________________
-  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  -  - 
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT

Unsorted Array: Curved Road
______     _______
|           |__|

Branch prediction: Guessing/predicting which road is straight and following it without checking
___________________________________________ Straight road
   |_________________________________________|Longer road

Although both roads reach the same destination, the straight road is shorter and the other is longer. If you choose the longer one by mistake, there is no turning back, so you will waste some extra time. This is similar to what happens in the computer, and I hope this helped you understand better.
