为什么从多个线程使用相同的缓存行不会导致严重的减速?

时间:2021-02-19 21:00:37

Look at this snippet:

看看这个片段:

#include <atomic>
#include <thread>

typedef volatile unsigned char Type;
// typedef std::atomic_uchar Type;

void fn(Type *p) {
    for (int i=0; i<500000000; i++) {
        (*p)++;
    }
}

int main() {
    const int N = 4;

    std::thread thr[N];
    alignas(64) Type buffer[N*64];

    for (int i=0; i<N; i++) {
        thr[i] = std::thread(&fn, &buffer[i*1]);
    }

    for (int i=0; i<N; i++) {
        thr[i].join();
    }

}

This little program increments four adjacent bytes a lot of times from four different threads. Before, I used the rule: don't use the same cache line from different threads, as cache line sharing is bad. So I expected that a four thread version (N=4) is much slower than a one thread version (N=1).

这个小程序从四个不同的线程中多次增加四个相邻的字节。以前,我使用了规则:不要使用来自不同线程的相同缓存行,因为缓存行共享很糟糕。所以我期望四线程版本(N = 4)比一个线程版本(N = 1)慢得多。

However, these are my measurements (on a Haswell CPU):

但是,这些是我的测量结果(在Haswell CPU上):

  • N=1: 1 sec
  • N = 1:1秒

  • N=4: 1.2 sec
  • N = 4:1.2秒

So N=4 is not much slower. If I use different cache lines (replace *1 with *64), then N=4 becomes a little faster: 1.1 sec.

所以N = 4并不慢很多。如果我使用不同的缓存行(用* 64替换* 1),那么N = 4会变得更快:1.1秒。

The same measurements for atomic access (swap the comments at typedef), same cache line:

原子访问的相同测量(在typedef交换注释),相同的缓存行:

  • N=1: 3.1 sec
  • N = 1:3.1秒

  • N=4: 48 sec
  • N = 4:48秒

So the N=4 case is much slower (as I expected). If different cache lines used, then N=4 has similar performance as N=1: 3.3 sec.

所以N = 4的情况要慢得多(正如我预期的那样)。如果使用不同的高速缓存行,那么N = 4具有与N = 1:3.3秒相似的性能。

I don't understand the reason behind these results. Why don't I get a serious slowdown the non-atomic, N=4 case? Four cores have the same memory in their caches, so they must synchronize them somehow, don't they? How can they run almost perfectly parallel? Why just the atomic case gets a serious slowdown?

我不明白这些结果背后的原因。为什么我不会严重减慢非原子,N = 4的情况?四个内核在其缓存中具有相同的内存,因此它们必须以某种方式同步它们,不是吗?他们怎么能几乎完全平行运行?为什么原子案会严重减速?


I think I need to understand how memory gets updated in this case. In the beginning, no cores have buffer in their caches. After one for iteration (in fn), all 4 cores have buffer in their cache-lines, but each core writes a different byte. How do these cache-lines get synchronized (in the non-atomic case)? How does the cache know, which byte is dirty? Or is there some other mechanism to handle this case? Why is this mechanism a lot cheaper (actually, it is almost free) than the atomic-one?

我想我需要了解在这种情况下如何更新内存。最初,没有核心在缓存中有缓冲区。在迭代之后(在fn中),所有4个内核在其缓存行中都有缓冲区,但每个内核写入不同的字节。这些缓存行如何同步(在非原子情况下)?缓存如何知道哪个字节是脏的?还是有其他机制来处理这种情况?为什么这个机制比原子机器便宜得多(实际上,它几乎是免费的)?

2 个解决方案

#1


22  

What you are seeing is basically the effect of store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!

您所看到的基本上是存储到转发转发的影响,允许每个核心大部分独立工作,尽管共享一个缓存行。正如我们将在下面看到的,真正的一个奇怪的情况是,更多的争用是不好的,直到某一点,然后更多的争论突然使事情变得非常快!

Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.

现在有了传统的争用观点,你的代码看起来像是一个高争用的东西,因此比理想慢得多。但是,只要每个内核在其写缓冲区中获得单个挂起写入,就会从写缓冲区(存储转发)中满足所有后续读取,并且稍后写入也会进入缓冲区。这将大部分工作转变为完全本地化的操作。高速缓存行仍在核心之间反弹,但它与核心执行路径分离,只需要实际提交商店然后1。

The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.

std :: atomic版本根本不能使用这种魔法,因为它必须使用锁定操作来维持原子性并击败存储缓冲区,因此您既可以看到争用的全部成本,也可以看到长延迟原子操作的成本2。

Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.

让我们试着收集一些证据证明这就是正在发生的事情。下面的所有讨论都涉及基准的非原子版本,该版本使用volatile来强制从缓冲区读取和写入。

Let's first check the assembly, to make sure it's what we expect:

让我们先检查一下装配,确保它符合我们的期望:

0000000000400c00 <fn(unsigned char volatile*)>:
  400c00:   ba 00 65 cd 1d          mov    edx,0x1dcd6500
  400c05:   0f 1f 00                nop    DWORD PTR [rax]
  400c08:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400c0b:   83 c0 01                add    eax,0x1
  400c0e:   83 ea 01                sub    edx,0x1
  400c11:   88 07                   mov    BYTE PTR [rdi],al
  400c13:   75 f3                   jne    400c08 <fn(unsigned char volatile*)+0x8>
  400c15:   f3 c3                   repz ret 

It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and a loop increment and conditional jump. Gcc has done a bad jump by breaking up the sub and jne, inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.

它很简单:一个五指令循环,带有字节加载,加载字节的增量,字节存储,循环增量和条件跳转。 Gcc通过分解sub和jne做了一个糟糕的跳跃,抑制了宏观融合,但总的来说它是可以的,并且存储转发延迟在任何情况下都会限制循环。

Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:

接下来,我们来看看L1D未命中的数量。每当核心需要写入被盗的线路时,它将遭受L1D未命中,我们可以用perf测量。首先,单线程(N = 1)情况:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       1070.188749      task-clock (msec)         #    0.998 CPUs utilized          
     2,775,874,257      cycles                    #    2.594 GHz                    
     2,504,256,018      instructions              #    0.90  insn per cycle         
       501,139,187      L1-dcache-loads           #  468.272 M/sec                  
            69,351      L1-dcache-load-misses     #    0.01% of all L1-dcache hits  

       1.072119673 seconds time elapsed

It's about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), just over 500,000,000 hits (the number of times we loop). Note also that we can easily calculate the cyles per loop: about 5.5, which primarily reflects the cost of store-to-load forwarding, plus one cycle for the increment, which is a carried dependency chain as the same location is repeatedly updated (and volatile means it can't be hoisted into a register).

这是我们所期望的:基本上没有L1D未命中(总数的0.01%,可能主要来自中断和循环外的其他代码),仅超过500,000,000次点击(我们循环的次数)。还要注意,我们可以很容易地计算每个循环的cyles:大约5.5,这主要反映了存储到转发转发的成本,加上一个循环的增量,这是一个携带的依赖链,因为相同的位置被重复更新(和volatile意味着它不能被提升到寄存器中)。

Let's take a look at the N=4 case:

我们来看看N = 4的情况:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       5920.758885      task-clock (msec)         #    3.773 CPUs utilized          
    15,356,014,570      cycles                    #    2.594 GHz                    
    10,012,249,418      instructions              #    0.65  insn per cycle         
     2,003,487,964      L1-dcache-loads           #  338.384 M/sec                  
        61,450,818      L1-dcache-load-misses     #    3.07% of all L1-dcache hits  

       1.569040529 seconds time elapsed

As expected the L1 loads jumps from 500 million to 2 billion, since there are 4 threads each doing the 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means 250 cycles between each miss.

正如预期的那样,L1负载从5亿增加到20亿,因为有4个线程各自负载5亿个负载。 L1D未命中数也增加了约1,000倍,达到约6,000万。尽管如此,与20亿个负载相比,这个数字并不是很多(20亿个商店 - 未显示,但我们知道它们在那里)。每次错过大约33次装载和~33次存储。它还意味着每次未命中之间有250个循环。

That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce around between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way to low.

这并不适合在核心之间不规则地反弹的缓存线模型,其中只要核心获得线路,另一个核心就需要它。我们知道,在共享L2的核心之间,线路可能会在20-50个周期内反弹,因此每250个周期丢失一次的比率似乎很低。

Two Hypotheses

A couple ideas spring to mind for the above described behavior:

对于上述行为,我们会想到几个想法:

  • Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the lock and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).

    也许这个芯片中使用的MESI协议变体是“智能的”并且认识到一条线路在几个核心中很热,但每次核心获得锁定时线路只进行少量工作,并且线路在L1之间花费更多时间和L2实际上满足一些核心的负载和存储。鉴于此,一致性协议中的一些智能组件决定对每一行强制执行某种最小“所有权时间”:在核心获得该行之后,即使另一个核心需要,它也将保持N个周期(其他核心只需要等待)。

    This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect [described here], where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.

    这将有助于平衡高速缓存行乒乓与实际工作的开销,代价是“公平”和其他核心的响应性,有点像不公平和公平锁定之间的权衡,以及抵消效果[描述]在这里,一致性协议越快越公平,一些(通常是合成的)循环可能会越差。

    Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!

    现在我从来没有听说过这样的事情(前面的链接显示,至少在Sandy-Bridge时代,事情正朝着相反的方向发展),但这肯定是可能的!

  • The store-buffer effect described is actually occurring, so most operations can complete almost locally.

    所描述的存储缓冲区效果实际上正在发生,因此大多数操作几乎可以在本地完成。

Some Tests

Let's try to distinguish two cases with some modifications.

让我们尝试通过一些修改来区分两种情况。

Reading and Writing Distinct Bytes

The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.

显而易见的方法是更改​​fn()工作函数,以便线程仍然在同一个缓存行上竞争,但是商店转发无法启动。

How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:

我们如何从位置x读取然后写入位置x + 1?我们将给每个线程两个连续的位置(即thr [i] = std :: thread(&fn,&buffer [i * 2])),因此每个线程在两个私有字节上运行。修改后的fn()看起来像:

for (int i=0; i<500000000; i++)
    unsigned char temp = p[0];
    p[1] = temp + 1;
}

The core loop is pretty much identical to earlier:

核心循环与之前完全相同:

  400d78:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400d7b:   83 c0 01                add    eax,0x1
  400d7e:   83 ea 01                sub    edx,0x1
  400d81:   88 47 01                mov    BYTE PTR [rdi+0x1],al
  400d84:   75 f2                   jne    400d78

The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].

唯一改变的是我们写入[rdi + 0x1]而不是[rdi]。

Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).

现在正如我上面提到的,原始(相同位置)循环实际上运行相当慢,每次迭代大约5.5个循环,即使在最佳情况下单线程情况下,因为循环承载load-> add-> store->加载...依赖。这个新代码打破了这个链条!负载不再依赖于存储,因此我们可以并行执行所有操作,并且我希望此循环每次迭代运行大约1.25个循环(5个指令/ CPU宽度为4)。

Here's the single threaded case:

这是单线程案例:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

        318.722631      task-clock (msec)         #    0.989 CPUs utilized          
       826,349,333      cycles                    #    2.593 GHz                    
     2,503,706,989      instructions              #    3.03  insn per cycle         
       500,973,018      L1-dcache-loads           # 1571.815 M/sec                  
            63,507      L1-dcache-load-misses     #    0.01% of all L1-dcache hits                 

       0.322146774 seconds time elapsed

So about 1.65 cycles per iteration3, about about three times faster versus incrementing the same location.

因此,每次迭代约为1.65个循环3,比增加相同位置快约三倍。

How about 4 threads?

4线程怎么样?

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

      22299.699256      task-clock (msec)         #    3.469 CPUs utilized          
    57,834,005,721      cycles                    #    2.593 GHz                    
    10,038,366,836      instructions              #    0.17  insn per cycle         
     2,011,160,602      L1-dcache-loads           #   90.188 M/sec                  
       237,664,926      L1-dcache-load-misses     #   11.82% of all L1-dcache hits  


       6.428730614 seconds time elapsed

So it's about 4 times slower than the same location case. Now rather than being just a bit slower than the single-threaded case it is about 20 times slower. This is the contention you've been looking for! Now also that the number of L1D misses has increased by a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.

所以它比同一个位置情况慢大约4倍。现在,它不是比单线程情况慢一点,而是慢了大约20倍。这是你一直在寻找的争论!现在,L1D未命中的数量也增加了4倍,很好地解释了性能下降,并且与存储到转发转发无法隐藏争用的想法一致,错失将增加很多。

Increasing the Distance Between Stores

Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g, if SPAN is 4, increment consecutively 4 locations like:

另一种方法是增加商店与后续负载之间的时间/指令距离。我们可以通过在fn()方法中递增SPAN连续位置来完成此操作,而不是始终使用相同的位置。例如,如果SPAN为4,则连续增加4个位置,例如:

for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}

Note that we are still incrementing 500 million locations in total, just spreading out the increments among 4 bytes. Intuitively you would expect overall performance to increase since you now have SPAN parallel dependency with length 1/SPAN, so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.

请注意,我们仍在增加5亿个位置,只是在4个字节之间分配增量。直觉上你会期望整体性能提高,因为你现在拥有长度为1 / SPAN的SPAN并行依赖,因此在上面的情况下你可能会期望性能提高4倍,因为4个并行链可以进行大约4倍的处理。总吞吐量。

Here's what we actually get for time (measured in cycles) for the 1 thread and 3 thread4, for SPAN values from 1 to 20:

这是我们实际获得的1个线程和3个线程4的时间(以周期测量),SPAN值从1到20:

为什么从多个线程使用相同的缓存行不会导致严重的减速?

Initially you see performance increase substantially in both single and multi-threaded cases; the increase from a SPAN of one to two and three is close to the theoretical expected in the case of perfect parallelism for both cases.

最初,您会发现单线程和多线程情况下性能都会大幅提升;从SPAN一到二和三的增加接近于两种情况下完美并行性的理论预期。

The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).

单线程案例的渐近线比单一位置写入速度快4.25倍:此时存储转发延迟不是瓶颈而且其他瓶颈已经接管(主要是最大IPC和存储端口争用)。

The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse compared to the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.

然而,多线程的情况非常不同!一旦达到大约7的SPAN,性能会迅速恶化,比SPAN = 1的情况下平衡约2.5倍,与SPAN = 5时的最佳性能相比差不多10倍。发生的情况是,存储到加载转发停止发生,因为存储和后续加载在商店退役到L1的时间/周期中足够远,因此负载实际上必须获得该线并参与MESI。

Also plotted is the L1D misses, which as mentioned above is indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much tracks exactly the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.

还绘制了L1D未命中,其如上所述指示核之间的“高速缓存线传输”。单线程机箱基本上为零,并且它们与性能无关。然而,多线程案例的性能几乎完全跟踪缓存未命中。 SPAN值在2到6范围内,其中存储转发仍然有效,但错失的比例相应较少。显然,核心能够在每个缓存行传输之间“缓冲”更多存储,因为核心循环更快。

Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.

另一种思考方式是,在竞争情况下,L1D未命中基本上每单位时间不变(这是有道理的,因为它们基本上与L1-> L2-> L1延迟相关联,加上一些一致性协议开销),所以在缓存行传输之间可以做的工作越多越好。

Here's the code for the multi-span case:

这是多范围案例的代码:

void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}

The bash script to run perf for all SPAN value from 1 to 20:

bash脚本为所有SPAN值从1到20运行perf:

PERF_ARGS=${1:--x, -r10}

for span in {1..20}; do
    g++ -std=c++11 -g -O2 -march=native -DSPAN=$span  cache-line-increment.cpp  -lpthread -o cache-line-increment
    perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done

Finally, "transpose" the results into proper CSV:

最后,将结果“转置”为适当的CSV:

FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp

A Final Test

There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) examine the sum of the final counter values (you'd need int counters rather than char). If everything was atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing around the lines. If the cores are working almost totally privately, the value would be closer to 500 million than 2 billion, ad I guess that's what you'll find.

有一个最后的测试,你可以做到表明每个核心有效地完成了大部分私有工作:使用基线的版本,其中线程在同一位置工作(这不会改变性能特征)检查总和最终计数器值(你需要int计数器而不是char)。如果一切都是原子的,那么你将有20亿的总和,而在非原子的情况下,总数与该值的接近程度是对核心绕线的频率的粗略测量。如果核心几乎完全私有化,价值将接近5亿而不是20亿,我想这就是你会发现的。

With some more clever incrementing, you can even have each thread track how often the value they incremented came from their last increment rather than another threads increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct thee way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.

通过一些更聪明的递增,您甚至可以让每个线程跟踪它们增加的值来自它们的最后一个增量而不是另一个线程增量的频率(例如,通过使用该值的几个位来存储线程标识符)。通过更加巧妙的测试,您可以实际重建高速缓存线在核心之间移动的方式(例如,核心A是否更愿意切换到核心B?)以及哪些核心对最终值贡献最大,等等

That's all left as an exercise :).

这一切都留作了练习:)。


1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.

1最重要的是,如果英特尔有一个合并存储缓冲区,后面的存储与之前的存储完全重叠,则会杀死早期存储,每次获取该行时,它只需要向L1(最新存储)提交一个值。

2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.

2你不能在这里真正区分这两种效果,但我们稍后会通过打败存储到转发转发来实现。

3 A bit more than I expected, perhaps bad scheduling leading to port pressure. If gcc would just all the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that I use -march=haswell instead of -march=native but I'm not going to go back and change all the numbers.

3比我预期的要多一点,也许是糟糕的调度导致港口压力。如果gcc只是所有的sub和jne融合,它每次迭代运行1.1个周期(仍然比我期望的1.0差)。这样做我会使用-march = haswell而不是-march = native但我不会回去改变所有的数字。

4 The results hold with 4 threads as well: but I only have 4 cores and I'm running stuff like Firefox in the background, so using 1 less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.

4结果也支持4个线程:但我只有4个核心,我在后台运行像Firefox这样的东西,所以使用1个核心可以减少测量噪音。以周期测量时间也有很大帮助。

#2


3  

The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.

原子版必须确保其他一些线程能够以顺序一致的方式读取结果。所以每次写都有围栏。

The volatile version does not make any relationships visible to the other cores, so does not try and synchronize the memory so it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.

易失性版本不会使任何关系对其他内核可见,因此不会尝试同步内存,以便在其他内核上可见。对于使用C ++ 11或更高版本的多线程系统,volatile不是线程之间通信的机制。

#1


22  

What you are seeing is basically the effect of store-to-load forwarding allowing each core to work mostly independently, despite sharing a cache line. As we will see below, it is truly a weird case where more contention is bad, up to a point, then even more contention suddenly makes things really fast!

您所看到的基本上是存储到转发转发的影响,允许每个核心大部分独立工作,尽管共享一个缓存行。正如我们将在下面看到的,真正的一个奇怪的情况是,更多的争用是不好的,直到某一点,然后更多的争论突然使事情变得非常快!

Now with the conventional view of contention your code seems like something that will be high contention and therefore much slower than ideal. What happens, however, is that as soon as each core gets a single pending write in its write buffer, all later reads can be satisfied from the write buffer (store forwarding), and later writes just go into the buffer as well. This turns most of the work into a totally local operation. The cache line is still bouncing around between the cores, but it's decoupled from the core execution path and is only needed to actually commit the stores now and then1.

现在有了传统的争用观点,你的代码看起来像是一个高争用的东西,因此比理想慢得多。但是,只要每个内核在其写缓冲区中获得单个挂起写入,就会从写缓冲区(存储转发)中满足所有后续读取,并且稍后写入也会进入缓冲区。这将大部分工作转变为完全本地化的操作。高速缓存行仍在核心之间反弹,但它与核心执行路径分离,只需要实际提交商店然后1。

The std::atomic version can't use this magic at all since it has to use locked operations to maintain atomicity and defeat the store buffer, so you see both the full cost of contention and the cost of the long-latency atomic operations2.

std :: atomic版本根本不能使用这种魔法,因为它必须使用锁定操作来维持原子性并击败存储缓冲区,因此您既可以看到争用的全部成本,也可以看到长延迟原子操作的成本2。

Let's try to actually collect some evidence that this is what's occurring. All of the discussion below deals with the non-atomic version of the benchmark that uses volatile to force reads and writes from buffer.

让我们试着收集一些证据证明这就是正在发生的事情。下面的所有讨论都涉及基准的非原子版本,该版本使用volatile来强制从缓冲区读取和写入。

Let's first check the assembly, to make sure it's what we expect:

让我们先检查一下装配,确保它符合我们的期望:

0000000000400c00 <fn(unsigned char volatile*)>:
  400c00:   ba 00 65 cd 1d          mov    edx,0x1dcd6500
  400c05:   0f 1f 00                nop    DWORD PTR [rax]
  400c08:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400c0b:   83 c0 01                add    eax,0x1
  400c0e:   83 ea 01                sub    edx,0x1
  400c11:   88 07                   mov    BYTE PTR [rdi],al
  400c13:   75 f3                   jne    400c08 <fn(unsigned char volatile*)+0x8>
  400c15:   f3 c3                   repz ret 

It's straightforward: a five instruction loop with a byte load, an increment of the loaded byte, a byte store, and a loop increment and conditional jump. Gcc has done a bad jump by breaking up the sub and jne, inhibiting macro-fusion, but overall it's OK and the store-forwarding latency is going to limit the loop in any case.

它很简单:一个五指令循环,带有字节加载,加载字节的增量,字节存储,循环增量和条件跳转。 Gcc通过分解sub和jne做了一个糟糕的跳跃,抑制了宏观融合,但总的来说它是可以的,并且存储转发延迟在任何情况下都会限制循环。

Next, let's take a look at the number of L1D misses. Every time a core needs to write into the line that has been stolen away, it will suffer an L1D miss, which we can measure with perf. First, the single threaded (N=1) case:

接下来,我们来看看L1D未命中的数量。每当核心需要写入被盗的线路时,它将遭受L1D未命中,我们可以用perf测量。首先,单线程(N = 1)情况:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       1070.188749      task-clock (msec)         #    0.998 CPUs utilized          
     2,775,874,257      cycles                    #    2.594 GHz                    
     2,504,256,018      instructions              #    0.90  insn per cycle         
       501,139,187      L1-dcache-loads           #  468.272 M/sec                  
            69,351      L1-dcache-load-misses     #    0.01% of all L1-dcache hits  

       1.072119673 seconds time elapsed

It's about what we expect: essentially zero L1D misses (0.01% of the total, probably mostly from interrupts and other code outside the loop), just over 500,000,000 hits (the number of times we loop). Note also that we can easily calculate the cyles per loop: about 5.5, which primarily reflects the cost of store-to-load forwarding, plus one cycle for the increment, which is a carried dependency chain as the same location is repeatedly updated (and volatile means it can't be hoisted into a register).

这是我们所期望的:基本上没有L1D未命中(总数的0.01%,可能主要来自中断和循环外的其他代码),仅超过500,000,000次点击(我们循环的次数)。还要注意,我们可以很容易地计算每个循环的cyles:大约5.5,这主要反映了存储到转发转发的成本,加上一个循环的增量,这是一个携带的依赖链,因为相同的位置被重复更新(和volatile意味着它不能被提升到寄存器中)。

Let's take a look at the N=4 case:

我们来看看N = 4的情况:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

       5920.758885      task-clock (msec)         #    3.773 CPUs utilized          
    15,356,014,570      cycles                    #    2.594 GHz                    
    10,012,249,418      instructions              #    0.65  insn per cycle         
     2,003,487,964      L1-dcache-loads           #  338.384 M/sec                  
        61,450,818      L1-dcache-load-misses     #    3.07% of all L1-dcache hits  

       1.569040529 seconds time elapsed

As expected the L1 loads jumps from 500 million to 2 billion, since there are 4 threads each doing the 500 million loads. The number of L1D misses also jumped by about a factor of 1,000, to about 60 million. Still, that number is not a lot compared to the 2 billion loads (and 2 billion stores - not shown, but we know they are there). That's ~33 loads and ~33 stores for every miss. It also means 250 cycles between each miss.

正如预期的那样,L1负载从5亿增加到20亿,因为有4个线程各自负载5亿个负载。 L1D未命中数也增加了约1,000倍,达到约6,000万。尽管如此,与20亿个负载相比,这个数字并不是很多(20亿个商店 - 未显示,但我们知道它们在那里)。每次错过大约33次装载和~33次存储。它还意味着每次未命中之间有250个循环。

That doesn't really fit the model of the cache line bouncing around erratically between the cores, where as soon as a core gets the line, another core demands it. We know that lines bounce around between cores sharing an L2 in perhaps 20-50 cycles, so the ratio of one miss every 250 cycles seems way to low.

这并不适合在核心之间不规则地反弹的缓存线模型,其中只要核心获得线路,另一个核心就需要它。我们知道,在共享L2的核心之间,线路可能会在20-50个周期内反弹,因此每250个周期丢失一次的比率似乎很低。

Two Hypotheses

A couple ideas spring to mind for the above described behavior:

对于上述行为,我们会想到几个想法:

  • Perhaps the MESI protocol variant used in this chip is "smart" and recognizes that one line is hot among several cores, but only a small amount of work is being done each time a core gets the lock and the line spends more time moving between L1 and L2 than actually satisfying loads and stores for some core. In light of this, some smart component in the coherence protocol decides to enforce some kind of minimum "ownership time" for each line: after a core gets the line, it will keep it for N cycles, even if demanded by another core (the other cores just have to wait).

    也许这个芯片中使用的MESI协议变体是“智能的”并且认识到一条线路在几个核心中很热,但每次核心获得锁定时线路只进行少量工作,并且线路在L1之间花费更多时间和L2实际上满足一些核心的负载和存储。鉴于此,一致性协议中的一些智能组件决定对每一行强制执行某种最小“所有权时间”:在核心获得该行之后,即使另一个核心需要,它也将保持N个周期(其他核心只需要等待)。

    This would help balance out the overhead of cache line ping-pong with real work, at the cost of "fairness" and responsiveness of the other cores, kind of like the trade-off between unfair and fair locks, and counteracting the effect [described here], where the faster & fairer the coherency protocol is, the worse some (usually synthetic) loops may perform.

    这将有助于平衡高速缓存行乒乓与实际工作的开销,代价是“公平”和其他核心的响应性,有点像不公平和公平锁定之间的权衡,以及抵消效果[描述]在这里,一致性协议越快越公平,一些(通常是合成的)循环可能会越差。

    Now I've never heard of anything like that (and the immediately previous link shows that at least in the Sandy-Bridge era things were moving in the opposite direction), but it's certainly possible!

    现在我从来没有听说过这样的事情(前面的链接显示,至少在Sandy-Bridge时代,事情正朝着相反的方向发展),但这肯定是可能的!

  • The store-buffer effect described is actually occurring, so most operations can complete almost locally.

    所描述的存储缓冲区效果实际上正在发生,因此大多数操作几乎可以在本地完成。

Some Tests

Let's try to distinguish two cases with some modifications.

让我们尝试通过一些修改来区分两种情况。

Reading and Writing Distinct Bytes

The obvious approach is to change the fn() work function so that the threads still contend on the same cache line, but where store-forwarding can't kick in.

显而易见的方法是更改​​fn()工作函数,以便线程仍然在同一个缓存行上竞争,但是商店转发无法启动。

How about we just read from location x and then write to location x + 1? We'll give each thread two consecutive locations (i.e., thr[i] = std::thread(&fn, &buffer[i*2])) so each thread is operating on two private bytes. The modified fn() looks like:

我们如何从位置x读取然后写入位置x + 1?我们将给每个线程两个连续的位置(即thr [i] = std :: thread(&fn,&buffer [i * 2])),因此每个线程在两个私有字节上运行。修改后的fn()看起来像:

for (int i=0; i<500000000; i++)
    unsigned char temp = p[0];
    p[1] = temp + 1;
}

The core loop is pretty much identical to earlier:

核心循环与之前完全相同:

  400d78:   0f b6 07                movzx  eax,BYTE PTR [rdi]
  400d7b:   83 c0 01                add    eax,0x1
  400d7e:   83 ea 01                sub    edx,0x1
  400d81:   88 47 01                mov    BYTE PTR [rdi+0x1],al
  400d84:   75 f2                   jne    400d78

The only thing that's changed is that we write to [rdi+0x1] rather than [rdi].

唯一改变的是我们写入[rdi + 0x1]而不是[rdi]。

Now as I mentioned above, the original (same location) loop is actually running fairly slowly at about 5.5 cycles per iteration even in the best-case single-threaded case, because of the loop-carried load->add->store->load... dependency. This new code breaks that chain! The load no longer depends on the store so we can execute everything pretty much in parallel and I expect this loop to run at about 1.25 cycles per iteration (5 instructions / CPU width of 4).

现在正如我上面提到的,原始(相同位置)循环实际上运行相当慢,每次迭代大约5.5个循环,即使在最佳情况下单线程情况下,因为循环承载load-> add-> store->加载...依赖。这个新代码打破了这个链条!负载不再依赖于存储,因此我们可以并行执行所有操作,并且我希望此循环每次迭代运行大约1.25个循环(5个指令/ CPU宽度为4)。

Here's the single threaded case:

这是单线程案例:

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

        318.722631      task-clock (msec)         #    0.989 CPUs utilized          
       826,349,333      cycles                    #    2.593 GHz                    
     2,503,706,989      instructions              #    3.03  insn per cycle         
       500,973,018      L1-dcache-loads           # 1571.815 M/sec                  
            63,507      L1-dcache-load-misses     #    0.01% of all L1-dcache hits                 

       0.322146774 seconds time elapsed

So about 1.65 cycles per iteration3, about about three times faster versus incrementing the same location.

因此,每次迭代约为1.65个循环3,比增加相同位置快约三倍。

How about 4 threads?

4线程怎么样?

$ perf stat -e task-clock,cycles,instructions,L1-dcache-loads,L1-dcache-load-misses ./cache-line-increment 

 Performance counter stats for './cache-line-increment':

      22299.699256      task-clock (msec)         #    3.469 CPUs utilized          
    57,834,005,721      cycles                    #    2.593 GHz                    
    10,038,366,836      instructions              #    0.17  insn per cycle         
     2,011,160,602      L1-dcache-loads           #   90.188 M/sec                  
       237,664,926      L1-dcache-load-misses     #   11.82% of all L1-dcache hits  


       6.428730614 seconds time elapsed

So it's about 4 times slower than the same location case. Now rather than being just a bit slower than the single-threaded case it is about 20 times slower. This is the contention you've been looking for! Now also that the number of L1D misses has increased by a factor of 4 as well, nicely explaining the performance degradation and consistent with the idea that when store-to-load forwarding can't hide the contention, misses will increase by a lot.

所以它比同一个位置情况慢大约4倍。现在,它不是比单线程情况慢一点,而是慢了大约20倍。这是你一直在寻找的争论!现在,L1D未命中的数量也增加了4倍,很好地解释了性能下降,并且与存储到转发转发无法隐藏争用的想法一致,错失将增加很多。

Increasing the Distance Between Stores

Another approach would be to increase the distance in time/instructions between the store and the subsequent load. We can do this by incrementing SPAN consecutive locations in the fn() method, rather than always the same location. E.g, if SPAN is 4, increment consecutively 4 locations like:

另一种方法是增加商店与后续负载之间的时间/指令距离。我们可以通过在fn()方法中递增SPAN连续位置来完成此操作,而不是始终使用相同的位置。例如,如果SPAN为4,则连续增加4个位置,例如:

for (long i=0; i<500000000 / 4; i++) {
    p[0]++;
    p[1]++;
    p[2]++;
    p[3]++;
}

Note that we are still incrementing 500 million locations in total, just spreading out the increments among 4 bytes. Intuitively you would expect overall performance to increase since you now have SPAN parallel dependency with length 1/SPAN, so in the case above you might expect performance to improve by a factor of 4, since the 4 parallel chains can proceed at about 4 times the total throughput.

请注意,我们仍在增加5亿个位置,只是在4个字节之间分配增量。直觉上你会期望整体性能提高,因为你现在拥有长度为1 / SPAN的SPAN并行依赖,因此在上面的情况下你可能会期望性能提高4倍,因为4个并行链可以进行大约4倍的处理。总吞吐量。

Here's what we actually get for time (measured in cycles) for the 1 thread and 3 thread4, for SPAN values from 1 to 20:

这是我们实际获得的1个线程和3个线程4的时间(以周期测量),SPAN值从1到20:

为什么从多个线程使用相同的缓存行不会导致严重的减速?

Initially you see performance increase substantially in both single and multi-threaded cases; the increase from a SPAN of one to two and three is close to the theoretical expected in the case of perfect parallelism for both cases.

最初,您会发现单线程和多线程情况下性能都会大幅提升;从SPAN一到二和三的增加接近于两种情况下完美并行性的理论预期。

The single-threaded case reaches an asymptote of about 4.25x faster than the single-location write: at this point the store-forwarding latency isn't the bottleneck and other bottlenecks have taken over (max IPC and store port contention, mostly).

单线程案例的渐近线比单一位置写入速度快4.25倍:此时存储转发延迟不是瓶颈而且其他瓶颈已经接管(主要是最大IPC和存储端口争用)。

The multi-threaded case is very different, however! Once you hit a SPAN of about 7, the performance rapidly gets worse, leveling out at about 2.5 times worse than the SPAN=1 case and almost 10x worse compared to the best performance at SPAN=5. What happens is that store-to-load forwarding stops occurring because the store and subsequent load are far enough apart in time/cycles that the store has retired to L1, so the load actually has to get the line and participate in MESI.

然而,多线程的情况非常不同!一旦达到大约7的SPAN,性能会迅速恶化,比SPAN = 1的情况下平衡约2.5倍,与SPAN = 5时的最佳性能相比差不多10倍。发生的情况是,存储到加载转发停止发生,因为存储和后续加载在商店退役到L1的时间/周期中足够远,因此负载实际上必须获得该线并参与MESI。

Also plotted is the L1D misses, which as mentioned above is indicative of "cache line transfers" between cores. The single-threaded case has essentially zero, and they are uncorrelated with the performance. The performance of the multi-threaded case, however, pretty much tracks exactly the cache misses. With SPAN values in the 2 to 6 range, where store-forwarding is still working, there are proportionally fewer misses. Evidently the core is able to "buffer up" more stores between each cache line transfer since the core loop is faster.

还绘制了L1D未命中,其如上所述指示核之间的“高速缓存线传输”。单线程机箱基本上为零,并且它们与性能无关。然而,多线程案例的性能几乎完全跟踪缓存未命中。 SPAN值在2到6范围内,其中存储转发仍然有效,但错失的比例相应较少。显然,核心能够在每个缓存行传输之间“缓冲”更多存储,因为核心循环更快。

Another way to think of it is that in the contended case L1D misses are basically constant per unit-time (which makes sense, since they are basically tied to the L1->L2->L1 latency, plus some coherency protocol overhead), so the more work you can do in between the cache line transfers, the better.

另一种思考方式是,在竞争情况下,L1D未命中基本上每单位时间不变(这是有道理的,因为它们基本上与L1-> L2-> L1延迟相关联,加上一些一致性协议开销),所以在缓存行传输之间可以做的工作越多越好。

Here's the code for the multi-span case:

这是多范围案例的代码:

void fn(Type *p) {
    for (long i=0; i<500000000 / SPAN; i++) {
        for (int j = 0; j < SPAN; j++) {
            p[j]++;
        }
    }
}

The bash script to run perf for all SPAN value from 1 to 20:

bash脚本为所有SPAN值从1到20运行perf:

PERF_ARGS=${1:--x, -r10}

for span in {1..20}; do
    g++ -std=c++11 -g -O2 -march=native -DSPAN=$span  cache-line-increment.cpp  -lpthread -o cache-line-increment
    perf stat ${PERF_ARGS} -e cycles,L1-dcache-loads,L1-dcache-load-misses,machine_clears.count,machine_clears.memory_ordering ./cache-line-increment
done

Finally, "transpose" the results into proper CSV:

最后,将结果“转置”为适当的CSV:

FILE=result1.csv; for metric in cycles L1-dcache-loads L1-dcache-load-misses; do { echo $metric; grep $metric $FILE | cut -f1 -d,; } > ${metric}.tmp; done && paste -d, *.tmp

A Final Test

There's a final test that you can do to show that each core is effectively doing most of its work in private: use the version of the benchmark where the threads work on the same location (which doesn't change the performance characteristics) examine the sum of the final counter values (you'd need int counters rather than char). If everything was atomic, you'd have a sum of 2 billion, and in the non-atomic case how close the total is to that value is a rough measure of how frequently the cores were passing around the lines. If the cores are working almost totally privately, the value would be closer to 500 million than 2 billion, ad I guess that's what you'll find.

有一个最后的测试,你可以做到表明每个核心有效地完成了大部分私有工作:使用基线的版本,其中线程在同一位置工作(这不会改变性能特征)检查总和最终计数器值(你需要int计数器而不是char)。如果一切都是原子的,那么你将有20亿的总和,而在非原子的情况下,总数与该值的接近程度是对核心绕线的频率的粗略测量。如果核心几乎完全私有化,价值将接近5亿而不是20亿,我想这就是你会发现的。

With some more clever incrementing, you can even have each thread track how often the value they incremented came from their last increment rather than another threads increment (e.g., by using a few bits of the value to stash a thread identifier). With an even more clever test you could practically reconstruct thee way the cache line moved around between the cores (is there a pattern, e.g., does core A prefer to hand off to core B?) and which cores contributed most to the final value, etc.

通过一些更聪明的递增,您甚至可以让每个线程跟踪它们增加的值来自它们的最后一个增量而不是另一个线程增量的频率(例如,通过使用该值的几个位来存储线程标识符)。通过更加巧妙的测试,您可以实际重建高速缓存线在核心之间移动的方式(例如,核心A是否更愿意切换到核心B?)以及哪些核心对最终值贡献最大,等等

That's all left as an exercise :).

这一切都留作了练习:)。


1 On top of that, if Intel has a coalescing store buffer where later stores that fully overlap earlier ones kill the earlier stores, it would only have to commit one value to L1 (the latest store) every time it gets the line.

1最重要的是,如果英特尔有一个合并存储缓冲区,后面的存储与之前的存储完全重叠,则会杀死早期存储,每次获取该行时,它只需要向L1(最新存储)提交一个值。

2 You can't really separate the two effects here, but we will do it later by defeating store-to-load forwarding.

2你不能在这里真正区分这两种效果,但我们稍后会通过打败存储到转发转发来实现。

3 A bit more than I expected, perhaps bad scheduling leading to port pressure. If gcc would just all the sub and jne to fuse, it runs at 1.1 cycles per iteration (still worse than the 1.0 I'd expect). It will do that I use -march=haswell instead of -march=native but I'm not going to go back and change all the numbers.

3比我预期的要多一点,也许是糟糕的调度导致港口压力。如果gcc只是所有的sub和jne融合,它每次迭代运行1.1个周期(仍然比我期望的1.0差)。这样做我会使用-march = haswell而不是-march = native但我不会回去改变所有的数字。

4 The results hold with 4 threads as well: but I only have 4 cores and I'm running stuff like Firefox in the background, so using 1 less core makes the measurements a lot less noisy. Measuring time in cycles helps a lot too.

4结果也支持4个线程:但我只有4个核心,我在后台运行像Firefox这样的东西,所以使用1个核心可以减少测量噪音。以周期测量时间也有很大帮助。

#2


3  

The atomic version has to ensure that some other thread will be able to read the result in a sequentially consistent fashion. So there are fences for each write.

原子版必须确保其他一些线程能够以顺序一致的方式读取结果。所以每次写都有围栏。

The volatile version does not make any relationships visible to the other cores, so does not try and synchronize the memory so it is visible on other cores. For a multi-threaded system using C++11 or newer, volatile is not a mechanism for communicating between threads.

易失性版本不会使任何关系对其他内核可见,因此不会尝试同步内存,以便在其他内核上可见。对于使用C ++ 11或更高版本的多线程系统,volatile不是线程之间通信的机制。