Does software prefetching allocate a Line Fill Buffer (LFB)?

Date: 2022-09-01 11:13:31

I've realized that Little's Law limits how fast data can be transferred at a given latency and with a given level of concurrency. If you want to transfer something faster, you either need larger transfers, more transfers "in flight", or lower latency. For the case of reading from RAM, the concurrency is limited by the number of Line Fill Buffers.

A Line Fill Buffer is allocated when a load misses the L1 cache. Modern Intel chips (Nehalem, Sandy Bridge, Ivy Bridge, Haswell) have 10 LFBs per core, and thus are limited to 10 outstanding cache misses per core. If RAM latency is 75 ns (plausible), and each transfer is 128 bytes (a 64B cache line plus its hardware-prefetched twin), this limits bandwidth per core to: 10 * 128B / 75 ns ≈ 17 GB/s. Benchmarks such as single-threaded STREAM confirm that this is reasonably accurate.
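
To make the arithmetic concrete, here is the same calculation as a tiny program (my own sketch, using the 10-LFB, 128 B and 75 ns figures above as the assumptions):

    #include <stdio.h>

    int main(void) {
        /* Little's Law: bandwidth = outstanding requests * bytes per request / latency */
        double lfbs       = 10.0;   /* concurrent L1 misses per core */
        double bytes      = 128.0;  /* 64 B line + its hardware-prefetched twin */
        double latency_ns = 75.0;   /* assumed RAM latency */
        printf("max per-core bandwidth ~= %.1f GB/s\n", lfbs * bytes / latency_ns);
        return 0;                   /* prints ~17.1, since bytes/ns == GB/s */
    }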

The obvious way to reduce the latency would be prefetching the desired data with x64 instructions such as PREFETCHT0, PREFETCHT1, PREFETCHT2, or PREFETCHNTA, so that it doesn't have to be read from RAM. But I haven't been able to speed anything up by using them. The problem seems to be that the _mm_prefetch() intrinsics themselves consume LFBs, so they too are subject to the same limits. Hardware prefetches don't touch the LFBs, but they also will not cross page boundaries.
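
For reference, this is how I'm issuing them via the _mm_prefetch() intrinsic (a minimal sketch; the cache level each hint nominally targets is in the comments, though actual placement is implementation-specific):

    #include <xmmintrin.h>   /* _mm_prefetch() and the _MM_HINT_* constants */

    static void prefetch_one_line(const char *p) {
        /* One call per hint, just to show the mapping to the instructions. */
        _mm_prefetch(p, _MM_HINT_T0);   /* PREFETCHT0:  into all cache levels        */
        _mm_prefetch(p, _MM_HINT_T1);   /* PREFETCHT1:  into L2 and higher           */
        _mm_prefetch(p, _MM_HINT_T2);   /* PREFETCHT2:  into L3 and higher (or ~T1)  */
        _mm_prefetch(p, _MM_HINT_NTA);  /* PREFETCHNTA: non-temporal, low pollution  */
    }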

But I can't find any of this documented anywhere. The closest I've found is a 15-year-old article that mentions that prefetch on the Pentium III uses the Line Fill Buffers. I worry things may have changed since then. And since I think the LFBs are associated with the L1 cache, I'm not sure why a prefetch to L2 or L3 would consume them. And yet, the speeds I measure are consistent with this being the case.

So: Is there any way to initiate a fetch from a new location in memory without using up one of those 10 Line Fill Buffers, thus achieving higher bandwidth by skirting around Little's Law?

2 Answers

#1

First of all a minor correction - read the optimization guide, and you'll note that some HW prefetchers belong in the L2 cache, and as such are not limited by the number of fill buffers but rather by the L2 counterpart.

The "spatial prefetcher" (the co-located 64B line you mention, completing fetches to 128B chunks) is one of them, so in theory if you fetch every other line you'll be able to get a higher bandwidth (some DCU prefetchers might try to "fill the gaps" for you, but in theory they should have lower priority, so it might work).
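
A sketch of the "every other line" idea (mine, untested here): touch one 64B line per 128B chunk and rely on the spatial prefetcher to bring in the co-located twin.

    #include <stddef.h>
    #include <stdint.h>

    /* Read one 64 B line out of every 128 B chunk; if the spatial prefetcher
     * completes the pair, each tracked miss effectively moves 128 B. */
    uint64_t sum_every_other_line(const uint64_t *buf, size_t bytes)
    {
        uint64_t sum = 0;
        for (size_t off = 0; off + 128 <= bytes; off += 128)
            sum += buf[off / sizeof(uint64_t)];
        return sum;
    }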

However, the "king" prefetcher is the other guy, the "L2 streamer". Section 2.1.5.4 reads:

Streamer : This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page

The important part is -

The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.

This 2:1 ratio means that for a stream of accesses that is recognized by this prefetcher, it would always run ahead of your accesses. It's true that you won't see these lines in your L1 automatically, but it does mean that if all works well, you should always get L2 hit latency for them (once the prefetch stream has had enough time to run ahead and mitigate L3/memory latencies). You may only have 10 LFBs, but as you noted in your calculation - the shorter the access latency becomes, the faster you can replace them and the higher bandwidth you can reach. This is essentially splitting the L1 <-- mem latency into parallel streams of L1 <-- L2 and L2 <-- mem.
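
To put rough numbers on that (my back-of-the-envelope, assuming ~3 GHz and an L2 hit latency of around 12 cycles, i.e. ~4 ns): 10 LFBs * 64 B / 4 ns ≈ 160 GB/s of theoretical L1 fill bandwidth, which is at or above what the L2 can actually sustain - so once the streamer keeps your data in L2, the 10 LFBs stop being the limiting factor.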

As for the question in your headline - it stands to reason that prefetches attempting to fill the L1 would require a line fill buffer to hold the retrieved data for that level. This should probably include all L1 prefetches. As for SW prefetches, section 7.4.3 says:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • PREFETCH causes a DTLB (Data Translation Lookaside Buffer) miss. This applies to Pentium 4 processors with CPUID signature corresponding to family 15, model 0, 1, or 2. PREFETCH resolves DTLB misses and fetches data on Pentium 4 processors with CPUID signature corresponding to family 15, model 3.
  • An access to the specified address that causes a fault/exception.
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

...

So I assume you're right and SW prefetches are not a way to artificially increase your number of outstanding requests. However, the same explanation applies here as well - if you know how to use SW prefetching to access your lines well enough in advance, you may be able to mitigate some of the access latency and increase your effective BW. This however won't work for long streams for two reasons: 1) your cache capacity is limited (even if the prefetch is temporal, like t0 flavor), and 2) you still need to pay the full L1-->mem latency for each prefetch, so you're just moving your stress ahead a bit - if your data manipulation is faster than memory access, you'll eventually catch up with your SW prefetching. So this only works if you can prefetch all you need well enough in advance, and keep it there.
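
If you do try it, the usual shape is to prefetch a fixed distance ahead of the consuming loop - a minimal sketch (the 16-line distance is an arbitrary illustration that would need tuning, and per the caveats above it may not buy you anything):

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>

    #define PF_LINES_AHEAD 16                 /* ~1 KiB ahead; purely illustrative */

    uint64_t sum_with_sw_prefetch(const uint64_t *buf, size_t n)  /* n multiple of 8 */
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i += 8) {   /* 8 x uint64_t = one 64 B line */
            size_t ahead = i + PF_LINES_AHEAD * 8;
            if (ahead < n)
                _mm_prefetch((const char *)&buf[ahead], _MM_HINT_T1);
            for (size_t j = 0; j < 8; j++)
                sum += buf[i + j];
        }
        return sum;
    }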

#2

Based on my testing, all types of prefetch instructions consume line fill buffers on recent Intel mainstream CPUs.

In particular, I added some load & prefetch tests to uarch-bench, which use large-stride loads over buffers of various sizes. Here are typical results on my Skylake i7-6700HQ:

                     Benchmark   Cycles    Nanos
  16-KiB parallel        loads     0.50     0.19
  16-KiB parallel   prefetcht0     0.50     0.19
  16-KiB parallel   prefetcht1     1.15     0.44
  16-KiB parallel   prefetcht2     1.24     0.48
  16-KiB parallel prefetchtnta     0.50     0.19

  32-KiB parallel        loads     0.50     0.19
  32-KiB parallel   prefetcht0     0.50     0.19
  32-KiB parallel   prefetcht1     1.28     0.49
  32-KiB parallel   prefetcht2     1.28     0.49
  32-KiB parallel prefetchtnta     0.50     0.19

 128-KiB parallel        loads     1.00     0.39
 128-KiB parallel   prefetcht0     2.00     0.77
 128-KiB parallel   prefetcht1     1.31     0.50
 128-KiB parallel   prefetcht2     1.31     0.50
 128-KiB parallel prefetchtnta     4.10     1.58

 256-KiB parallel        loads     1.00     0.39
 256-KiB parallel   prefetcht0     2.00     0.77
 256-KiB parallel   prefetcht1     1.31     0.50
 256-KiB parallel   prefetcht2     1.31     0.50
 256-KiB parallel prefetchtnta     4.10     1.58

 512-KiB parallel        loads     4.09     1.58
 512-KiB parallel   prefetcht0     4.12     1.59
 512-KiB parallel   prefetcht1     3.80     1.46
 512-KiB parallel   prefetcht2     3.80     1.46
 512-KiB parallel prefetchtnta     4.10     1.58

2048-KiB parallel        loads     4.09     1.58
2048-KiB parallel   prefetcht0     4.12     1.59
2048-KiB parallel   prefetcht1     3.80     1.46
2048-KiB parallel   prefetcht2     3.80     1.46
2048-KiB parallel prefetchtnta    16.54     6.38

The key thing to note is that none of the prefetching techniques are much faster than loads at any buffer size. If any prefetch instruction didn't use the LFB, we would expect it to be very fast for a benchmark that fit into the level of cache it prefetches to. For example prefetcht1 brings lines into the L2, so for the 128-KiB test we might expect it to be faster than the load variant if it doesn't use LFBs.
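
The uarch-bench kernels themselves aren't reproduced here, but the flavor of a "parallel prefetcht1" test is roughly the following (my sketch, not the real code): walk a buffer of the chosen size one cache line at a time, issuing nothing but prefetches, so that as many misses as possible are in flight at once.

    #include <stddef.h>
    #include <xmmintrin.h>

    /* Only prefetches, one line apart, wrapping within a working set whose
     * size determines which cache level the test exercises. */
    void parallel_prefetcht1(const char *buf, size_t buf_bytes, size_t iters)
    {
        size_t off = 0;
        for (size_t i = 0; i < iters; i++) {
            _mm_prefetch(buf + off, _MM_HINT_T1);
            off += 64;
            if (off >= buf_bytes)
                off = 0;
        }
    }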

More conclusively, we can examine the l1d_pend_miss.fb_full counter, whose description is:

Number of times a request needed a FB (Fill Buffer) entry but there was no entry available for it. A request includes cacheable/uncacheable demands that are load, store or SW prefetch instructions.

The description already indicates that SW prefetches need LFB entries and testing confirmed it: for all types of prefetch, this figure was very high for any test where concurrency was a limiting factor. For example, for the 512-KiB prefetcht1 test:

 Performance counter stats for './uarch-bench --test-name 512-KiB parallel   prefetcht1':

        38,345,242      branches                                                    
     1,074,657,384      cycles                                                      
       284,646,019      mem_inst_retired.all_loads                                   
     1,677,347,358      l1d_pend_miss.fb_full                  

The fb_full value is more than the number of cycles, meaning that the LFB was full almost all the time (it can be more than the number of cycles since up to two loads might want an LFB per cycle). This workload is pure prefetches, so there is nothing to fill up the LFBs except prefetch.

The results of this test also contradict the claimed behavior in the section of the manual quoted by Leeor:

There are cases where a PREFETCH will not perform the data prefetch. These include:

  • ...
  • If the memory subsystem runs out of request buffers between the first-level cache and the second-level cache.

Clearly this is not the case here: the prefetch requests are not dropped when the LFBs fill up, but are stalled like a normal load until resources are available (this is not an unreasonable behavior: if you asked for a software prefetch, you probably want to get it, perhaps even if it means stalling).

We also note the following interesting behaviors:

  • It seems like there is some small difference between prefetcht1 and prefetcht2 as they report different performance for the 16-KiB test (the difference varies, but is consistently different), but if you repeat the test you'll see that this is more likely just run-to-run variation as those particular values are somewhat unstable (most other values are very stable).
  • For the L2 contained tests, we can sustain 1 load per cycle, but only one prefetcht0 prefetch. This is kind of weird because prefetcht0 should be very similar to a load (and it can issue 2 per cycle in the L1 cases).
  • Even though the L2 has ~12 cycle latency, we are able to fully hide that latency with only 10 LFBs: we get 1.0 cycles per load (limited by L2 throughput), not the 12 / 10 == 1.2 cycles per load that we'd expect (best case) if the LFB were the limiting factor (and very low values for fb_full confirm it). That's probably because the 12 cycle latency is the full load-to-use latency all the way to the execution core, which also includes several cycles of additional latency (e.g., L1 latency is 4-5 cycles), so the actual time spent in the LFB is less than 10 cycles.
  • For the L3 tests, we see values of 3.8-4.1 cycles, very close to the expected 42/10 = 4.2 cycles based on the L3 load-to-use latency. So we are definitely limited by the 10 LFBs when we hit the L3. Here prefetcht1 and prefetcht2 are consistently 0.3 cycles faster than loads or prefetcht0. Given the 10 LFBs, that equals 3 cycles less occupancy, more or less explained by the prefetch stopping at L2 rather than going all the way to L1.
  • prefetchtnta generally has much lower throughput than the others outside of L1. This probably means that prefetchtnta is actually doing what it is supposed to, and appears to bring lines into L1, not into L2, and only "weakly" into L3. So for the L2-contained tests it has concurrency-limited throughput as if it was hitting the L3 cache, and for the 2048-KiB case (1/3 of the L3 cache size) it has the performance of hitting main memory. prefetchnta limits L3 cache pollution (to something like only one way per set), so we seem to be getting evictions.

Could it be different?

Here's an older answer I wrote before testing, speculating on how it could work:

In general, I would expect any prefetch that results in data ending up in L1 to consume a line fill buffer, since I believe that the only path between L1 and the rest of the memory hierarchy is the LFB [1]. So SW and HW prefetches that target the L1 probably both use LFBs.

However, this leaves open the possibility that prefetches that target the L2 or higher levels don't consume LFBs. For the case of hardware prefetch, I'm quite sure this is the case: you can find many references that explain that HW prefetch is a mechanism to effectively get more memory parallelism beyond the maximum of 10 offered by the LFBs. Furthermore, it doesn't seem like the L2 prefetchers could use the LFBs if they wanted to: they live in/near the L2 and issue requests to higher levels, presumably using the superqueue, and wouldn't need the LFBs.

That leaves software prefetches that target the L2 (or higher), such as prefetcht1 and prefetcht2 [2]. Unlike requests generated by the L2, these start in the core, so they need some way to get from the core out, and this could be via the LFB. The Intel Optimization guide has the following interesting quote (emphasis mine):

Generally, software prefetching into the L2 will show more benefit than L1 prefetches. A software prefetch into L1 will consume critical hardware resources (fill buffer) until the cacheline fill completes. A software prefetch into L2 does not hold those resources, and it is less likely to have a negative performance impact. If you do use L1 software prefetches, it is best if the software prefetch is serviced by hits in the L2 cache, so the length of time that the hardware resources are held is minimized.

This would seem to indicate that software prefetches don't consume LFBs - but this quote only applies to the Knights Landing architecture, and I can't find similar language for any of the more mainstream architectures. It appears that the cache design of Knights Landing is significantly different (or the quote is wrong).


[1] In fact, I think that even non-temporal stores use the LFBs to get out of the execution core - but their occupancy time is short because as soon as they get to the L2 they can enter the superqueue (without actually going into L2) and then free up their associated LFB.

[2] I think both of these target the L2 on recent Intel, but this is also unclear - perhaps the t2 hint actually targets the LLC on some uarchs?

