How is x86 instruction cache synchronized?

Date: 2021-06-15 03:15:26

I like examples, so I wrote a bit of self-modifying code in C...

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h> // linux

int main(void) {
    unsigned char *c = mmap(NULL, 7, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|
                            MAP_ANONYMOUS, -1, 0); // get executable memory
    if (c == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }
    c[0] = 0b11000111; // mov (x86_64), immediate mode, full-sized (32 bits)
    c[1] = 0b11000000; // to register eax (000), which holds the return value
                       // according to the linux x86_64 calling convention
    c[6] = 0b11000011; // return
    for (c[2] = 0; c[2] < 30; c[2]++) { // incr immediate data after every run
        // rest of immediate data (c[3:6]) is already set to 0 by MAP_ANONYMOUS
        printf("%d ", ((int (*)(void)) c)()); // cast c to func ptr, call ptr
    }
    putchar('\n');
    return 0;
}

...which works, apparently:

>>> gcc -Wall -Wextra -std=c11 -D_GNU_SOURCE -o test test.c; ./test
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

But honestly, I didn't expect it to work at all. I expected the instruction containing c[2] = 0 to be cached upon the first call to c, after which all subsequent calls to c would ignore the repeated changes made to c (unless I somehow explicitly invalidated the cache). Luckily, my CPU appears to be smarter than that.

I guess the CPU compares RAM (assuming c even resides in RAM) with the instruction cache whenever the instruction pointer makes a large-ish jump (as with the call to the mmapped memory above), and invalidates the cache when it doesn't match (all of it?), but I'm hoping to get more precise information on that. In particular, I'd like to know whether this behavior can be considered predictable (barring differences in hardware and OS) and relied upon.

(I probably should refer to the Intel manual, but that thing is thousands of pages long and I tend to get lost in it...)

5 Answers

#1 (score: 24)

What you are doing is usually referred to as self-modifying code. Intel's platforms (and probably AMD's too) do the job of maintaining i/d cache coherency for you, as the manual points out (Manual 3A, System Programming):

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated.

But this assertion holds only as long as the same linear address is used for modifying and fetching the instruction, which is not the case for debuggers and binary loaders, since they don't run in the same address space:

Applications that include self-modifying code use the same linear address for modifying and fetching the instruction. Systems software, such as a debugger, that might possibly modify an instruction using a different linear address than that used to fetch the instruction, will execute a serializing operation, such as a CPUID instruction, before the modified instruction is executed, which will automatically resynchronize the instruction cache and prefetch queue.
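
For illustration, here is a minimal sketch of such a serializing operation in GNU C (my sketch, not from the manual; it assumes GCC or Clang inline assembly on x86-64, and the helper name serialize_cpu is made up):

#include <stdint.h>

/* Execute CPUID purely for its serializing side effect: it drains the
   pipeline and resynchronizes instruction fetch. The "memory" clobber
   also keeps the compiler from reordering the patch stores across it. */
static inline void serialize_cpu(void) {
    uint32_t eax = 0; // leaf 0; the results are discarded
    __asm__ volatile ("cpuid"
                      : "+a"(eax)
                      : /* no other inputs */
                      : "rbx", "rcx", "rdx", "memory");
}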

By contrast, on many other architectures, such as PowerPC, a serializing operation is always required and must be issued explicitly (E500 Core Manual):

3.3.1.2.1 Self-Modifying Code

When a processor modifies any memory location that can contain an instruction, software must ensure that the instruction cache is made consistent with data memory and that the modifications are made visible to the instruction fetching mechanism. This must be done even if the cache is disabled or if the page is marked caching-inhibited.

It is interesting to note that PowerPC requires a context-synchronizing instruction to be issued even when caches are disabled; I suspect this forces a flush of deeper data-processing units, such as the load/store buffers.

The code you proposed is unreliable on architectures without snooping or advanced cache-coherency facilities, and is therefore likely to fail there.
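
If you need the snippet to work on such architectures too, one portable option (my suggestion, not part of the original answer) is the __builtin___clear_cache builtin that GCC and Clang provide: it compiles to nothing on x86 and emits the required cache-maintenance sequence elsewhere.

/* After patching the bytes in buf, make them visible to instruction
   fetch. A no-op on x86; on ARM or PowerPC this emits the necessary
   cache-maintenance and synchronization instructions. */
void make_code_visible(char *buf, unsigned long len) {
    __builtin___clear_cache(buf, buf + len);
}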

Hope this helps.

#2 (score: 6)

It's pretty simple: a write to an address that sits in one of the lines of the instruction cache invalidates that line from the instruction cache. No "synchronization" is involved.

#3 (score: 4)

The CPU handles cache invalidation automatically; you don't have to do anything manually. Software can't reasonably predict what will or will not be in the CPU cache at any point in time, so it's up to the hardware to take care of this. When the CPU sees that you have modified data, it updates its various caches accordingly.

#4 (score: 4)

By the way, many x86 processors (that I worked on) snoop not only the instruction cache but also the pipeline and the instruction window - the instructions that are currently in flight. So self-modifying code will take effect as soon as the very next instruction. But you are encouraged to use a serializing instruction like CPUID to ensure that your newly written code will be executed.
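
Applied to the question's loop, a defensive variant might look like the sketch below (serialize_cpu() stands for a CPUID wrapper like the one sketched under answer #1; on x86 the snooping described above already makes it redundant, so it mostly documents intent):

for (c[2] = 0; c[2] < 30; c[2]++) { // patch, then serialize, then call
    serialize_cpu(); // hypothetical helper: CPUID via inline asm
    printf("%d ", ((int (*)(void)) c)());
}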

#5 (score: 2)

I just reached this page in one of my searches and want to share my knowledge of this area of the Linux kernel!

Your code executes as expected and there are no surprises for me here. The mmap() syscall and the processor's cache-coherency protocol do this trick for you. The flags "PROT_READ|PROT_WRITE|PROT_EXEC" ask mmap() to set up the iTLB and dTLB of the L1 cache, and the TLB of the L2 cache, for this physical page correctly. This low-level, architecture-specific kernel code does it differently depending on the processor architecture (x86, AMD, ARM, SPARC, etc.). Any kernel bug here will mess up your program!

This is just for explanation purposes. Assume that your system is not doing much and that there are no process switches between "a[0]=0b01000000;" and the start of "printf("\n");"... Also, assume that you have 1K of L1 iCache and 1K of dCache in your processor, plus some L2 cache in the core. (Nowadays these are on the order of a few MBs.)

  1. mmap() sets up your virtual address space and the iTLB1, dTLB1 and TLB2s.
  2. "a[0]=0b01000000;" will actually trap (H/W magic) into kernel code, your physical address will be set up, and all processor TLBs will be loaded by the kernel. Then you will be back in user mode, and your processor will actually load 16 bytes (H/W magic, a[0] to a[3]) into the L1 dCache and the L2 cache. The processor will really go to memory again only when you refer to a[4] and so on (ignore predictive loading for now!). By the time you complete "a[7]=0b11000011;", your processor has done 2 burst READs of 16 bytes each on the external bus. Still no actual WRITEs to physical memory. All WRITEs so far are happening within the L1 dCache (H/W magic, the processor knows) and the L2 cache, and the DIRTY bit is set for the cache line.
  3. "a[3]++;" will be a STORE instruction in the assembly code, but the processor will store that only in the L1 dCache and L2, and it will not go to physical memory.
  4. Let's come to the function call "a()". Again, the processor does the instruction fetch from the L2 cache into the L1 iCache, and so on.
  5. The result of this user-mode program will be the same on any Linux under any processor, thanks to the correct implementation of the low-level mmap() syscall and the cache-coherency protocol!
  6. If you write this code in an embedded-processor environment without the OS assistance of the mmap() syscall, you will hit exactly the problem you were expecting, because you are using neither the H/W mechanism (TLBs) nor a software mechanism (memory-barrier instructions); a bare-metal sketch follows this list.
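
As a concrete illustration of such a software mechanism, here is a bare-metal sketch for AArch64 (my sketch; it follows the architecture's usual clean/invalidate sequence, and the fixed 64-byte line size is an assumption - production code should derive it from CTR_EL0):

/* Make freshly written code in [buf, buf+len) visible to instruction
   fetch: clean the D-cache lines to the point of unification,
   invalidate the matching I-cache lines, then synchronize. */
static void flush_icache_range(char *buf, unsigned long len) {
    unsigned long line = 64; /* assumed; read CTR_EL0 in real code */
    unsigned long start = (unsigned long)buf & ~(line - 1);
    unsigned long end = (unsigned long)buf + len;
    for (unsigned long a = start; a < end; a += line)
        __asm__ volatile ("dc cvau, %0" :: "r"(a) : "memory"); // clean
    __asm__ volatile ("dsb ish" ::: "memory"); // writes reach the PoU
    for (unsigned long a = start; a < end; a += line)
        __asm__ volatile ("ic ivau, %0" :: "r"(a) : "memory"); // invalidate
    __asm__ volatile ("dsb ish" ::: "memory"); // invalidations complete
    __asm__ volatile ("isb" ::: "memory");     // refetch the new code
}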
