Which is the better write barrier on x86: lock+addl or xchgl?

Time: 2021-04-09 18:20:32

The Linux kernel uses lock; addl $0,0(%%esp) as a write barrier, while the RE2 library uses xchgl (%0),%0 as a write barrier. What's the difference, and which is better?

Does x86 also require read barrier instructions? RE2 defines its read barrier function as a no-op on x86 while Linux defines it as either lfence or no-op depending on whether SSE2 is available. When is lfence required?

4 answers

#1


7  

"lock; addl $0,0(%%esp)" is faster when we also want to test whether the lock variable at address (%%esp) is 0. Adding 0 leaves the variable unchanged, and the zero flag is set to 1 if the value at address (%%esp) is 0.

lfence, from the Intel datasheet:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. This serializing operation guarantees that every load instruction that precedes in program order the LFENCE instruction is globally visible before any load instruction that follows the LFENCE instruction is globally visible.

For instance, memory write instructions such as 'mov' are atomic (they don't need a lock prefix) if they are properly aligned. But such an instruction normally executes in the CPU cache and is not globally visible to all other threads at that moment, because a memory fence must be performed first.

EDIT:

So the main difference between these two instructions is that the xchgl instruction has no effect on the condition flags. We can certainly test the lock variable's state with a lock cmpxchg instruction, but that is still more complex than using a lock add $0 instruction.
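The flag behaviour described here can be checked with a few lines of C. This is only a sketch, assuming GCC or Clang inline assembly on an x86 target; the helper name zero_flag_after_locked_add is made up for illustration, and the non-x86 branch is just a fallback so the file still compiles elsewhere.

```c
/* Hypothetical helper: performs a locked "add $0" on a variable holding
   `value` and reports whether the zero flag was set afterwards. */
static int zero_flag_after_locked_add(int value) {
    int lock_var = value;
#if defined(__x86_64__) || defined(__i386__)
    unsigned char zf;
    /* lock add updates the flags from the result; setz copies ZF out. */
    __asm__ __volatile__("lock; addl $0, %1\n\t"
                         "setz %0"
                         : "=q"(zf), "+m"(lock_var)
                         :
                         : "memory", "cc");
    return zf;
#else
    /* Fallback (assumption): emulate the test with an atomic add of 0. */
    return __atomic_fetch_add(&lock_var, 0, __ATOMIC_SEQ_CST) == 0;
#endif
}
```

Calling zero_flag_after_locked_add(0) reports the flag as set, while any non-zero argument reports it as clear. An xchgl in the same position would leave the flags untouched, so the setz result would depend on whatever instruction ran before it.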

#2


9  

Quoting from the IA32 manuals (Vol 3A, Chapter 8.2: Memory Ordering):

In a single-processor system for memory regions defined as write-back cacheable, the memory-ordering model respects the following principles [..]

  • Reads are not reordered with other reads
  • Writes are not reordered with older reads
  • Writes to memory are not reordered with other writes, with the exception of
    • writes executed with the CLFLUSH instruction
    • streaming stores (writes) executed with the non-temporal move instructions ([list of instructions here])
    • string operations (see Section 8.2.4.1)
  • Reads may be reordered with older writes to different locations but not with older writes to the same location.
  • Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions
  • Reads cannot pass LFENCE and MFENCE instructions
  • Writes cannot pass SFENCE and MFENCE instructions

Note: The "In a single-processor system" above is slightly misleading. The same rules hold for each (logical) processor individually; the manual then goes on to describe the additional ordering rules between multiple processors. The only bit about it pertaining to the question is that

  • Locked instructions have a total order.

In short, as long as you're writing to write-back memory (which is all the memory you'll ever see unless you're a driver or graphics programmer), most x86 instructions are almost sequentially consistent: the only reordering an x86 CPU can perform is to move later (independent) reads ahead of earlier writes. The main thing about the write barriers is that they have a lock prefix (implicit or explicit), which forbids all reordering and ensures that the operations are seen in the same order by all processors in a multi-processor system.
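The two barriers from the question can be sketched side by side in C. This assumes GCC or Clang inline assembly; the function names are hypothetical, the xchgl variant exchanges a register with a local scratch variable rather than reproducing RE2's exact operand form, and the non-x86 branches are only fallbacks so the file compiles elsewhere.

```c
/* Full barrier via a locked no-op add to the top of the stack
   (explicit lock prefix). */
static inline void barrier_lock_add(void) {
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,0(%%rsp)" : : : "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,0(%%esp)" : : : "memory", "cc");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* fallback, an assumption */
#endif
}

/* Full barrier via xchg with a memory operand (lock is implicit). */
static inline void barrier_xchg(void) {
#if defined(__x86_64__) || defined(__i386__)
    int scratch = 0;  /* dummy memory operand */
    int reg = 0;
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r"(reg), "+m"(scratch)
                         :
                         : "memory");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* fallback, an assumption */
#endif
}

static int payload;
static int ready;

/* Stores the payload, then the flag, with a barrier after each store. */
static int publish_then_read(int value) {
    payload = value;
    barrier_lock_add();
    ready = 1;
    barrier_xchg();
    return ready ? payload : -1;
}
```

The "memory" clobber also stops the compiler from moving loads and stores across the barrier, which matters as much as the hardware ordering. The single-threaded publish_then_read only shows where the barriers would sit; it does not demonstrate any reordering.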

Also, in write-back memory, reads are never reordered with other reads, so there's no need for read barriers. Recent x86 processors have a weaker memory consistency model for streaming stores and write-combined memory (commonly used for mapped graphics memory). That's where the various fence instructions come into play; they're not necessary for any other memory type, but some drivers in the Linux kernel do deal with write-combined memory, so the kernel simply defines its read barrier that way. The list of ordering models per memory type is in Section 11.3.1 in Vol. 3A of the IA-32 manuals. Short version: Write-Through, Write-Back and Write-Protected memory allow speculative reads (following the rules detailed above), Uncacheable and Strong Uncacheable memory have strong ordering guarantees (no processor reordering, reads/writes are executed immediately, used for MMIO), and Write Combined memory has weak ordering (i.e. relaxed ordering rules that need fences).
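For ordinary write-back memory, the same point can be expressed with C11 atomics instead of hand-written barriers. The sketch below assumes a C11 compiler with <stdatomic.h>; the names message, message_ready, producer and consumer are made up for illustration. On x86, compilers typically emit plain mov instructions for the release store and the acquire load, with no lfence or sfence.

```c
#include <stdatomic.h>

static int message;               /* ordinary data */
static atomic_int message_ready;  /* flag that publishes the data */

/* Writer: store the data, then set the flag with release ordering. */
static void producer(int value) {
    message = value;
    atomic_store_explicit(&message_ready, 1, memory_order_release);
}

/* Reader: check the flag with acquire ordering, then read the data.
   Returns -1 if nothing has been published yet. */
static int consumer(void) {
    if (atomic_load_explicit(&message_ready, memory_order_acquire))
        return message;
    return -1;
}

/* Single-threaded driver, only to show the calling order. */
static int produce_then_consume(int value) {
    producer(value);
    return consumer();
}
```

Only a full fence such as atomic_thread_fence(memory_order_seq_cst) needs a real barrier instruction on x86, usually mfence or a locked instruction, because it has to stop later reads from passing earlier writes.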

#3


4  

As an aside to the other answers, the HotSpot devs found that lock; addl $0,0(%%esp) with a zero offset may not be optimal: on some processors it can introduce false data dependencies (see the related JDK bug).

Touching a stack location with a different offset can improve performance under some circumstances.
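A barrier that touches a different stack offset might look like the sketch below. It assumes GCC or Clang inline assembly; the offset of -64 is an arbitrary value chosen for illustration, not the one HotSpot uses, and the function names are hypothetical.

```c
/* Locked no-op add to a stack slot away from the top of the stack, so the
   barrier does not touch the word at 0(%esp) / 0(%rsp). Adding 0 leaves
   whatever is stored there unchanged. */
static inline void barrier_lock_add_offset(void) {
#if defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,-64(%%rsp)" : : : "memory", "cc");
#elif defined(__i386__)
    __asm__ __volatile__("lock; addl $0,-64(%%esp)" : : : "memory", "cc");
#else
    __atomic_thread_fence(__ATOMIC_SEQ_CST);  /* fallback, an assumption */
#endif
}

/* Stores a value, issues the barrier, and reads the value back. */
static int store_with_offset_barrier(int value) {
    static int slot;
    slot = value;
    barrier_lock_add_offset();
    return slot;
}
```

Whether this is actually faster than the zero-offset form depends on the processor, so it is worth measuring on the target hardware before adopting it.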

#4


3  

The important part of lock; addl and xchgl is the lock prefix, which is implicit for xchgl. There is really no difference between the two. I'd look at how they assemble and choose the one that's shorter (in bytes), since that's usually faster for equivalent operations on x86 (hence tricks like xorl %eax,%eax).

The presence of SSE2 is probably just a proxy for the real condition which is ultimately a function of cpuid. It probably turns out that SSE2 implies the existence of lfence and the availability of SSE2 was checked/cached at boot. lfence is required when it's available.
