Why is gcc allowed to speculatively load from a struct?

Example Showing the gcc Optimization and User Code that May Fault

The function 'foo' in the snippet below loads only one of the struct members A or B; at least, that is the intent of the unoptimized code.

typedef struct {
  int A;
  int B;
} Pair;

int foo(const Pair *P, int c) {
  int x;
  if (c)
    x = P->A;
  else
    x = P->B;
  return c/102 + x;
}

Here is what gcc -O3 gives:

mov eax, esi
mov edx, -1600085855
test esi, esi
mov ecx, DWORD PTR [rdi+4]   <-- *** load P->B ***
cmovne ecx, DWORD PTR [rdi]  <-- *** load P->A ***
imul edx
lea eax, [rdx+rsi]
sar esi, 31
sar eax, 6
sub eax, esi
add eax, ecx
ret

So it appears that gcc is allowed to speculatively load both struct members in order to eliminate branching. But then, is the following code considered undefined behavior or is the gcc optimization above illegal?

#include <stdlib.h>  

int naughty_caller(int c) {
  Pair *P = (Pair*)malloc(sizeof(Pair)-1); // *** Allocation is enough for A but not for B ***
  if (!P) return -1;

  P->A = 0x42; // *** Initializing allocation only where it is guaranteed to be allocated ***

  int res = foo(P, 1); // *** Passing c=1 to foo should ensure only P->A is accessed? ***

  free(P);
  return res;
}

If the load speculation happens in the above scenario, there is a chance that loading P->B will cause an exception, because the last byte of P->B may lie in unallocated memory. This exception will not happen if the optimization is turned off.

The Question

Is the load speculation performed by gcc above legal? Where does the spec say or imply that it's OK? If the optimization is legal, how does the code in 'naughty_caller' turn out to be undefined behavior?

6 Answers

#1

Reading a variable (that was not declared as volatile) is not considered to be a "side effect" as specified by the C standard. So the program is free to read a location and then discard the result, as far as the C standard is concerned.

This is very common. Suppose you request 1 byte of data from a 4 byte integer. The compiler may then read the whole 32 bits if that's faster (aligned read), and then discard everything but the requested byte. Your example is similar to this but the compiler decided to read the whole struct.

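
The "read more than was asked for, then discard" idea can be written out by hand. The following is only a sketch of what a compiler might do internally; byte_of is a hypothetical helper name, not something from the question or from gcc.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns byte `index` (0 = least significant) of *word.
   It loads the whole aligned 32-bit value and throws away the other bytes,
   which mirrors the wide-read-then-discard strategy described above. */
uint8_t byte_of(const uint32_t *word, unsigned index) {
    assert(index < 4);                  /* a 32-bit word has four bytes */
    uint32_t whole = *word;             /* one aligned 32-bit load */
    return (uint8_t)(whole >> (8u * index));  /* keep only the requested byte */
}
```

Because the extra bytes are never used, no conforming program can observe that they were read.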

Formally this is found in the behavior of "the abstract machine", C11 chapter 5.1.2.3. Given that the compiler follows the rules specified there, it is free to do as it pleases. And the only rules listed are regarding volatile objects and sequencing of instructions. Reading a different struct member in a volatile struct would not be ok.

As for allocating too little memory for the whole struct, that's undefined behavior, because the memory layout of the struct is usually not for the programmer to decide; for example, the compiler is allowed to add padding at the end. If there's not enough memory allocated, you might end up accessing forbidden memory even though your code only works with the first member of the struct.

#2

No, if *P is allocated correctly P->B will never be in unallocated memory. It might not be initialized, that is all.

The compiler has every right to do what it does. The only thing it is not allowed to do is to fault on the access to P->B with the excuse that it is not initialized. But what it does and how it does it is at the discretion of the implementation and not your concern.

If you cast a pointer to a block returned by malloc to Pair*, and that block is not guaranteed to be large enough to hold a Pair, the behavior of your program is undefined.
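
For contrast with naughty_caller, here is a minimal sketch of a caller that allocates room for the whole struct. The name polite_caller is made up for illustration; foo is repeated from the question so the snippet stands alone.

```c
#include <stdlib.h>

typedef struct {
  int A;
  int B;
} Pair;

/* foo exactly as in the question. */
int foo(const Pair *P, int c) {
  int x;
  if (c)
    x = P->A;
  else
    x = P->B;
  return c/102 + x;
}

/* Hypothetical caller: the block is large enough for a complete Pair,
   so a speculative load of either member stays inside the allocation. */
int polite_caller(int c) {
  Pair *P = malloc(sizeof *P);   /* full object, including any padding */
  if (!P) return -1;

  P->A = 0x42;
  P->B = 0;      /* needed only if c may be 0; harmless otherwise */

  int res = foo(P, c);

  free(P);
  return res;
}
```

Using sizeof *P rather than a hand-computed size keeps the allocation correct even if the compiler adds padding to Pair.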

#3

This is perfectly legal because reading some memory location isn't considered an observable behavior in the general case (volatile would change this).

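
To illustrate the remark about volatile, here is a sketch of the same function with a volatile-qualified parameter; foo_volatile is a hypothetical name. Since accesses to volatile objects are observable behavior, a conforming compiler should perform only the read that the abstract machine performs, so the speculative load of the other member should not appear. I have not checked the generated code for every gcc version, so treat this as an expectation rather than a measured result.

```c
typedef struct {
  int A;
  int B;
} Pair;

/* Same logic as foo, but every read through P is a volatile access.
   The compiler may not add, drop, or reorder such accesses. */
int foo_volatile(const volatile Pair *P, int c) {
  int x;
  if (c)
    x = P->A;   /* only this member is read when c != 0 */
  else
    x = P->B;   /* only this member is read when c == 0 */
  return c/102 + x;
}
```

This does not make naughty_caller valid: an undersized allocation is still not a Pair object.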

Your example code is indeed undefined behavior, but I can't find any passage in the standard docs that explicitly states this. But I think it's enough to have a look at the rules for effective types ... from N1570, §6.5 p6:

If a value is stored into an object having no declared type through an lvalue having a type that is not a character type, then the type of the lvalue becomes the effective type of the object for that access and for subsequent accesses that do not modify the stored value.

So, your write access to *P actually gives that object the type Pair; it therefore extends into memory you didn't allocate, and the result is an out-of-bounds access.

#4

A postfix expression followed by the -> operator and an identifier designates a member of a structure or union object. The value is that of the named member of the object to which the first expression points

If evaluating the expression P->A is well-defined, then P must actually point to an object of type Pair, and consequently P->B is well-defined as well.

#5

A -> operator on a Pair * implies that there's a whole Pair object fully allocated. (@Hurkyl quotes the standard.)

x86 (like any normal architecture) doesn't have side-effects for accessing normal allocated memory, so x86 memory semantics are compatible with the C abstract machine's semantics for non-volatile memory. Compilers can speculatively load if/when they think that will be a performance win on the target microarchitecture they're tuning for in any given situation.

Note that on x86, memory protection operates with page granularity. The compiler could unroll a loop or vectorize with SIMD in a way that reads outside an object, as long as all pages touched contain some bytes of the object. (See the related question "Is it safe to read past the end of a buffer within the same page on x86 and x64?") libc strlen() implementations hand-written in assembly do this, but AFAIK gcc doesn't; it instead uses scalar loops for the leftover elements at the end of an auto-vectorized loop, even where it has already aligned the pointers with a (fully unrolled) startup loop. (Perhaps because it would make runtime bounds-checking with valgrind difficult.)

To get the behaviour you were expecting, use a const int * arg.

An array is a single object, but pointers are different from arrays. (Even with inlining into a context where both array elements are known to be accessible, I wasn't able to get gcc to emit code like it does for the struct, so if its struct code is a win, it's a missed optimization not to do it on arrays when it's also safe.)

In C, you're allowed to pass this function a pointer to a single int, as long as c is non-zero. When compiling for x86, gcc has to assume that it could be pointing to the last int in a page, with the following page unmapped.

Source + gcc and clang output for this and other variations on the Godbolt compiler explorer

// exactly equivalent to  const int p[2]
int load_pointer(const int *p, int c) {
  int x;
  if (c)
    x = p[0];
  else
    x = p[1];  // gcc missed optimization: still does an add with c known to be zero
  return c + x;
}

load_pointer:    # gcc7.2 -O3
    test    esi, esi
    jne     .L9
    mov     eax, DWORD PTR [rdi+4]
    add     eax, esi         # missed optimization: esi=0 here so this is a no-op
    ret
.L9:
    mov     eax, DWORD PTR [rdi]
    add     eax, esi
    ret
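
The point about passing a pointer to a single int can be sketched as follows. single_int_caller is a hypothetical name; load_pointer is repeated from above so the snippet compiles on its own.

```c
/* load_pointer as shown above. */
int load_pointer(const int *p, int c) {
  int x;
  if (c)
    x = p[0];
  else
    x = p[1];
  return c + x;
}

/* Hypothetical caller: only one int exists, so p[1] must never be read.
   This is valid C because c is non-zero, which is why gcc cannot
   speculatively load p[1] here the way it loads P->B for the struct. */
int single_int_caller(void) {
  int only = 5;
  return load_pointer(&only, 1);  /* reads only p[0] */
}
```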

In C, you can sort of pass an array object (by reference) to a function, guaranteeing to the function that it's allowed to touch all of the memory even if the C abstract machine doesn't. The syntax is int p[static 2].

int load_array(const int p[static 2], int c) {
  ... // same body
}

But gcc doesn't take advantage, and emits identical code to load_pointer.

Off topic: clang compiles all versions (struct and array) the same way, using a cmov to branchlessly compute a load address.

    lea     rax, [rdi + 4]
    test    esi, esi
    cmovne  rax, rdi
    add     esi, dword ptr [rax]
    mov     eax, esi            # missed optimization: mov on the critical path
    ret

This isn't necessarily good: it has higher latency than gcc's struct code, because the load address depends on a couple of extra ALU uops. It is pretty good if both addresses aren't known to be safe to read and a branch would predict poorly.

We can get better code for the same strategy from gcc and clang, using setcc (1 uop with 1c latency on all CPUs except some really ancient ones), instead of cmovcc (2 uops on Intel before Skylake). xor-zeroing is always cheaper than an LEA, too.

int load_pointer_v3(const int *p, int c) {
  int offset = (c==0);
  int x = p[offset];
  return c + x;
}

    xor     eax, eax
    test    esi, esi
    sete    al
    add     esi, dword ptr [rdi + 4*rax]
    mov     eax, esi
    ret

gcc and clang both put the final mov on the critical path. And on Intel Sandybridge-family, the indexed addressing mode doesn't stay micro-fused with the add. So this would be better, like what it does in the branching version:

    xor     eax, eax
    test    esi, esi
    sete    al
    mov     eax, dword ptr [rdi + 4*rax]
    add     eax, esi
    ret

Simple addressing modes like [rdi] or [rdi+4] have 1c lower latency than others on Intel SnB-family CPUs, so this might actually be worse latency on Skylake (where cmov is cheap). The test and lea can run in parallel.

After inlining, that final mov probably wouldn't exist, and it could just add into esi.

#6

This is always allowed under the "as-if" rule if no conforming program can tell the difference. For example, an implementation could guarantee that after each block allocated with malloc, there are at least eight bytes that can be accessed without side effects. In that situation, the compiler can generate code that would be undefined behaviour if you wrote it in your code. So it would be legal for the compiler to read P[1] whenever P[0] is correctly allocated, even if that would be undefined behaviour in your own code.

But in your case, if you don't allocate enough memory for a struct, then reading any member is undefined behaviour. So here the compiler is allowed to do this, even if reading P->B crashes.
