CUDA C Best Practices: unsigned vs signed optimization

In the CUDA C Best Practices Guide there is a small section about using signed and unsigned integers.

In the C language standard, unsigned integer overflow semantics are well defined, whereas signed integer overflow causes undefined results. Therefore, the compiler can optimize more aggressively with signed arithmetic than it can with unsigned arithmetic. This is of particular note with loop counters: since it is common for loop counters to have values that are always positive, it may be tempting to declare the counters as unsigned. For slightly better performance, however, they should instead be declared as signed.

For example, consider the following code:

    for (i = 0; i < n; i++) {
        out[i] = in[offset + stride*i];
    }

Here, the sub-expression stride*i could overflow a 32-bit integer, so if i is declared as unsigned, the overflow semantics prevent the compiler from using some optimizations that might otherwise have applied, such as strength reduction. If instead i is declared as signed, where the overflow semantics are undefined, the compiler has more leeway to use these optimizations.

The first two sentences in particular confuse me. If the semantics of unsigned values are well defined and signed values can produce undefined results, how is it the compiler can produce better code for the latter?

3 Answers

#1

The text shows this example:

    for (i = 0; i < n; i++) {
        out[i] = in[offset + stride*i];
    }

It also mentions "strength reduction". The compiler is allowed to replace this with the following "pseudo-optimised-C" code:

    tmp = offset;
    for (i = 0; i < n; i++) {
        out[i] = in[tmp];
        tmp += stride;
    }
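
To make the transformation concrete, here is a minimal sketch (not from the guide) that puts both forms side by side and checks that they agree on a small input. The names gather_mul, gather_add and forms_agree are made up for illustration, and the values are chosen so that nothing overflows.

```c
/* Original form: one multiplication per iteration. */
static void gather_mul(int *out, const int *in, int n, int offset, int stride)
{
    for (int i = 0; i < n; i++) {
        out[i] = in[offset + stride * i];
    }
}

/* Strength-reduced form: the multiplication becomes a running addition. */
static void gather_add(int *out, const int *in, int n, int offset, int stride)
{
    int tmp = offset;
    for (int i = 0; i < n; i++) {
        out[i] = in[tmp];
        tmp += stride;
    }
}

/* Returns 1 when both forms produce the same output for a small,
   overflow-free input (hypothetical test values). */
static int forms_agree(void)
{
    int in[16], a[4], b[4];
    for (int i = 0; i < 16; i++) {
        in[i] = i * 10;
    }
    gather_mul(a, in, 4, 1, 3);   /* reads in[1], in[4], in[7], in[10] */
    gather_add(b, in, 4, 1, 3);
    for (int i = 0; i < 4; i++) {
        if (a[i] != b[i]) {
            return 0;
        }
    }
    return 1;
}
```

The two loops are only interchangeable as long as offset + stride*i never wraps, which is exactly the condition the rest of this answer is about.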

Now imagine a hypothetical processor whose only numeric type is a floating-point number wide enough to hold every integer exactly. On such a machine tmp would simply be of type "very large number", with no natural 32-bit wraparound.

Now, the C standard says that computations involving unsigned operands can never overflow, but instead are reduced modulo the largest value + 1. That means that in the case of unsigned i the compiler has to do this:

    tmp = offset;
    for (i = 0; i < n; i++) {
        out[i] = in[tmp];
        tmp += stride;
        if (tmp > UINT_MAX)
        {
            tmp -= UINT_MAX + 1;
        }
    }

But in the case of a signed integer the compiler can do whatever it wants. It doesn't need to check for overflow: if the value does overflow, that is the developer's problem (it could cause an exception, or produce erroneous values). So the code can be faster.
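
On real hardware the same constraint usually shows up as a wraparound the compiler must preserve, rather than as an explicit check. The sketch below is only an illustration under stated assumptions: 32-bit unsigned arithmetic, a 64-bit running index, and the made-up helper names index_wrapped and index_running. It shows that the index the standard mandates and the index a widened running sum would produce stop agreeing once stride*i reaches 2^32.

```c
#include <stdint.h>

/* Index as the C standard requires for 32-bit unsigned arithmetic:
   the product and the sum are reduced modulo 2^32.
   Assumption: int is no wider than 32 bits, so uint32_t is not promoted. */
static uint64_t index_wrapped(uint32_t offset, uint32_t stride, uint32_t i)
{
    return (uint32_t)(offset + stride * i);
}

/* Index that a strength-reduced loop with a 64-bit running sum would
   reach after i iterations: no wraparound at 2^32. */
static uint64_t index_running(uint32_t offset, uint32_t stride, uint32_t i)
{
    uint64_t tmp = offset;
    for (uint32_t k = 0; k < i; k++) {
        tmp += stride;
    }
    return tmp;
}
```

With stride = 2^30 the two agree for i = 3 but differ for i = 4, where the wrapped index is 0 and the running index is 2^32. With an unsigned counter the compiler has to reproduce the wrapped value; with a signed counter it may assume the overflow never happens.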

#2

It's because the definition of C limits what the compiler writer can do in the case of unsigned integers. There is more leeway in deciding what happens when signed integers overflow. The compiler writers have more room to move, so to speak.

That's the way I read it.

#3

The difference between the semantics of signed and unsigned becomes relevant for performance on processors which don't support all the word sizes defined by C. For instance, say you have a CPU that only supports 32-bit operations and has 32-bit registers, and you write a C function that uses both int (32-bit) and char (8-bit*):

    int test(char a) {
      char b = a * 100;
      return b;
    }

Since the CPU can only store char in 32-bit registers and can only perform arithmetic on 32-bit values, it will use a 32-bit register to hold b, and a 32-bit multiplication operation.

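Because the C standard leaves signed integer overflow undefined, the argument goes, it is fine for the compiler to create code for the above function that returns a value higher than 127 when a is higher than 1 (assuming char is signed). Strictly speaking, a is promoted to int before the multiplication, and converting an out-of-range int back to char is implementation-defined rather than undefined, so treat this as an illustration of the principle rather than a literal description of what compilers do.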

However, if unsigned values are used:

    unsigned int test(unsigned char a) {
      unsigned char b = a * 100;
      return b;
    }

The C standard defines the overflow semantics for unsigned operations, so the compiler has to add a masking operation to ensure that the function does not return values higher than 255, even when a is higher than 2.
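
As a rough sketch of what that masking amounts to, the code below compares the unsigned function with a hand-written "multiply at full width, then mask" version. It assumes an 8-bit char, and the names test_unsigned, test_masked and masks_agree are made up for illustration.

```c
/* Same function as above, renamed; assumes unsigned char is 8 bits wide. */
static unsigned int test_unsigned(unsigned char a)
{
    /* a is promoted to int, multiplied, then converted back modulo 256. */
    unsigned char b = a * 100;
    return b;
}

/* What a 32-bit-only machine has to do explicitly:
   a full-width multiplication followed by a mask of the low 8 bits. */
static unsigned int test_masked(unsigned char a)
{
    unsigned int wide = (unsigned int)a * 100u;
    return wide & 0xFFu;
}

/* Returns 1 when both versions agree for every input from 0 to 255. */
static int masks_agree(void)
{
    for (unsigned int a = 0; a <= 255u; a++) {
        if (test_unsigned((unsigned char)a) != test_masked((unsigned char)a)) {
            return 0;
        }
    }
    return 1;
}
```

For a = 3, for instance, the full product is 300 and both versions return 300 - 256 = 44.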

* The C specification allows char to be wider than 8 bits, but that would break many programs, so we assume a compiler that uses 8-bit values for char in this example.
