Why are C character literals ints instead of chars?

In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal, and sizeof(char) == 1 as defined by the standard.

In C however, sizeof('a') == sizeof(int). That is, it appears that C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation for why it exists.

11 answers

#1

discussion on same subject

"More specifically the integral promotions. In K&R C it was virtually (?) impossible to use a character value without it being promoted to int first, so making character constant int in the first place eliminated that step. There were and still are multi character constants such as 'abcd' or however many will fit in an int."

#2

I don't know the specific reasons why a character literal in C is of type int. But in C++, there is a good reason not to go that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect that the call to print selects the second version taking a char. Having a character literal being an int would make that impossible. Note that in C++ literals having more than one character still have type int, although their value is implementation defined. So, 'ab' has type int, while 'a' has type char.
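
C has no overloading, but C11's _Generic gives a rough way to see the same effect from the C side. The following is only a sketch: print_int, print_char and the branch enum are hypothetical names made up for illustration, and it assumes a C11 compiler. In C the literal 'a' selects the int branch, while a char variable selects the char branch.

```c
/* Tags naming which branch was selected (hypothetical, for illustration). */
enum branch { BRANCH_INT, BRANCH_CHAR };

/* Hypothetical handlers; each just reports which one was called. */
static enum branch print_int(int value)   { (void)value; return BRANCH_INT; }
static enum branch print_char(char value) { (void)value; return BRANCH_CHAR; }

/* C11 generic selection: picks a handler from the static type of x. */
#define print(x) _Generic((x), char: print_char, int: print_int)(x)

/* In C, 'a' has type int, so this goes to print_int. */
enum branch dispatch_literal(void) {
    return print('a');
}

/* A char object keeps type char here, so this goes to print_char. */
enum branch dispatch_variable(void) {
    char c = 'a';
    return print(c);
}
```

Compiled as C, dispatch_literal() reports the int branch, which is exactly the outcome the C++ rule was designed to avoid for overloaded functions.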

#3

The original question is "why?"

The reason is that the definition of a literal character has evolved and changed, while trying to remain backwards compatible with existing code.

In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.

This meant that when you were writing the function, all the parameters that weren't double were stored on the stack as ints, no matter how you declared them, and the compiler put code in the function to handle this for you.

This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just a function parameter.

When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since this seemed a simpler way of achieving the same thing.

When C++ was being designed, all functions were required to have full prototypes (this is still not required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The advantage of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. That advantage does not apply in C, which has no function overloading.

This is why they are different. Evolution...

#4

Using gcc on my MacBook, I tried:

#include <stdio.h>
/* sizeof yields a size_t, so print it with %zu rather than %i */
#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0)
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

which when run gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

which suggests that a char is one byte (8 bits here), as you suspected, but a character literal is an int.

#5

Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      ; 8-bit character encoding for 'A' into 16 bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     ; 16-bit character encoding for 'A' (low byte) and 'B'

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.

Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
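
To make the "0 'A' or 'A' 0" point concrete, here is a small C sketch (not PDP-11 code) that stores 'A' in a 16-bit word and looks at the two bytes in memory order. The function names are made up for illustration, and it assumes uint16_t exists and that bytes are 8 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Store 'A' in a 16-bit word and copy out its two bytes in memory order. */
void char_in_word(unsigned char bytes[2]) {
    uint16_t word = 'A';
    memcpy(bytes, &word, sizeof word);
}

/* Returns 1 when the character code sits in the lower-addressed byte
   (little-endian layout), 0 when it sits in the higher-addressed byte. */
int char_in_low_address(void) {
    unsigned char bytes[2];
    char_in_word(bytes);
    return bytes[0] == 'A';
}
```

Whichever byte holds 'A', the other one is zero, so the whole word can be loaded into a register with no masking or shifting afterwards.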

My guess is that C simply carried this level of CPU-centric behaviour over, treating character constants as occupying a register-sized unit of memory, which bears out the common assessment of C as a "high level assembler".

(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)

#6

I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

int r;
char buffer[1024], *p;
p = buffer;

// stop when the buffer is full; r stays an int so that EOF remains distinguishable
while (p < buffer + sizeof buffer && (r = getc(file)) != EOF)
{
  *(p++) = (char) r;
}
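
As a rough illustration of why the return value has to go into an int first, the sketch below compares a 0xFF byte against EOF in both representations. The function names are hypothetical. The second result is implementation-dependent: it assumes nothing about whether plain char is signed.

```c
#include <stdio.h>

/* getc() returns the byte as an unsigned char converted to int,
   so a 0xFF byte arrives as 255, which can never equal the negative EOF. */
int byte_ff_equals_eof_as_int(void) {
    int r = 0xFF;
    return r == EOF;
}

/* Storing the same byte in a char first may lose that distinction:
   where plain char is signed and EOF is -1, this returns 1. */
int byte_ff_equals_eof_as_char(void) {
    char c = (char)0xFF;
    return c == EOF;
}
```

The int version is always 0. The char version depends on the platform, which is the reason the loop above tests r before converting it.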

#7

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it should cause no problems.

#8

This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start out with type int. It is initially of type char, but when it is used in an expression it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:

Every char in an expression is converted into an int....Notice that all float's in an expression are converted to double....Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.
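
The promotion of char objects in expressions can be observed with sizeof. This is a minimal sketch with made-up function names. It relies only on the integer promotions, so it should behave the same on any conforming compiler.

```c
#include <stddef.h>

/* Size of a plain char object: always 1 by definition. */
size_t size_of_char_object(void) {
    char c = 'a';
    return sizeof c;
}

/* Both operands are promoted before the addition,
   so the sum has the size of an int, not of a char. */
size_t size_of_char_sum(void) {
    char c = 'a';
    return sizeof(c + c);
}

/* Unary plus alone is enough to trigger the promotion. */
size_t size_of_promoted_char(void) {
    char c = 'a';
    return sizeof(+c);
}
```

The first function returns 1 and the other two return sizeof(int), so a char object only keeps its narrow type until it takes part in arithmetic.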

#9

I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++, where the type can determine which function gets called, that it needed to be fixed.

#10

I didn't know this indeed. Before prototypes existed, anything narrower than an int was converted to an int when using it as a function argument. That may be part of the explanation.
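
The same conversions still apply today to the variable part of a variadic call, which makes them easy to observe. A small sketch follows. The two function names are hypothetical, and the count parameter is there only because va_start needs a named argument.

```c
#include <stdarg.h>

/* Variadic arguments undergo the default argument promotions, the same
   conversions that applied to every argument before prototypes existed. */
int first_vararg_as_int(int count, ...) {
    va_list ap;
    va_start(ap, count);
    int value = va_arg(ap, int);        /* a char argument arrives as int */
    va_end(ap);
    return value;
}

double first_vararg_as_double(int count, ...) {
    va_list ap;
    va_start(ap, count);
    double value = va_arg(ap, double);  /* a float argument arrives as double */
    va_end(ap);
    return value;
}
```

Passing a char to the first function and a float to the second works because the caller has already widened them to int and double. Asking va_arg for char or float instead would be undefined behaviour.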

#11

This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
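
The 254 + 2 case can be written out in C as a quick sketch. The function names are made up, and it assumes 8-bit bytes, so unsigned char holds 0 to 255.

```c
/* The addition itself happens in int, so no wraparound occurs here. */
int sum_as_int(void) {
    unsigned char x = 254;
    return x + 2;
}

/* Storing the result back into an unsigned char reduces it modulo
   UCHAR_MAX + 1, which is 256 under the 8-bit assumption. */
unsigned char sum_stored_in_char(void) {
    unsigned char x = 254;
    x = (unsigned char)(x + 2);
    return x;
}
```

The first function returns 256 and the second returns 0. The wraparound only shows up at the point where the value is narrowed back to unsigned char.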

It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there is, e.g., a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend the sign, like an add/extsh pair on the PowerPC.)
