
Date: 2021-12-15 09:39:42

Original article: Why size_t matters


Numerous functions in the Standard C library accept arguments or return values that represent object sizes in bytes. For example, the lone argument in malloc(n) specifies the size of the object to be allocated, and the last argument in memcpy(s1, s2, n) specifies the size of the object to be copied. The return value of strlen(s) yields the length of (the number of characters in) null-terminated character array s excluding the null character, which isn't exactly the size of s, but it's in the ballpark.

You might reasonably expect these parameters and return types that represent sizes to be declared with type int (possibly long and/or unsigned), but they aren't. Rather, the C standard declares them as type size_t. According to the standard, the declaration for malloc should appear in <stdlib.h> as something equivalent to:


void *malloc(size_t n);

and the declarations for memcpy and strlen should appear in <string.h> looking much like:


void *memcpy(void *s1, void const *s2, size_t n);
size_t strlen(char const *s);
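As a minimal sketch of these three declarations working together, the helper below duplicates a string. The name copy_string is hypothetical and does not appear in the original column; it only illustrates that the value returned by strlen and the sizes passed to malloc and memcpy all share the type size_t.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not from the article): returns a heap-allocated
   copy of the null-terminated string s, or NULL if allocation fails.
   The caller is responsible for passing the result to free(). */
char *copy_string(char const *s)
{
    size_t len = strlen(s);      /* characters before the null character */
    size_t size = len + 1;       /* bytes needed, including the null character */
    char *copy = malloc(size);   /* malloc's argument is a size in bytes */

    if (copy != NULL)
        memcpy(copy, s, size);   /* memcpy's last argument is a size in bytes */
    return copy;
}
```

Because every size here is a size_t, no conversions between integer types are needed along the way.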

The type size_t also appears throughout the C++ standard library. In addition, the C++ library uses a related symbol size_type, possibly even more than it uses size_t.


In my experience, most C and C++ programmers are aware that the standard libraries use size_t, but they really don't know what size_t represents or why the libraries use size_t as they do. Moreover, they don't know if and when they should use size_t themselves.


In this column, I'll explain what size_t is, why it exists, and how you should use it in your code.



Classic C (the early dialect of C described by Brian Kernighan and Dennis Ritchie in The C Programming Language, Prentice-Hall, 1978) didn't provide size_t. The C standards committee introduced size_t to eliminate a portability problem, illustrated by the following example.

Let's examine the problem of writing a portable declaration for the standard memcpy function. We'll look at a few different declarations and see how well they work when compiled for different architectures with different-sized address spaces and data paths.


Recall that calling memcpy(s1, s2, n) copies the first n bytes from the object pointed to by s2 to the object pointed to by s1, and returns s1. The function can copy objects of any type, so the pointer parameters and return type should be declared as "pointer to void." Moreover, memcpy doesn't modify the source object, so the second parameter should really be "pointer to const void." None of this poses a problem.

The real concern is how to declare the function's third parameter, which represents the size of the source object. I suspect many programmers would choose plain int, as in:


void *memcpy(void *s1, void const *s2, int n);

which works fine most of the time, but it's not as general as it could be. Plain int is signed--it can represent negative values. However, sizes are never negative. Using unsigned int instead of int as the type of the third parameter lets memcpy copy larger objects, at no additional cost.

On most machines, the largest unsigned int value is roughly twice the largest positive int value. For example, on a 16-bit twos-complement machine, the largest unsigned int value is 65,535 and the largest positive int value is 32,767. Using an unsigned int as memcpy's third parameter lets you copy objects roughly twice as big as when using int.

Although the size of an int varies among C implementations, on any given implementation int objects are always the same size as unsigned int objects. Thus, passing an unsigned int argument is always the same cost as passing an int.
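The two claims above (same size, roughly twice the range) can be checked on whatever implementation you happen to be using. The following is only a sketch; the function names same_size and range_ratio are hypothetical, and the ratio of 2 is what a typical twos-complement implementation yields rather than something this code guarantees.

```c
#include <limits.h>

/* Hypothetical helpers (not from the article) that compare int and
   unsigned int on the current implementation. */

/* Returns 1 if int and unsigned int occupy the same number of bytes. */
int same_size(void)
{
    return sizeof(int) == sizeof(unsigned int);
}

/* Returns how many times larger UINT_MAX is than INT_MAX, rounded down.
   On typical twos-complement implementations this is 2, because
   UINT_MAX equals 2 * INT_MAX + 1. */
unsigned int range_ratio(void)
{
    return UINT_MAX / (unsigned int)INT_MAX;
}
```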

Using unsigned int as the parameter type, as in:

void *memcpy(void *s1, void const *s2, unsigned int n);

works just dandy on any platform in which an unsigned int can represent the size of the largest data object. This is generally the case on any platform in which integers and pointers have the same size, such as IP16, in which both integers and pointers occupy 16 bits, or IP32, in which both occupy 32 bits. (See the sidebar on C data model notation.)

C data model notation

Of late, I've run across several articles that employ a compact notation for describing the C language data representation on different target platforms. I have yet to find the origins of this notation, a formal syntax, or even a name for it, but it appears to be simple enough to be usable without a formal definition. The general form of the notation appears to be:
I nI L nL LL nLL P nP

where each capital letter (or pair thereof) represents a C data type, and each corresponding n is the number of bits that the type occupies. I stands for int, L stands for long, LL stands for long long, and P stands for pointer (to data, not pointer to function). Each letter and number is optional.

For example, an I16P32 architecture supports 16-bit int and 32-bit pointers, without describing whether it supports long or long long. If two consecutive types have the same size, you typically omit the first number. For example, you typically write I16L32P32 as I16LP32, which is an architecture that supports 16-bit int, 32-bit long, and 32-bit pointers.

The notation typically arranges the letters so their corresponding numbers appear in ascending order. For example, IL32LL64P32 denotes an architecture with 32-bit int, 32-bit long, 64-bit long long, and 32-bit pointers; however, it appears more commonly as ILP32LL64.

Unfortunately, this declaration for memcpy comes up short on an I16LP32 processor (16 bits for int and 32 bits for long and pointers), such as the first-generation Motorola 68000. In this case, the processor can copy objects larger than 65,535 bytes, but this memcpy can't because parameter n can't handle values that large.
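To see what "can't handle values that large" means in practice, the sketch below models a 16-bit unsigned int with uint16_t from <stdint.h>. This is an assumption made so the effect is visible on machines where unsigned int is wider than 16 bits; the function name count_after_16bit_conversion is hypothetical and is not part of any library.

```c
#include <stdint.h>

/* Hypothetical model (not from the article) of what happens to a byte
   count when it is passed through a 16-bit unsigned parameter.
   uint16_t stands in for unsigned int on an I16LP32 target. */
unsigned long count_after_16bit_conversion(unsigned long requested)
{
    /* Conversion to an unsigned type reduces the value modulo 65,536. */
    uint16_t n = (uint16_t)requested;
    return n;
}
```

Under this model, a request to copy 70,000 bytes arrives as 4,464 (70,000 minus 65,536), so the copy would silently stop short rather than fail loudly.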


Easy to fix, you say? Just change the type of memcpy's third parameter:


void *memcpy(void *s1, void const *s2, unsigned long n);

You can use this declaration to write a memcpy for an I16LP32 target, and it will be able to copy large objects. It will also work on IP16 and IP32 platforms, so it does provide a portable declaration for memcpy. Unfortunately, on an IP16 platform, the machine code you get from using unsigned long here is almost certainly a little less efficient (the code is both bigger and slower) than what you get from using an unsigned int.

In Standard C, a long (whether signed or unsigned) must occupy at least 32 bits. Thus, an IP16 platform that supports Standard C really must be an IP16L32 platform. Such platforms typically implement each 32-bit long as a pair of 16-bit words. In that case, moving a 32-bit long usually requires two machine instructions, one to move each 16-bit chunk. In fact, almost all 32-bit operations on these platforms require at least two instructions, if not more.
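The cost of handling a 32-bit value as a pair of 16-bit words can be sketched in C itself. The struct word_pair and the function add32 below are hypothetical illustrations, not code from the article or from any compiler's runtime; they simply show that one 32-bit addition turns into two 16-bit additions plus carry handling.

```c
#include <stdint.h>

/* Hypothetical model of a 32-bit value stored as two 16-bit words. */
struct word_pair {
    uint16_t low;    /* least significant 16 bits */
    uint16_t high;   /* most significant 16 bits */
};

/* Adds two 32-bit values one 16-bit word at a time. */
struct word_pair add32(struct word_pair a, struct word_pair b)
{
    struct word_pair sum;
    uint16_t carry;

    sum.low = (uint16_t)(a.low + b.low);             /* first 16-bit add */
    carry = sum.low < a.low;                         /* 1 if the low words wrapped */
    sum.high = (uint16_t)(a.high + b.high + carry);  /* second 16-bit add, plus carry */
    return sum;
}
```

A real 16-bit processor would usually do this with an add followed by an add-with-carry instruction, which is the "at least two instructions" mentioned above.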


Thus, declaring memcpy's third parameter as an unsigned long in the name of portability exacts a performance toll on some platforms, something we'd like to avoid. Using size_t avoids that toll.

Type size_t is a typedef that's an alias for some unsigned integer type, typically unsigned int or unsigned long, but possibly even unsigned long long. Each Standard C implementation is supposed to choose the unsigned integer that's big enough--but no bigger than needed--to represent the size of the largest possible object on the target platform.
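A few properties of size_t can be inspected directly. The sketch below assumes a C99 or later implementation, since SIZE_MAX comes from <stdint.h>; the three function names are hypothetical and exist only for illustration.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* SIZE_MAX (C99 and later) */

/* Hypothetical helpers (not from the article) that report properties
   of size_t on the current implementation. */

/* Largest value a size_t can hold here. */
size_t largest_size(void)
{
    return SIZE_MAX;
}

/* Returns 1 if size_t is unsigned: converting -1 to an unsigned type
   yields that type's largest value, which is greater than zero. */
int size_t_is_unsigned(void)
{
    return (size_t)-1 > 0;
}

/* Number of bits of storage a size_t occupies here. */
size_t size_t_bits(void)
{
    return sizeof(size_t) * CHAR_BIT;
}
```

On an ILP32 platform size_t_bits() would typically report 32, and on a platform with 64-bit pointers it would typically report 64, but the exact choice belongs to the implementation.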


The definition for size_t appears in several Standard C headers, namely, <stddef.h>, <stdio.h>, <stdlib.h>, <string.h>, <time.h>, and <wchar.h>. It also appears in the corresponding C++ headers, <cstddef>, <cstdio>, and so on. You should include at least one of these headers in your code before referring to size_t.

Including any of the C headers (in a program compiled as either C or C++) declares size_t as a global name. Including any of the C++ headers (something you can do only in C++) defines size_t as a member of namespace std.


By definition, size_t is the result type of the sizeof operator. Thus, the appropriate way to declare n to make the assignment:


n = sizeof(thing);

both portable and efficient is to declare n with type size_t. Similarly, the appropriate way to declare a function foo to make the call:


foo(sizeof(thing));

both portable and efficient is to declare foo's parameter with type size_t. Functions with parameters of type size_t often have local variables that count up to or down from that size and index into arrays, and size_t is often a good type for those variables.
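Here is a minimal sketch of functions whose size parameter and loop variables are all size_t. The names count_matches and find_last are hypothetical. The second function counts down, which needs some care with an unsigned type: a loop condition such as i >= 0 is always true for a size_t, so the decrement happens inside the loop after testing i > 0.

```c
#include <stddef.h>

/* Hypothetical examples (not from the article) of size_t used for a
   size parameter and for the local variables that index an array. */

/* Counts up: returns how many of the n bytes at p equal value. */
size_t count_matches(unsigned char const *p, size_t n, unsigned char value)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        if (p[i] == value)
            ++count;
    return count;
}

/* Counts down: returns the index of the last byte equal to value,
   or n if no byte matches. */
size_t find_last(unsigned char const *p, size_t n, unsigned char value)
{
    size_t i = n;

    while (i > 0) {
        --i;                 /* safe: i was greater than zero */
        if (p[i] == value)
            return i;
    }
    return n;
}
```

Given an array declared as unsigned char buf[] = {1, 2, 3, 2, 5}, a call such as find_last(buf, sizeof(buf), 2) passes the result of sizeof straight through with no conversion.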


Using size_t appropriately makes your source code a little more self-documenting. When you see an object declared as a size_t, you immediately know it represents a size in bytes or an index, rather than an error code or a general arithmetic value.


Expect to see me using size_t in other examples in upcoming columns.
