The %s format specifier with unsigned chars greater than 127 in C

Time: 2022-07-07 21:45:00

I wrote the following example programs, but their output was not what I expected.
In my first program, s contains several characters, one of which is greater than 127 (0xe1). When I print s, the output is not what I expected.

#include <stdio.h>

int main()
{
    int i, len;

    unsigned char s[] = {0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e};

    for (i = 0; i < sizeof(s) / sizeof(unsigned char); i++) {
        printf("%c ", s[i]);
    }

    printf("\n%s\n", s);                                                                                                               
    return 0;
}

Guess what? The output was:

t a o b c d n 
taobn@

Then I made some minor changes to the first program. Here is my second program:

#include <stdio.h>

int main()
{
    int i, len;

    unsigned char s[] = {0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e};
    // Iteratively output was deleted here

    printf("%s\n", s);                                                                                                               
    return 0;
}

The output also astonished me:

taobn

To check whether this is a strange feature of glibc, I wrote a third program that bypasses glibc's I/O buffer and writes s directly to a file with the write system call.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int fd;
    unsigned char s[] = {0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e};

    /* O_CREAT requires a mode argument; 0644 is a conventional choice. */
    if((fd = open("./a.out", O_WRONLY | O_CREAT, 0644)) < 0) {
        printf("error open\n");
        return -1;
    }

    write(fd, s, sizeof(s));
    close(fd);

    return 0;
} 

The output was still:

[cobblau@baba test]$ cat a.out
taobn

Can anyone explain what is going on here?
Thanks.

3 Answers

#1


7  

Calling printf("\n%s\n", s) when s does not point to a null-terminated string yields undefined behavior. In simple terms, the last element of your array should be 0 (a.k.a. '\0').

%s tells printf to print the characters located at the memory address pointed to by the input argument, until a 0 character is encountered.

You are passing an array of characters that does not contain a 0 character, so printf will continue reading from memory past the end of the array until it encounters a 0 or performs an illegal memory access.
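
As a minimal sketch of the fix described above, the same bytes can be stored with an explicit terminating 0, so that %s and strlen stop after the nine data bytes. The names s_terminated, terminated_length and print_terminated are hypothetical and exist only for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* Same bytes as in the question, plus an explicit terminating 0. */
const unsigned char s_terminated[] = {
    0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e, 0x00
};

/* Hypothetical helper: length of the string as seen by strlen() and %s. */
size_t terminated_length(void)
{
    return strlen((const char *)s_terminated);
}

/* Hypothetical helper: %s now stops at the 0 byte instead of running on. */
void print_terminated(void)
{
    printf("%s\n", (const char *)s_terminated);
}
```

Calling print_terminated() from main is well defined; how the 0xe1 byte is displayed still depends on the terminal.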


Here is how you could end up printing "taobn@":

Your array of characters is:

unsigned char s[] = {0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e};

Suppose that the bytes located immediately after this array in memory are:

0x08, 0x08, 0x08, 0x08, 0x08, 0x6e, 0x40, 0x20, 0x20, 0x20, 0x08, 0x08, 0x08, 0x00

So in essence, printf will attempt to print the following null-terminated string:

unsigned char s[] = {0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e,
                     0x08, 0x08, 0x08, 0x08, 0x08, 0x6e, 0x40, 0x20, 0x20,
                     0x20, 0x08, 0x08, 0x08, 0x00};

Now, try to call printf("%s", s) and see what you get...
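
To see why those supposed bytes would display as "taobn@", here is a rough sketch that replays them through a greatly simplified terminal model. The model is an assumption made for illustration: every byte except backspace (0x08) fills exactly one cell, and backspace only moves the cursor left. Real terminals may handle 0xe1 differently depending on their encoding. The names render_line and supposed_memory are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified terminal model: each byte other than
 * backspace (0x08) overwrites the cell under the cursor and advances it;
 * backspace moves the cursor one cell to the left without erasing.
 * Returns the number of cells that were ever written.
 */
size_t render_line(const unsigned char *bytes, unsigned char *cells, size_t cap)
{
    size_t cursor = 0, used = 0;

    for (size_t i = 0; bytes[i] != 0; i++) {
        if (bytes[i] == 0x08) {
            if (cursor > 0)
                cursor--;
        } else if (cursor < cap) {
            cells[cursor++] = bytes[i];
            if (cursor > used)
                used = cursor;
        }
    }
    return used;
}

/* The 9 bytes of s followed by the 14 bytes the answer supposes come next. */
const unsigned char supposed_memory[] = {
    0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e,
    0x08, 0x08, 0x08, 0x08, 0x08, 0x6e, 0x40, 0x20, 0x20,
    0x20, 0x08, 0x08, 0x08, 0x00
};
```

Under this model, the five backspaces move the cursor back onto the 0xe1 cell, "n@" and three spaces overwrite the rest, and the visible line becomes "taobn@" followed by blanks.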

#2


5  

In addition to the problem that your string is currently not null-terminated (which can lead to undefined behaviour), as others noted, the output of characters with codes above 127 depends on the current console charset.

On one side there are single-byte character sets such as ISO-8859-1 (a.k.a. Latin-1), its slight variation Windows-1252, CP850, or CP437, each with its own representation of high characters but where one byte is always one character. On the other side there are multi-byte character sets such as UTF-8.

As an example, the string éè is represented by { 0xe9, 0xe8, 0 } in ISO-8859-1, { 0x82, 0x8a, 0 } in CP850, and { 0xc3, 0xa9, 0xc3, 0xa8, 0 } in UTF-8.

When you try to print a character whose code is unknown to the console, you may get a ?, a square, or nothing, depending on the system.
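
Because the console decides how bytes above 127 are shown, one way to check what a program really emits is to look at the byte values instead of the rendered text. The sketch below formats a buffer as hexadecimal. The names hex_dump and question_bytes are hypothetical and used only for this illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* The nine bytes from the question (no terminating 0). */
const unsigned char question_bytes[] = {
    0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e
};

/*
 * Hypothetical helper: writes len bytes of data into out as
 * space-separated two-digit hex values. Returns the number of characters
 * written (excluding the '\0'), or 0 if out is too small, in which case
 * out may hold a truncated but still terminated dump.
 */
size_t hex_dump(const unsigned char *data, size_t len, char *out, size_t cap)
{
    size_t pos = 0;

    if (cap == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        size_t need = (i > 0) ? 3 : 2;   /* optional separator + 2 hex digits */
        if (pos + need + 1 > cap)        /* +1 keeps room for the final '\0' */
            return 0;
        if (i > 0)
            out[pos++] = ' ';
        snprintf(out + pos, cap - pos, "%02x", (unsigned)data[i]);
        pos += 2;
    }
    return pos;
}
```

Dumping question_bytes this way shows the e1 byte regardless of which charset the console uses.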

#3


1  

Printing individual characters is different from printing a char array that is not terminated by a null character:

unsigned char s[] = { 0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e };
printf("\n%s\n", s); // Wrong, undefined behavior

Alternatively, you could provide the size yourself:

printf("\n%.*s\n", (int)sizeof(s), s);

From printf()'s documentation:

.number

For s: this is the maximum number of characters to be printed. By default all characters are printed until the ending null character is encountered.
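
The precision rule quoted above can be checked without a terminal by formatting into a buffer. This is a small sketch that assumes a hypothetical helper named format_bounded. It uses snprintf with "%.*s" so that at most len bytes are read from an array that has no terminating 0.

```c
#include <stdio.h>

/* The nine bytes from the question (no terminating 0). */
const unsigned char question_bytes[] = {
    0x74, 0x61, 0x6f, 0x62, 0xe1, 0x6f, 0x63, 0x64, 0x6e
};

/*
 * Hypothetical helper: formats at most len bytes of data into out.
 * The precision given through ".*" bounds how far %s reads, so data
 * does not need to be null-terminated. Returns snprintf's result.
 */
int format_bounded(const unsigned char *data, size_t len, char *out, size_t cap)
{
    return snprintf(out, cap, "%.*s", (int)len, (const char *)data);
}
```

With question_bytes and sizeof question_bytes as arguments, exactly nine bytes are copied and out is null-terminated by snprintf.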
