Why are my "special" Unicode characters encoded strangely by json_encode?

Time: 2022-04-15 00:27:55

When I use "special" Unicode characters, they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '馬']);
{"foo":"\u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

1 solution

#1


First of all: there is nothing wrong here. This is one way characters can be encoded in JSON, and it is part of the official standard. The syntax is based on how string literals can be formed in ECMAScript (section 7.8.4 "String Literals") and is described as follows:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

In short: any character can be encoded as \u...., where .... is the four-hex-digit Unicode code point of the character. A character outside the BMP is written as two such escapes, one for each half of its UTF-16 surrogate pair.
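To make the surrogate-pair case concrete, here is a minimal sketch. It assumes the script file is saved as UTF-8; the emoji U+1F600 is only an illustrative character outside the BMP and does not come from the original question.

```php
<?php
// U+99AC (馬) lies inside the BMP, so one \uXXXX escape is enough.
$bmp = json_encode('馬');

// U+1F600 (😀) lies outside the BMP, so it is written as a UTF-16
// surrogate pair: high surrogate D83D followed by low surrogate DE00.
$astral = json_encode('😀');

echo $bmp, "\n";    // "\u99ac"
echo $astral, "\n"; // "\ud83d\ude00"

// Decoding reassembles the pair into the original single character.
var_dump(json_decode($astral) === '😀'); // bool(true)
```

The two escapes in the second output are not two characters; a JSON parser combines them back into one code point.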

"馬"
"\u99ac"

These two string literals represent exactly the same character; they are absolutely equivalent. When a compliant JSON parser parses them, both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.
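The equivalence can be checked directly with json_decode. This is a small sketch, assuming a UTF-8 source file; the variable names are illustrative only.

```php
<?php
// Single quotes keep the backslash literal, so PHP passes the six
// characters \u99ac to the JSON parser untouched.
$literal = json_decode('"馬"');
$escaped = json_decode('"\u99ac"');

var_dump($literal === $escaped); // bool(true)
var_dump(bin2hex($escaped));     // string(6) "e9a6ac", the UTF-8 bytes of 馬
```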

By default, PHP's json_encode encodes non-ASCII characters using \u.... escape sequences. Technically it doesn't have to, but it does, and the result is perfectly valid. If you prefer literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference. The flag is not needed in any way to transport "Unicode characters" in JSON.
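As a rough illustration of why the flag is only cosmetic, the sketch below encodes the same array both ways and compares the decoded results. It assumes a UTF-8 source file; the ASCII check at the end is an extra observation, not something the original answer states.

```php
<?php
$data = ['foo' => '馬'];

$escaped   = json_encode($data);                         // {"foo":"\u99ac"}
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE); // {"foo":"馬"}

// The two documents differ as text...
var_dump($escaped === $unescaped); // bool(false)

// ...but decode to identical data.
var_dump(json_decode($escaped, true) === json_decode($unescaped, true)); // bool(true)

// The escaped form contains only ASCII bytes (no byte >= 0x80).
var_dump(preg_match('/[\x80-\xff]/', $escaped)); // int(0)
```

Either form can be sent to a consumer; which one to use is mostly a matter of readability and payload size.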
