What is this Unicode representation called?

Date: 2022-09-06 21:20:03

I've been going around in circles on this problem where the JSON UTF-8 strings returned from a server contain unicode pairs like this:

\u00c3\u00bc

which is being rendered as two individual characters. However, it should be rendered as a single character. According to a table I found at this link, here are some more examples:

0xc3,0xa0 agrave
0xc3,0xa1 aacute
0xc3,0xa2 acircumflex
0xc3,0xa3 atilde
0xc3,0xa4 adiaeresis
0xc3,0xa5 aring
0xc3,0xa6 ae
0xc3,0xa7 ccedilla
0xc3,0xa8 egrave
0xc3,0xa9 eacute
0xc3,0xaa ecircumflex
0xc3,0xab ediaeresis
0xc3,0xac igrave
0xc3,0xad iacute
0xc3,0xae icircumflex
0xc3,0xaf idiaeresis
0xc3,0xb0 eth
0xc3,0xb1 ntilde
0xc3,0xb2 ograve
0xc3,0xb3 oacute

(Every case where I see this in my data would convert to an appropriate single character.)

Many of these apparently are 'aliases' of singlet forms like '\uxxxx', but I receive them this way as doublets. The raw data bytes show that this is actually how it is transmitted from the server.

(Once I have received them in UTF-8, there is no reason for me to keep them that way in local representation in memory.)

I don't know what to call this, so I'm having difficulty finding much information on it and I'm not able to communicate clearly on the subject. I would like to know why it's used and where I can find code that will convert it to something that my UIWebView can render correctly, but knowing what it's called is the point of my question.

My question, then, is: what is this doublet or paired form called?

(If it's helpful, I am working in Objective-C and CocoaTouch.)

2 answers

#1


4  

The notation '\u00c3\u00bc' denotes a two-character sequence “Ã¼”, using the normal JavaScript escape notation: within a string literal, '\uhhhh' stands for the character (or, technically, Unicode code unit) with Unicode number hhhh in hexadecimal.

This is a virtually certain sign of a character data conversion error. Such errors occur frequently when UTF-8 encoded data is misinterpreted as ISO-8859-1 encoded (or as some other 8-bit encoding).
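
As an illustration only, here is a minimal Python sketch of how such a doublet can arise. It assumes the upstream mistake is a plain ISO-8859-1 decode followed by a JSON encoder that escapes non-ASCII characters; the actual server pipeline is unknown, and the variable names are made up for the example.

```python
import json

original = "\u00fc"                        # U+00FC, the character that was meant
utf8_bytes = original.encode("utf-8")      # the two bytes c3 bc

# The suspected mistake: the UTF-8 bytes are decoded with the wrong charset,
# so each byte becomes its own character (U+00C3 and U+00BC).
misread = utf8_bytes.decode("iso-8859-1")

# A JSON encoder that escapes non-ASCII output then emits the doublet.
payload = json.dumps(misread)

print(utf8_bytes.hex())   # c3bc
print(len(misread))       # 2
print(payload)            # "\u00c3\u00bc"
```

The point of the sketch is that the doublet is not a special notation: it is simply two ordinary escapes for two wrong characters.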

The real, uncorrupted data probably contains u with umlaut, ü, U+00FC, whose UTF-8 encoding consists of the bytes c3 and bc; see http://www.fileformat.info/info/unicode/char/fc/index.htm

The document you are referring to, http://cpansearch.perl.org/src/JANPAZ/Cstools-3.42/Cz/Cstocs/enc/utf8.enc, appears to show UTF-8 encoded representations of characters, presented in text format by displaying the bytes as hexadecimal numbers.
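
To make that reading of the table concrete, the following Python sketch decodes a few of the byte pairs by hand. The helper name `decode_two_byte_utf8` is hypothetical, the rows are a small selection from the question's table, and the last pair (0xc3,0xbc) is the question's own example rather than a row from the quoted excerpt.

```python
def decode_two_byte_utf8(b0, b1):
    """Decode a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a code point."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a two-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation byte.
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)

table_rows = [(0xC3, 0xA0), (0xC3, 0xA9), (0xC3, 0xB1), (0xC3, 0xBC)]
for b0, b1 in table_rows:
    code_point = decode_two_byte_utf8(b0, b1)
    # Cross-check the bit arithmetic against the built-in decoder.
    assert chr(code_point) == bytes([b0, b1]).decode("utf-8")
    print(f"0x{b0:02x},0x{b1:02x} -> U+{code_point:04X} {chr(code_point)}")
```

Each pair collapses to a single code point (U+00E0, U+00E9, U+00F1, U+00FC), which is consistent with the table listing UTF-8 byte sequences rather than pairs of characters.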

#2


5  

\u00c3\u00bc

which is being rendered as two individual characters.

That does explicitly mean the two characters Ã¼. If you expected to see ü, then what you have is incorrect processing further upstream, either in the JSON generator or in the input fed into it. Someone has decoded a series of bytes as ISO-8859-1 where they should have used UTF-8.

You can work around the problem by reading the JSON, encoding to ISO-8859-1, then decoding to UTF-8. But this will mangle any actual correct input, and it's impossible to tell from the example whether the ‘wrong’ charset is actually ISO-8859-1 or Windows code page 1252. Could be either.
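
A rough Python sketch of that workaround follows. The function name `repair_mojibake` and its `wrong_charset` parameter are hypothetical, and the fallback of returning the text unchanged when the round trip fails is an assumption about what a caller would want, not part of the advice above.

```python
import json

def repair_mojibake(text, wrong_charset="iso-8859-1"):
    """Re-encode with the suspected wrong charset, then decode as UTF-8.

    Returns the text unchanged when the round trip is impossible.
    """
    try:
        return text.encode(wrong_charset).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

damaged = json.loads('"\\u00c3\\u00bc"')   # the two characters U+00C3, U+00BC
print(repair_mojibake(damaged))            # ü

# A euro sign damaged via Windows-1252 contains U+201A, which ISO-8859-1
# cannot encode, so only the cp1252 guess repairs it.
damaged_euro = "\u00e2\u201a\u00ac"
print(repair_mojibake(damaged_euro) == damaged_euro)   # True
print(repair_mojibake(damaged_euro, "cp1252"))         # €
```

The fallback only catches strings that cannot possibly be this kind of damage. A correct string that happens to form a valid UTF-8 byte sequence, such as a genuine “Ã¼”, would still be rewritten, which is the mangling risk described above.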

You really need to fix the source of the problem rather than trying to work around it, though. Is it your server generating the JSON? Where does the data come from? Using \u00c3\u00bc to mean ü is explicitly incorrect.
