The U+001A character appears frequently in error messages relating to character encoding. What is the U+001A character?
3 Answers
#1
20
U+001A is defined in the Unicode Standard as a control character with the name SUBSTITUTE, and it belongs to a group characterized as follows, in chapter 16 of the standard: “There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes defined in the ISO/IEC 2022 framework [...] The Unicode Standard provides for the intact interchange of these code points, neither adding to nor subtracting from their semantics. The semantics of the control codes are generally determined by the application with which they are used. However, in the absence of specific application uses, they may be interpreted according to the control function semantics specified in ISO/IEC 6429:1992.”
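As a quick sanity check of the classification quoted above, here is a minimal sketch using Python's standard `unicodedata` module (the choice of Python is an assumption for illustration, not something the question specifies). It looks up the general category of U+001A and counts every code point in the control category, which should correspond to the 65 code points mentioned in the quotation.

```python
import sys
import unicodedata

SUB = "\u001a"

# "Cc" is the Unicode general category for control characters.
sub_category = unicodedata.category(SUB)

# Count all code points whose general category is Cc:
# C0 (U+0000..U+001F), DELETE (U+007F), and C1 (U+0080..U+009F).
control_count = sum(
    1
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
)

print(sub_category, control_count)
```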
ISO 6429 is effectively equivalent to ECMA-48, which also gives this code the short name SUB and defines it as follows: “SUB is used in the place of a character that has been found to be invalid or in error. SUB is intended to be introduced by automatic means.” This reflects the definition of this control code in ASCII.
Thus, in general, U+001A may be used to indicate a character-level data error, such as the presence of bytes, in purported character data, that have no interpretation in the character encoding being applied. Loosely speaking, it would thus mean “bad character data”, but more appropriately “malformed data, when trying to interpret data as characters”. However, in Unicode, U+FFFD REPLACEMENT CHARACTER is more appropriate, as it has specific Unicode semantics.
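To make the contrast concrete, here is a small Python sketch (Python is my choice for illustration). The built-in `errors="replace"` handler marks undecodable bytes with U+FFFD. The `decode_with_sub` helper is hypothetical and only imitates what a legacy-style converter that emits U+001A might do; it is not taken from any particular library.

```python
SUB = "\u001a"          # U+001A SUBSTITUTE
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# The byte 0xFF can never appear in well-formed UTF-8.
raw = b"abc\xffdef"

# Python's standard "replace" error handler inserts U+FFFD when decoding.
decoded = raw.decode("utf-8", errors="replace")


def decode_with_sub(data: bytes) -> str:
    """Hypothetical legacy-style variant that marks bad bytes with U+001A.

    Caveat: this also rewrites any genuine U+FFFD already present in the
    input, so it is only a sketch of the idea.
    """
    return data.decode("utf-8", errors="replace").replace(REPLACEMENT, SUB)


print(repr(decoded))
print(repr(decode_with_sub(raw)))
```

If an error message shows U+001A after a conversion step, a substitution of this kind somewhere in the pipeline is one plausible explanation.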
Since the question has been tagged with “xml”, it needs to be noted that in XML 1.0, U+001A is forbidden, by clause 2.2 Characters. Note that the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF” is misleading (but comments are non-normative); U+001A is a Unicode character, though it is not a graphic character and its effect is not defined in the Unicode Standard.
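The XML 1.0 restriction can be checked with a short Python sketch. `is_xml10_char` is a helper name I made up; it transcribes the `Char` production from clause 2.2. `parses` simply asks the standard library's `xml.etree.ElementTree` parser whether a document is accepted. Both the literal character and the character reference `&#x1A;` should be rejected.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 Char production (clause 2.2)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(xml_text: str) -> bool:
    """True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_xml10_char("\x1a"))      # U+001A against the Char production
print(parses("<a>\x1a</a>"))      # literal U+001A in content
print(parses("<a>&#x1A;</a>"))    # U+001A as a character reference
```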
#2
11
That's the Ctrl+Z control code. It's kinda special in Windows, which inherited it from DOS, which in turn inherited it from CP/M. Its legacy use was as an end-of-text marker, similar to how Ctrl+D is used in Unix.
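A rough Python sketch of that legacy convention, for illustration only: a control code is the letter's code with the upper bits masked off, and a reader that honours the old end-of-text marker stops at the first 0x1A byte. `strip_legacy_eof` is a hypothetical helper, not an existing API.

```python
# Ctrl+<letter> maps to the letter's code with only the low five bits kept.
CTRL_Z_CODE = ord("Z") & 0x1F  # 0x1A, SUBSTITUTE
CTRL_D_CODE = ord("D") & 0x1F  # 0x04, the Unix counterpart

CTRL_Z = bytes([CTRL_Z_CODE])


def strip_legacy_eof(data: bytes) -> bytes:
    """Return data up to, but not including, the first 0x1A byte."""
    head, _sep, _tail = data.partition(CTRL_Z)
    return head


print(hex(CTRL_Z_CODE), strip_legacy_eof(b"hello\x1aleftover"))
```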
Seeing it in an error message, or used as the fall-back character for a failed encoding conversion, is however quite unusual. I'd double-check the code and make sure it isn't U+003F or U+FFFD, the more typical encoding fallback characters. Otherwise it may just be a quirk of the specific code you are dealing with.
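One way to do that double-check, sketched in Python: `find_fallbacks` is a hypothetical helper that reports which of the three candidate characters actually occur in a string. The last lines show where U+003F usually comes from, namely encoding to a charset that lacks the character.

```python
FALLBACKS = {
    "\u001a": "U+001A SUBSTITUTE",
    "\u003f": "U+003F QUESTION MARK",
    "\ufffd": "U+FFFD REPLACEMENT CHARACTER",
}


def find_fallbacks(text: str) -> dict:
    """Map each fallback character found in text to its occurrence count."""
    return {name: text.count(ch) for ch, name in FALLBACKS.items() if ch in text}


# Encoding with the "replace" handler substitutes "?" (U+003F)
# for characters the target charset cannot represent.
encoded = "é".encode("ascii", errors="replace")

print(encoded)
print(find_fallbacks("a\x1ab?"))
```

Printing the suspicious string with `repr()` or as hex code points before running a check like this avoids confusing the three characters, since terminals often render all of them the same way.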
#3
6
As far as I can tell, U+001A is a legacy character in Unicode. Its only reason for existence is that it was already defined in ASCII as the substitute character ("... used in the place of a character that is recognized to be invalid or in error or that cannot be represented on a given device."). It was also sometimes used to end a character stream (which is probably a common source of problems).
In Unicode that function is taken over by the U+FFFD REPLACEMENT CHARACTER.