Encoding space characters in XML names

Date: 2021-03-17 20:14:08

I have been given an XML file that contains names like the one below:

<Benchↂ0020Codeↂ0020>something</Benchↂ0020Codeↂ0020>

The ↂ symbol is represented with three bytes: 0xE2, 0x86, 0x82.

It looks like ↂ0020 is supposed to be treated as a space character, but when I read the XML using System.Xml.XmlReader, the characters ↂ0020 are not converted to a space.

Is there a way to have them converted (besides replacing, of course)? Or did I just get broken XML?

2 answers

#1


1  

The XML isn't broken, but it represents names using a private convention for escaping disallowed characters. The XML parser won't understand this convention; it's up to the receiving application to interpret it.
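
As a rough illustration of "the receiving application interprets it", here is a minimal Python sketch. It assumes the convention is "ↂ followed by exactly four hexadecimal digits stands for one character", which is only a guess based on the single example in the question. The helper names `decode_name` and `decoded_tags` are hypothetical, and the parser used here is Python's `xml.etree.ElementTree`, not the `System.Xml.XmlReader` from the question.

```python
import re
import xml.etree.ElementTree as ET

# Assumed private convention: U+2182 followed by exactly four hex digits
# encodes a single character (so "\u21820020" would mean a space).
_ESCAPE = re.compile("\u2182([0-9A-Fa-f]{4})")


def decode_name(name):
    """Replace every assumed escape sequence with the character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def decoded_tags(xml_text):
    """Parse the document normally, then decode element names afterwards.

    The parser itself knows nothing about the convention; the decoding is
    an application-level step applied to the names it reports.
    """
    root = ET.fromstring(xml_text)
    return [decode_name(element.tag) for element in root.iter()]
```

The decoded strings are only usable as application-level labels: a name containing a real space could not be written back into XML as an element name.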

#2


4  

Space characters are not permitted in XML names

There are 86 code points whose names contain the word SPACE. Ignoring those that match only because of MONOSPACE, and any others that have a visual representation, leaves the following:

  • #x0020 SPACE
  • #x00A0 NO-BREAK SPACE
  • [#x2002-#x200A] EN SPACE through HAIR SPACE
  • #x205F MEDIUM MATHEMATICAL SPACE
  • #x3000 IDEOGRAPHIC SPACE

None of the space-related code points (empty visual representation) are permitted in XML names by the W3C XML BNF for component names:

NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] |
                  [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] |
                  [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] |
                  [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
                  [#x10000-#xEFFFF]
NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] |
                  [#x203F-#x2040]
Name          ::= NameStartChar (NameChar)*
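
The grammar above can be checked mechanically. The following Python sketch transcribes the quoted ranges into tables and tests a string against the `Name` production; the function and table names are made up for this illustration, and the check covers only the productions quoted here, not the rest of the XML specification.

```python
# Code point ranges transcribed from the NameStartChar production above.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Additional ranges that NameChar allows on top of NameStartChar.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(code_point, ranges):
    return any(low <= code_point <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ord(ch), NAME_EXTRA_RANGES)


def is_xml_name(name):
    """Return True if `name` matches Name ::= NameStartChar (NameChar)*."""
    if not name:
        return False
    return is_name_start_char(name[0]) and all(is_name_char(c) for c in name[1:])
```

With these tables, `is_xml_name("Bench Code")` is false because #x0020 falls in none of the ranges, while `is_xml_name("Benchↂ0020Codeↂ0020")` is true.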

Alternatives to spaces in XML names

  • CamelCase
  • underscore_char
  • hyphen-char
  • period.char

Colon should not be used as a word separator in XML names to avoid confusion with its use in XML Namespaces.


ↂ is permitted in XML names

The character ↂ (UTF-8 bytes 0xE2 0x86 0x82, code point #x2182) has nothing to do with spaces; it is ROMAN NUMERAL TEN THOUSAND. ↂ is explicitly permitted: #x2182 falls within the [#x2070-#x218F] range of NameStartChar.
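
The byte-to-code-point claim is easy to confirm. This short Python sketch decodes the three bytes from the question and looks the character up in the standard `unicodedata` module; the `describe` helper is a name invented for the example.

```python
import unicodedata

# The three bytes reported in the question, interpreted as UTF-8.
raw = bytes([0xE2, 0x86, 0x82])
char = raw.decode("utf-8")


def describe(ch):
    """Return the code point in U+XXXX form together with its Unicode name."""
    return "U+{:04X}".format(ord(ch)), unicodedata.name(ch)


code_point, name = describe(char)

# True when the character lies in the [#x2070-#x218F] NameStartChar range.
in_permitted_range = 0x2070 <= ord(char) <= 0x218F
```

Here `describe(char)` yields the code point U+2182 and the name ROMAN NUMERAL TEN THOUSAND, and `char.isspace()` is false, so nothing about the character itself marks it as whitespace.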

The characters 0020 appearing after ↂ are just digits. Together with the rest of the characters in Benchↂ0020Codeↂ0020, they form an allowed (albeit unconventional) XML name. They do not constitute spaces in the XML name, because spaces are not allowed in XML names.
