Is the default encoding of XML UTF-8 or UTF-16?

Time: 2023-01-11 12:00:38

The OpenTag FAQ states:

If no encoding declaration is present in the XML document (and no external encoding declaration mechanism such as the HTTP header is available), the assumed encoding of an XML document depends on the presence of the Byte-Order-Mark (BOM).

The BOM is a special Unicode marker placed at the start of the file that indicates its encoding. The BOM is optional for UTF-8.

First bytes        Encoding assumed
-----------------------------------------
EF BB BF           UTF-8
FE FF              UTF-16 (big-endian)
FF FE              UTF-16 (little-endian)
00 00 FE FF        UTF-32 (big-endian)
FF FE 00 00        UTF-32 (little-endian)
None of the above  UTF-8
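
The table above can be sketched as a small lookup in Python. This is only an illustration of the rule as quoted, not how any particular XML parser implements it; the function name `assumed_encoding` is hypothetical. One detail is easy to miss: the UTF-32 little-endian BOM starts with the same two bytes as the UTF-16 little-endian BOM, so the longer marks are checked first.

```python
import codecs

# Longer BOMs first: the UTF-32 little-endian BOM (FF FE 00 00) begins with
# the UTF-16 little-endian BOM (FF FE), so the checking order matters.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (big-endian)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (little-endian)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (big-endian)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (little-endian)"),
]


def assumed_encoding(first_bytes: bytes) -> str:
    """Hypothetical helper: map the first bytes of a file to a row of the table."""
    for bom, name in _BOMS:
        if first_bytes.startswith(bom):
            return name
    return "UTF-8"  # the "None of the above" row


if __name__ == "__main__":
    print(assumed_encoding(b"\xef\xbb\xbf<a/>"))
    print(assumed_encoding(b"\xff\xfe<\x00a\x00/\x00>\x00"))
    print(assumed_encoding(b"<a/>"))
```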

Is there a dumbed-down explanation of the above paragraph?

1 solution

#1


You can use a line like

<?xml version="1.0" encoding="iso-8859-1" ?>

to specify which encoding is used. If no encoding is declared, the parser looks for a byte order mark (BOM) at the start of the file. If a BOM for UTF-16 or UTF-32 is present, that encoding is used. Otherwise the encoding is UTF-8 (the BOM for UTF-8 is optional).
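
As a rough check of these rules, here is a minimal sketch using Python's standard `xml.etree.ElementTree` parser. The element name `word` and the sample text are made up for illustration. It feeds the parser the same word three ways: with a declared encoding, with no declaration and no BOM, and with no declaration but a UTF-16 BOM.

```python
import xml.etree.ElementTree as ET

# Declared encoding: the byte 0xE9 is "é" in ISO-8859-1.
declared = b'<?xml version="1.0" encoding="iso-8859-1" ?><word>caf\xe9</word>'

# No declaration and no BOM: the bytes are read as UTF-8.
default_utf8 = "<word>café</word>".encode("utf-8")

# No declaration, but Python's "utf-16" codec writes a BOM before the text.
bom_utf16 = "<word>café</word>".encode("utf-16")

for data in (declared, default_utf8, bom_utf16):
    # All three byte sequences should parse to the same text.
    print(ET.fromstring(data).text)
```

The three inputs are different byte sequences, yet the parser should recover the same text from each of them.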

Edit

The BOM is an invisible character, and there is no need to see it: applications handle it automatically. When you save a file in Windows Notepad, you can select the encoding, and Notepad inserts the matching BOM, if the encoding has one, at the start of the file. When you later reopen the file, Notepad recognises the BOM and uses the matching encoding to read the file. You never need to modify the BOM yourself. If you did, the bytes that follow would be interpreted differently, so the text would no longer be the same.
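
To illustrate what "applications handle it automatically" means in practice, here is a small Python sketch. It relies on the standard `utf-8-sig` codec, which writes a BOM when encoding and silently drops it when decoding; the plain `utf-8` codec leaves the BOM in the text as the invisible character U+FEFF.

```python
# The "utf-8-sig" codec writes the 3-byte BOM before the text.
data = "test".encode("utf-8-sig")
print(data)                             # b'\xef\xbb\xbftest'

# A BOM-aware decoder removes the mark again.
print(repr(data.decode("utf-8-sig")))   # 'test'

# A decoder that is not BOM-aware keeps it as the character U+FEFF.
print(repr(data.decode("utf-8")))       # '\ufefftest'
```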

I will try to explain with an example. Consider a text file containing just the characters "test". By default Notepad uses ANSI encoding, and the file looks like this when you view it in hex mode:

C:\>C:\gnuwin32\bin\hexdump -C test-ansi.txt
00000000  74 65 73 74                                       |test|
00000004

(As you can see, I am using hexdump from gnuwin32, but you can also use a hex editor such as Frhed to see this.)

There is no BOM at the start of this file. There could not be one, because the character used for the BOM does not exist in ANSI encoding. (Because there is no BOM, editors that do not support ANSI encoding would treat this file as UTF-8.)

When I now save the file as UTF-8, you will see 3 extra bytes (the BOM) in front of "test":

C:\>C:\gnuwin32\bin\hexdump -C test-utf8.txt
00000000  ef bb bf 74 65 73 74                              |ï»¿test|
00000007

(If you opened this file in a text editor that does not support UTF-8, you would actually see those bytes as the characters "ï»¿".)
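
The effect can be reproduced in Python by decoding the same bytes with a single-byte code page. This sketch assumes the editor falls back to Windows-1252 (`cp1252`), which is a common meaning of "ANSI" on Western Windows systems but not the only one.

```python
# The bytes of test-utf8.txt: the UTF-8 BOM followed by "test".
data = b"\xef\xbb\xbftest"

# Assumption: the editor reads the file as Windows-1252, so each BOM byte
# becomes one visible character.
print(data.decode("cp1252"))  # ï»¿test
```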

Notepad can also save the file as "Unicode", which means UTF-16 little-endian (UTF-16LE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode.txt
00000000  ff fe 74 00 65 00 73 00  74 00                    |ÿþt.e.s.t.|
0000000a

And here is the version saved as "Unicode big endian" (UTF-16BE):

C:\>C:\gnuwin32\bin\hexdump -C test-unicode-big-endian.txt
00000000  fe ff 00 74 00 65 00 73  00 74                    |þÿ.t.e.s.t|
0000000a
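
If you do not have hexdump at hand, the byte sequences shown so far can be rebuilt with a few lines of Python. This is a sketch that constructs the bytes in memory rather than reading Notepad's files, and it assumes "ANSI" means Windows-1252 (`cp1252`); the dictionary keys simply reuse the file names from the dumps.

```python
import codecs

text = "test"

# Each value is the BOM (if the encoding has one) followed by the encoded text.
files = {
    "test-ansi.txt": text.encode("cp1252"),  # assumption: ANSI = Windows-1252
    "test-utf8.txt": codecs.BOM_UTF8 + text.encode("utf-8"),
    "test-unicode.txt": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "test-unicode-big-endian.txt": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
}

for name, data in files.items():
    # bytes.hex(" ") prints the bytes as space-separated hex pairs.
    print(f"{name:30} {data.hex(' ')}")
```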

Now consider a text file with the four Chinese characters "琀攀猀琀". When I save that as "Unicode big endian", the result looks like this:

C:\>C:\gnuwin32\bin\hexdump -C test2-unicode-big-endian.txt
00000000  fe ff 74 00 65 00 73 00  74 00                    |þÿt.e.s.t.|
0000000a

As you can see, the word "test" in UTF-16LE is stored as the same bytes as the word "琀攀猀琀" in UTF-16BE. But because the BOM is stored differently, you can tell whether the file contains "test" or "琀攀猀琀". Without a BOM you would have to guess.
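
The ambiguity can be shown directly in Python. In this sketch the variable `payload` holds the eight bytes that both files share after the BOM; Python's `utf-16` codec reads the BOM to pick the byte order, while the `utf-16-le` and `utf-16-be` codecs are told the byte order explicitly.

```python
# The eight bytes that follow the BOM in both dumps.
payload = bytes.fromhex("74 00 65 00 73 00 74 00")

# Without a BOM the reader has to be told (or guess) the byte order.
print(payload.decode("utf-16-le"))  # test
print(payload.decode("utf-16-be"))  # 琀攀猀琀

# With a BOM the "utf-16" codec works out the byte order by itself.
print((b"\xff\xfe" + payload).decode("utf-16"))  # test
print((b"\xfe\xff" + payload).decode("utf-16"))  # 琀攀猀琀
```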
