The meaning of ...

Time: 2023-01-05 22:03:30

I am new to XML and I am trying to understand the basics. I read the line below in "Learning XML", but I am still not clear. Can someone point me to a book or website which explains these basics clearly?

From Learning XML:

The XML declaration describes some of the most general properties of the document, telling the XML processor that it needs an XML parser to interpret this document.

What does this mean?

I understand the "xml version part" - both doc and user of doc should "talk" in the same version of XML. But what about the encoding part? Why is that necessary?

6 answers

#1


90  

To understand the "encoding" attribute, you have to understand the difference between bytes and characters.

Think of bytes as numbers between 0 and 255, whereas characters are things like "a", "1" and "Ä". The set of all characters that are available is called a character set.

Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.

Most encodings are based on an old character set and encoding called ASCII which is a single byte per character (actually, only 7 bits) and contains 128 characters including a lot of the common characters used in US English.

For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.

Extract of ASCII Table 60-65
╔══════╦══════════════╗
║ Byte ║  Character   ║
╠══════╬══════════════╣
║  60  ║      <       ║
║  61  ║      =       ║
║  62  ║      >       ║
║  63  ║      ?       ║
║  64  ║      @       ║
║  65  ║      A       ║
╚══════╩══════════════╝
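
If you want to reproduce this extract yourself, here is a minimal Python sketch (Python is only my choice for illustration, and the variable name `ascii_rows` is made up). It decodes each byte value from 60 to 65 as ASCII and prints the pair.

```python
# Decode each byte value from 60 to 65 as a single ASCII character.
ascii_rows = [(value, bytes([value]).decode("ascii")) for value in range(60, 66)]

for value, char in ascii_rows:
    # Right-align the byte value so the two columns line up.
    print(f"{value:>3}  {char}")
```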

In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).

However, once you need more characters than basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and a different encoding, because 128 values are not enough to fit all the characters in. Some encodings use a single byte per character (up to 256 characters); others use a variable number of bytes per character (the original UTF-8 design allowed up to six, and current UTF-8 uses at most four).

Over time a lot of encodings have been created. In the Windows world there is CP1252 and the closely related ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively for its in-memory strings.

One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.

For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
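
The byte values in this example can be checked with a few lines of Python. This is only a sketch to make the numbers tangible; the variable names are my own.

```python
char = "â"

# The same character becomes different byte sequences in different encodings.
latin1_bytes = list(char.encode("iso-8859-1"))
utf8_bytes = list(char.encode("utf-8"))

# Reading the UTF-8 bytes as if they were ISO 8859-1 yields two wrong characters.
misread = bytes(utf8_bytes).decode("iso-8859-1")

# Going the other way, the single ISO 8859-1 byte is not valid UTF-8 at all.
try:
    bytes(latin1_bytes).decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(latin1_bytes, utf8_bytes, misread, latin1_is_valid_utf8)
```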

Think of an XML document not as a sequence of characters but as a sequence of bytes.

Imagine the system receiving the XML sees the bytes 195, 162. How does it know what characters these are?

In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used in the XML.
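
As a rough illustration of the point above, the sketch below feeds two different byte sequences to Python's standard `xml.etree.ElementTree` parser. Each document declares its own encoding, and the parser should recover the same character from both. The element name `word` is made up for the example.

```python
import xml.etree.ElementTree as ET

# One byte (226) for the character, declared as ISO-8859-1.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>\xe2</word>'

# Two bytes (195, 162) for the same character, declared as UTF-8.
utf8_doc = b'<?xml version="1.0" encoding="UTF-8"?><word>\xc3\xa2</word>'

# Passing bytes (not str) lets the parser read the declaration and decode accordingly.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(latin1_text == utf8_text)
```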

Since most common encodings are compatible with ASCII, as far as basic alphabetic characters and symbols go, in these cases, the declaration itself can get away with using only the ASCII characters to say what the encoding is. In other cases, the parser must try and figure out the encoding of the declaration. Since it knows the declaration begins with <?xml it is a lot easier to do this.
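
To show why an ASCII-compatible declaration is easy to read before the real encoding is known, here is a simplified sketch. The helper `sniff_declared_encoding` is hypothetical and written only for this illustration; it ignores byte order marks and encodings that are not ASCII-compatible (UTF-16, for instance), which a real parser has to handle as well.

```python
import re

# Matches the encoding name inside an XML declaration at the very start of the text.
_DECL_ENCODING = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or `default` if none is found.

    Hypothetical helper: it assumes the declaration itself is written in
    ASCII-compatible bytes, which is the easy case described above.
    """
    # Only the first bytes matter; undecodable bytes are replaced, not fatal.
    head = data[:200].decode("ascii", errors="replace")
    match = _DECL_ENCODING.match(head)
    return match.group(1) if match else default


sample = b'<?xml version="1.0" encoding="ISO-8859-1"?><word>\xe2</word>'
print(sniff_declared_encoding(sample))
```

Once the name is known, the rest of the bytes can be decoded with it, for example with `data.decode(sniff_declared_encoding(data))`.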

Finally, the version attribute specifies the XML version, of which there are two at the moment, 1.0 and 1.1 (see Wikipedia XML versions). There are slight differences between the versions, so an XML parser needs to know what it is dealing with. In most cases (for English speakers anyway), version 1.0 is sufficient.

#2


17  

An XML declaration is not required in all XML documents; however XHTML document authors are strongly encouraged to use XML declarations in all their documents. Such a declaration is required when the character encoding of the document is other than the default UTF-8 or UTF-16 and no encoding was determined by a higher-level protocol. Here is an example of an XHTML document. In this example, the XML declaration is included.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://example.org/">example.org</a>.</p>
  </body>
</html>
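
To see why the declaration becomes necessary for an encoding other than the defaults, here is a small sketch using Python's standard `xml.etree.ElementTree`. It assumes the bytes are handed straight to the parser with no higher-level protocol supplying the encoding. The element name `word` is made up.

```python
import xml.etree.ElementTree as ET

# An ISO-8859-1 encoded body with no XML declaration.
latin1_body = "<word>â</word>".encode("iso-8859-1")

# Without a declaration the parser assumes a default encoding,
# and the lone byte 226 is not valid UTF-8.
try:
    ET.fromstring(latin1_body)
    parsed_without_declaration = True
except ET.ParseError:
    parsed_without_declaration = False

# Prepending the declaration tells the parser how to read the same bytes.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body
parsed_text = ET.fromstring(declared).text

print(parsed_without_declaration, parsed_text)
```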

Please refer to the W3C standards for XML.

#3


3  

The encoding declaration identifies which encoding is used to represent the characters in the document.

More on the XML Declaration here: http://msdn.microsoft.com/en-us/library/ms256048.aspx

#4


2  

This is the optional XML preamble.

  • version="1.0" means that the file conforms to version 1.0 of the XML standard
  • encoding="utf-8" means that the file is encoded using the UTF-8 Unicode encoding

#5


2  

Can someone point me to a book or website which explains these basics clearly?

You can check this XML Tutorial with examples.

But what about the encoding part? Why is that necessary?

W3C provides an explanation of encoding:

"The document character set for XML and HTML 4.0 is Unicode (aka ISO 10646). This means that HTML browsers and XML processors should behave as if they used Unicode internally. But it doesn't mean that documents have to be transmitted in Unicode. As long as client and server agree on the encoding, they can use any encoding that can be converted to Unicode..."

#6


-1  

The XML declaration in the document map consists of the following:

The version number, <?xml version="1.0"?>.

This is mandatory. Although the number might change for future versions of XML, 1.0 is the current version.

The encoding declaration, encoding="UTF-8".

This is optional. If used, the encoding declaration must appear immediately after the version information in the XML declaration, and must contain a value representing an existing character encoding.
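
For completeness, here is a sketch of how a library can write the declaration for you, with the version first and the encoding right after it. It uses Python's standard `xml.etree.ElementTree`; the element name `word` is made up, and the exact quoting and spelling in the emitted declaration may vary between Python versions.

```python
import xml.etree.ElementTree as ET

root = ET.Element("word")
root.text = "â"

# Serializing to an encoding other than UTF-8 or US-ASCII makes ElementTree
# prepend an XML declaration that names that encoding.
serialized = ET.tostring(root, encoding="ISO-8859-1")

print(serialized)
```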
