Reading XML containing "&" into a C# XmlDocument object

Time: 2022-05-21 15:47:41

I have inherited a poorly written web application that seems to have errors when it tries to read in an xml document stored in the database that has an "&" in it. For example there will be a tag with the contents: "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?

EDIT: Are there any other characters that will cause this same type of parser error for not being well formed?

6 Answers

#1


40  

The problem is that the xml is not well-formed. Properly generated xml would list that data like this:

Prepaid &amp; Charge

I've had to fix the same problem before, and I did it with this regex:

Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:

const string goodAmpersand = "&amp;";

Now you can just say badAmpersand.Replace(<your input>, goodAmpersand);

Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.

The catches here are that you have to do this to your xml document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
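
One possible way to cover the hex-coded references mentioned in the update is to add a third alternative to the negative lookahead. This is only a sketch: the `#[xX][0-9a-fA-F]{1,6};` branch and the `AmpersandFixer`, `Fix` and `Load` names are my own assumptions, not the final expression from the original answer. Anything merely shaped like a named entity (for example `&nbsp;`) is still left alone, and the parser will reject it unless that entity is declared.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class AmpersandFixer
{
    // Matches an '&' that does NOT begin something shaped like a reference:
    // &name;  &#123;  or  &#x1F;  (the hex branch is the assumed addition).
    private static readonly Regex BadAmpersand = new Regex(
        "&(?![a-zA-Z]{2,6};|#[0-9]{2,4};|#[xX][0-9a-fA-F]{1,6};)");

    private const string GoodAmpersand = "&amp;";

    // Escapes stray ampersands; ones that already look escaped are untouched.
    public static string Fix(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, GoodAmpersand);
    }

    // The extra pass over the text, followed by a normal parse.
    public static XmlDocument Load(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Fix(rawXml));
        return doc;
    }
}
```

With this, `AmpersandFixer.Load("<a>Prepaid & Charge</a>")` should parse, and the element's `InnerText` reads back as `Prepaid & Charge`.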

Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.
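
If you want to probe these rules from code, .NET exposes them through `XmlConvert`. The sketch below is illustrative only; the `XmlCharChecks` class and its method names are made up for this example. Keep in mind that "legal character" is a separate question from "needs escaping": `&` and `<` are legal characters, they just cannot appear unescaped in element content.

```csharp
using System.Xml;

public static class XmlCharChecks
{
    // True when the string contains only characters XML allows at all.
    // VerifyXmlChars throws XmlException on the first illegal character.
    public static bool HasOnlyLegalChars(string text)
    {
        try
        {
            XmlConvert.VerifyXmlChars(text);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Allowed as the first character of a (non-colonized) name?
    public static bool CanStartName(char c)
    {
        return XmlConvert.IsStartNCNameChar(c);
    }

    // Allowed somewhere after the first character of a name?
    public static bool CanAppearInName(char c)
    {
        return XmlConvert.IsNCNameChar(c);
    }
}
```

For instance, a digit such as `'1'` passes `CanAppearInName` but not `CanStartName`, and a control character such as `'\u0001'` fails `HasOnlyLegalChars`.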

So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.

Oh, and a note about the CDATA suggestion: I'd use that to make sure xml that I'm creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.

#2


4  

The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.

Edit: in answer to the followup question, yes there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.
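
To see several of these failures for yourself, a small helper like the following can be used. It is a rough sketch, and the `WellFormedness` class and `IsWellFormed` method are hypothetical names chosen for illustration. It relies only on the fact that `XmlDocument.LoadXml` throws `XmlException` for input that is not well-formed.

```csharp
using System.Xml;

public static class WellFormedness
{
    // Returns true if the text parses as XML, false if the parser rejects it.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Inputs such as `<a><b></a>` (unbalanced tags), `<a>1 < 2</a>` (unencoded less-than), `<a x=1/>` (unquoted attribute value), `<a>&nbsp;</a>` (undeclared entity) and `<a>Prepaid & Charge</a>` are all rejected, while `<a>Prepaid &amp; Charge</a>` is accepted.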

#3


4  

The other answers are all correct, and I concur with their advice, but let me just add one thing:

PLEASE do not make applications that work with non well-formed XML, it just makes the rest of our lives more difficult :).

Granted, there are times when you really just don't have a choice if you have no control over the other end, but you should really have it throwing a fatal error and complaining very loudly and explicitly about what is broken when such an event occurs.

You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".

I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing along with getting the error code and message.
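
In .NET (rather than MSXML), the position information comes with the exception itself: `XmlException` carries `LineNumber`, `LinePosition` and `Message`. A minimal sketch of complaining loudly with that detail might look like this; the `XmlDiagnostics` class and `Describe` method are invented names for illustration.

```csharp
using System.Xml;

public static class XmlDiagnostics
{
    // Returns null when the text parses; otherwise a message that says
    // where the parser gave up and why.
    public static string Describe(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            return string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
        }
    }
}
```

Logging the result of `Describe` alongside the database key of the offending row gives whoever owns the data source something concrete to fix.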

#4


3  

Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.

If it's at all possible, you should fix this - in particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.
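
Fixing the generating side usually means building the XML with an XML API instead of concatenating strings, so that escaping happens automatically. A hedged sketch using LINQ to XML follows; the `XmlProducer` class, the `BuildPaymentType` method and the `PaymentType` element name are all hypothetical, since the question does not show the real schema.

```csharp
using System.Xml.Linq;

public static class XmlProducer
{
    // Builds a single element; XElement escapes '&' and '<' in the value.
    // "PaymentType" is a made-up element name for illustration.
    public static string BuildPaymentType(string value)
    {
        var element = new XElement("PaymentType", value);
        return element.ToString(SaveOptions.DisableFormatting);
    }
}
```

Calling `XmlProducer.BuildPaymentType("Prepaid & Charge")` yields `<PaymentType>Prepaid &amp; Charge</PaymentType>`, which any conforming parser will accept.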

#5


2  

You can replace & with &amp;

Or you might also be able to use CDATA sections.

#6


2  

There are several characters which will cause XML data to be reported as badly-formed.

From w3schools:

Characters like "<" and "&" are illegal in XML elements.

The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.

<![CDATA[This is my wonderful & great user text]]>

Everything between <![CDATA[ and ]]> is treated as literal character data: the parser does not interpret it as markup. The text itself must not contain the sequence ]]>.
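
As a rough illustration of the CDATA approach in C#, the DOM can create the section for you and read it back as plain text. The `CDataDemo` class, its method names and the `note` element are assumptions made up for this sketch.

```csharp
using System.Xml;

public static class CDataDemo
{
    // Wraps the text in a CDATA section and returns the element's markup.
    // "note" is a hypothetical element name.
    public static string Wrap(string text)
    {
        var doc = new XmlDocument();
        XmlElement note = doc.CreateElement("note");
        note.AppendChild(doc.CreateCDataSection(text));
        return note.OuterXml;
    }

    // Parses the markup and returns the element's text, CDATA included.
    public static string ReadBack(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

`CDataDemo.Wrap("This is my wonderful & great user text")` produces `<note><![CDATA[This is my wonderful & great user text]]></note>`, and `ReadBack` on that string returns the original text with the ampersand intact.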
