Reading XML with an "&" into a C# XmlDocument object

Time: 2022-05-21 15:47:41

I have inherited a poorly written web application that seems to have errors when it tries to read in an xml document stored in the database that has an "&" in it. For example there will be a tag with the contents: "Prepaid & Charge". Is there some secret simple thing to do to have it not get an error parsing that character, or am I missing something obvious?


EDIT: Are there any other characters that will cause this same type of parser error for not being well formed?


6 Answers



The problem is that the xml is not well-formed. Properly generated xml would list that data like this:


Prepaid &amp; Charge


I've had to fix the same problem before, and I did it with this regex:


Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:


const string goodAmpersand = "&amp;";

Now you can just say badAmpersand.Replace(<your input>, goodAmpersand);

Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.


The catches here are that you have to do this to your xml document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
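
Here is a rough sketch of what the updated expression might look like, assuming the only references worth preserving are named (&name;), decimal (&#38;) and hex (&#x26;) ones. The names AmpersandRepair, EscapeBareAmpersands and LoadLenient are made up for illustration, and this still does nothing about CDATA sections or a stray <.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class AmpersandRepair
{
    // Matches an '&' that does NOT begin a named reference (&name;),
    // a decimal reference (&#38;) or a hex reference (&#x26;).
    private static readonly Regex BadAmpersand = new Regex(
        "&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)",
        RegexOptions.Compiled);

    // Escapes only the bare ampersands; already-encoded ones are left alone.
    public static string EscapeBareAmpersands(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, "&amp;");
    }

    // Hypothetical helper: repair the text first, then hand it to the parser.
    public static XmlDocument LoadLenient(string rawXml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(EscapeBareAmpersands(rawXml));
        return doc;
    }
}
```

Keep in mind that a named reference other than the five predefined ones (amp, lt, gt, quot, apos) will get past this regex but still fail in the parser unless a DTD declares it.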


Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.


So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.


Oh, and a note about the CDATA suggestion: I'd use that to make sure xml that I'm creating is well-formed, but when dealing with existing xml from outside, I find the regex method easier.




The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.

Edit: in answer to the followup question, yes there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.




The other answers are all correct, and I concur with their advice, but let me just add one thing:


PLEASE do not make applications that work with non well-formed XML, it just makes the rest of our lives more difficult :).


Granted, there are times when you really just don't have a choice if you have no control over the other end, but you should really have it throwing a fatal error and complaining very loudly and explicitly about what is broken when such an event occurs.


You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".


I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing along with getting the error code and message.
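
For what it's worth, in .NET the equivalent seems to be catching XmlException, which carries LineNumber and LinePosition properties. A minimal sketch, with XmlDiagnostics and TryLoad being hypothetical names of my own:

```csharp
using System.Xml;

public static class XmlDiagnostics
{
    // Tries to parse the text. On failure, 'error' says where the parser gave up.
    public static bool TryLoad(string rawXml, out string error)
    {
        try
        {
            new XmlDocument().LoadXml(rawXml);
            error = string.Empty;
            return true;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition point at the offending character.
            error = string.Format("Line {0}, position {1}: {2}",
                ex.LineNumber, ex.LinePosition, ex.Message);
            return false;
        }
    }
}
```

Logging that string along with the record's key should make it much easier to track down which rows in the database are broken.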



Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.


If it's at all possible, you should fix this - in particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.
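
If the generating process is also .NET, one way to fix it at the source might be to build the document through the XML API instead of concatenating strings, so that escaping happens on serialization. A small sketch; the element name "category" and the names XmlProducer and BuildCategoryXml are assumptions for illustration only:

```csharp
using System.Xml;

public static class XmlProducer
{
    // Builds a one-element document; the API escapes '&' and '<' when serializing.
    public static string BuildCategoryXml(string categoryName)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("category"); // hypothetical element name
        root.InnerText = categoryName;                   // stored raw, escaped in OuterXml
        doc.AppendChild(root);
        return doc.OuterXml;
    }
}
```

Calling BuildCategoryXml("Prepaid & Charge") should give you <category>Prepaid &amp; Charge</category>, which any parser will read back without complaint.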




You can replace & with &amp;

Or you might also be able to use CDATA sections.




There are several characters which will cause XML data to be reported as badly-formed.


From w3schools:


Characters like "<" and "&" are illegal in XML elements.


The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.


<![CDATA[This is my wonderful & great user text]]>

Everything between <![CDATA[ and ]]> is treated by the parser as literal character data rather than markup.
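
If you are the one producing the XML in C#, something along these lines could do the wrapping for you. This is only a sketch: the element name "note" and the names CDataExample and WrapInCData are invented for the example.

```csharp
using System.Xml;

public static class CDataExample
{
    // Puts the untrusted text into a CDATA section under a single root element.
    public static string WrapInCData(string text)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("note"); // hypothetical element name
        root.AppendChild(doc.CreateCDataSection(text));
        doc.AppendChild(root);
        return doc.OuterXml;
    }
}
```

One caveat: the text itself must not contain the sequence ]]>, since that is what ends a CDATA section.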


