PHP DOMDocument->loadXML with XML containing ampersand/less-than/greater-than?

Date: 2022-10-20 15:43:36

I'm trying to parse an XML string that contains the characters &, < and > in its text content. Normally those characters would be escaped as entities, but in my case they aren't, so I get the following warnings:

Warning: DOMDocument::loadXML() [function.loadXML]: error parsing attribute name in Entity ...
Warning: DOMDocument::loadXML() [function.loadXML]: Couldn't find end of Start Tag ...

I can use str_replace to encode every &, but if I do the same with < or >, I also encode the ones that belong to valid XML tags.

Does anyone know a workaround for this problem?

Thank you!

4 answers

#1


5  

If you have a < inside the text of an XML document, it is not well-formed XML. Try to encode it as &lt;, or enclose the text in a <![CDATA[ ... ]]> section.

If that's not possible (because you're not the one producing this "XML"), I'd suggest trying an HTML parsing library (I haven't used them, but they exist), because they are less strict than XML parsers.

But I'd really try to get valid XML before trying anything else!
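
The "encode it" advice can be sketched as follows, assuming you control the code that builds the XML string. The element name `blah` and the sample text are placeholders, not something from the question; `htmlspecialchars()` with `ENT_XML1` is one way to do the escaping, not the only one.

```php
<?php
// Hypothetical raw text that would otherwise break the parser.
$text = 'x & y < 3';

// ENT_XML1 restricts the output to XML's predefined entities;
// ENT_QUOTES also escapes both quote characters.
$escaped = htmlspecialchars($text, ENT_XML1 | ENT_QUOTES, 'UTF-8');

// <blah> is only a placeholder element name.
$xml = '<blah>' . $escaped . '</blah>';

$dom = new DOMDocument();
$loaded = $dom->loadXML($xml);

// After parsing, the text node holds the original, unescaped string.
$roundTrip = $dom->documentElement->textContent;
```

The escaping only exists in the serialized string; code that reads the parsed document sees the plain text again.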

#2


3  

I often put @ in front of DOMDocument load() calls, mainly because you can never be absolutely sure that what you load is what you expected.

Using @ will suppress errors.

@$dom->loadXML($myXml);
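
If you want the warnings gone but still need to know what went wrong, a possible alternative to @ is libxml's internal error buffer. This is a minimal sketch, assuming a malformed sample string like the one in the question; the variable names are illustrative.

```php
<?php
// Hypothetical malformed input, mirroring the question.
$myXml = '<blah>x & y < 3</blah>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadXML($myXml);

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
$messages = [];
foreach ($errors as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// Empty the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Neither this nor @ makes the document parse: loadXML() still fails on malformed input, so check its return value before using $dom.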

#3


1  

I can use the str_replace to encode all the &, but if I do that with < or > I'm doing it for valid XML tags too.

As a strictly temporary fix-up measure, you can replace the ones that aren't part of what looks like a tag or entity reference, e.g.:

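// Escape "<" unless it starts a tag, closing tag, comment, CDATA section or PI.
$str = preg_replace('~<(?![a-zA-Z_!?/])~', '&lt;', $str);
// Escape "&" unless it already starts an entity or character reference.
$str = preg_replace('~&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $str);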

However this isn't watertight and in the longer term you need to fix whatever is generating this bogus markup, or shout at the person who needs to fix it until they get a clue. Grossly-non-well-formed XML like this is simply not XML by definition.
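
To make the fix-up concrete, here is a sketch that wraps the two replacements in a function and feeds the result to loadXML(). The function name `escapeStrayMarkup` and the sample input are hypothetical. The patterns use `~` as the regex delimiter, and the first lookahead includes `/` so that closing tags such as `</blah>` are left alone.

```php
<?php
// Hypothetical helper name; the patterns follow the approach described above.
function escapeStrayMarkup(string $str): string
{
    // Escape "<" unless it looks like the start of a tag, closing tag,
    // comment, CDATA section or processing instruction.
    $str = preg_replace('~<(?![a-zA-Z_!?/])~', '&lt;', $str);
    // Escape "&" unless it already starts an entity or character reference.
    return preg_replace('~&(?!([a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)~', '&amp;', $str);
}

$fixed = escapeStrayMarkup('<blah>x & y < 3</blah>');

$dom = new DOMDocument();
$loaded = $dom->loadXML($fixed);
$text = $dom->documentElement->textContent;
```

One way this leaks: text such as `a <b` followed by a letter still looks like the start of a tag, so that `<` is not escaped and the parser will still choke on it.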

#4


0  

Put all your text inside CDATA sections?

<!-- Old -->
<blah>
    x & y < 3
</blah>

<!-- New -->
<blah><![CDATA[
    x & y < 3
]]></blah>
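
If the XML is generated from PHP, the "New" form above can be produced with DOMDocument itself instead of string concatenation. This is a sketch under that assumption; the element name `blah` comes from the example, everything else is illustrative.

```php
<?php
// Hypothetical raw text containing characters that are special in XML.
$text = 'x & y < 3';

$dom = new DOMDocument('1.0', 'UTF-8');
$blah = $dom->createElement('blah');
// Wrap the text in a CDATA section instead of escaping it.
$blah->appendChild($dom->createCDATASection($text));
$dom->appendChild($blah);

// Serialize only the <blah> element, without the XML declaration.
$xml = $dom->saveXML($blah);

// Parsing the result back yields the original text.
$reparsed = new DOMDocument();
$loaded = $reparsed->loadXML($xml);
$roundTrip = $reparsed->documentElement->textContent;
```

A CDATA section cannot itself contain the sequence `]]>`, so text that may include it needs to be split across sections or escaped instead.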
