Fixing malformed XML in PHP before processing it with DOMDocument functions

Date: 2022-10-20 19:27:38

I need to load an XML document from an external source into PHP. The XML does not declare its encoding and contains illegal characters such as &. If I open the XML document directly in the browser I get errors like "An invalid character was found in text content", and when loading the file in PHP I get lots of warnings like: xmlParseEntityRef: no name in Entity and Input is not proper UTF-8, indicate encoding ! Bytes: 0x9C 0x31 0x21 0x3C.

It's clear that the XML is not well formed and contains illegal characters that should be converted to XML entities.

This is because the XML feed is made up of data supplied by lots of other users and clearly it's not being validated or reformatted before I get it.

I've spoken to the supplier of the XML feed and they say they are trying to get the content providers to sort it out, but this seems silly as they should be validating the input first.

I basically need to fix the XML, correcting any encoding errors and converting any illegal characters to XML entities, so that the XML loads without problems when using PHP's DOMDocument functions.

My code currently looks like:

  $feedURL = '3704017_14022010_050004.xml';
  $dom = new DOMDocument();
  $dom->load($feedURL);

Example XML file showing encoding issue (click to download): feed.xml

Example XML that contains chars that have not been converted to XML entities:

<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>

3 Answers

#1


8  

Try using the Tidy library which can be used to clean up bad HTML and XML http://php.net/manual/en/book.tidy.php
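
As a rough illustration of the Tidy suggestion, the sketch below runs the feed through tidy_repair_string() in XML mode before handing it to DOMDocument. It assumes the tidy extension is installed; repairWithTidy is a hypothetical helper name, and the input-xml/output-xml options and the UTF-8 encoding are assumptions about what suits this feed, so check the repaired output against your own data.

```php
<?php
// Hypothetical helper: returns the repaired XML, or null when the tidy
// extension is unavailable or Tidy could not process the input.
function repairWithTidy(string $xml): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $config = [
        'input-xml'  => true,  // parse the input as XML rather than HTML
        'output-xml' => true,  // emit the result as XML
        'wrap'       => 0,     // do not re-wrap long lines
    ];
    // Assumption: the feed is (or has already been converted to) UTF-8.
    $repaired = tidy_repair_string($xml, $config, 'utf8');
    return is_string($repaired) ? $repaired : null;
}

// Stand-in for the contents of the real feed file.
$xml = '<?xml version="1.0"?><feed><NAME>This & This</NAME></feed>';

$repaired = repairWithTidy($xml);
if ($repaired !== null) {
    $dom = new DOMDocument();
    $dom->loadXML($repaired);
}
```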

A pure PHP solution to fix some XML like this:

<?xml version="1.0"?>
<feed>
<RECORD>
<ID>117387</ID>
<ADVERTISERNAME>Test < texter</ADVERTISERNAME>
<AID>10544740</AID>
<NAME>This & This</NAME>
<DESCRIPTION>For one day only this is > than this.</DESCRIPTION>
</RECORD>
</feed>

Would be something like this:

  function cleanupXML($xml) {
    // Only &, < and > need escaping in text content. Replacing them directly
    // (instead of calling htmlentities() on single bytes) leaves multi-byte
    // UTF-8 sequences intact and never emits HTML-only named entities,
    // which XML does not define.
    // Caveats: entities that are already escaped (e.g. &amp;) are escaped
    // again, and comments, CDATA sections and attribute values containing
    // > are not handled.
    $xmlOut = '';
    $inTag = false;
    $xmlLen = strlen($xml);
    for ($i = 0; $i < $xmlLen; ++$i) {
        $char = $xml[$i];
        switch ($char) {
        case '<':
          if (!$inTag) {
              // Seek forward for the next tag boundary; if none is found,
              // the < cannot open a tag, so it is treated as text.
              $char = '&lt;';
              for ($j = $i + 1; $j < $xmlLen; ++$j) {
                 if ($xml[$j] === '<') {  // Another < comes first: this one is text
                    break;
                 }
                 if ($xml[$j] === '>') {  // A > comes first: we are entering a tag
                    $char = '<';
                    $inTag = true;
                    break;
                 }
              }
          } else {
             $char = '&lt;';
          }
          break;
        case '>':
          if (!$inTag) {  // No need to seek ahead here
             $char = '&gt;';
          } else {
             $inTag = false;
          }
          break;
        default:
          if (!$inTag && $char === '&') {
             $char = '&amp;';
          }
          break;
        }
        $xmlOut .= $char;
    }
    return $xmlOut;
  }

This is a simple state machine that tracks whether we are inside a tag; when we are not, it escapes the characters &, < and > in the text as XML entities.

It's worth noting that this will be memory hungry on large files so you may want to rewrite it as a stream plugin or a pre-processor.
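
One possible shape for such a pre-processor, as a minimal sketch rather than a complete fix: it converts the raw bytes to UTF-8 when they are not already valid, then escapes only the ampersands that do not start an entity reference. The helper name preprocessFeed is hypothetical, and Windows-1252 as the source encoding is an assumption, since the feed does not declare one. Stray < and > in text would still need something like the state machine above.

```php
<?php
// Hypothetical helper; the Windows-1252 default is an assumption about the feed.
function preprocessFeed(string $raw, string $assumedEncoding = 'Windows-1252'): string
{
    // preg_match() with the /u modifier fails on invalid UTF-8, which makes
    // it a cheap validity check that needs no extra extension.
    if (preg_match('//u', $raw) !== 1) {
        $converted = iconv($assumedEncoding, 'UTF-8', $raw);
        if ($converted !== false) {
            $raw = $converted;
        }
    }

    // Escape every & that is not already the start of a named, decimal,
    // or hexadecimal entity reference.
    $pattern = '/&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/';
    $escaped = preg_replace($pattern, '&amp;', $raw);

    return $escaped ?? $raw;
}

// Stand-in for one record of the real feed: a bare &, an already escaped
// &amp;, and a lone 0x9C byte (not valid UTF-8 on its own).
$sample = "<NAME>This & This &amp; \x9C1</NAME>";
$clean  = preprocessFeed($sample);

$dom = new DOMDocument();
$dom->loadXML($clean);
```

Escaping only bare ampersands avoids double-escaping entities that some contributors did encode correctly.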

#2


10  

To solve this issue, set the DOMDocument recover property to TRUE before loading the XML document:

$dom->recover = TRUE;

Try this code:

$feedURL = '3704017_14022010_050004.xml';
$dom = new DOMDocument();
$dom->recover = TRUE;
$dom->load($feedURL);
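
Recovery mode still makes libxml report the problems it skips over, so it can help to collect those errors rather than let them surface as PHP warnings. A minimal sketch, assuming the feed has been read into a string (the inline $xml below is a stand-in for the real file). What the recovered document contains at the broken spots is up to libxml, so inspect it before trusting it.

```php
<?php
// Stand-in for the contents of the real feed file.
$xml = '<?xml version="1.0"?><feed><RECORD><NAME>This & This</NAME></RECORD></feed>';

// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->recover = true;   // keep parsing past well-formedness errors
$dom->loadXML($xml);

$errors = libxml_get_errors();   // array of LibXMLError objects
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry records the message and the input line it refers to.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
```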

#3


0  

If the tidy extension is not an option, you may consider HTML Purifier.
