I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are not well-formed. Here is a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, there is a missing </ul> closing tag inside the g:material tag. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add the missing CDATA section.
I've tried to use a SAX parser to read this file, but it fails on the </g:material> tag because the </ul> tag is missing. I tried XMLReader and hit basically the same issue. I could probably do something with DomDocument::loadHtml, but the size of this file is not really compatible with a DOM approach. Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DomDocument to work? Thanks.
2 solutions
#1
3
If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.
$ tidy -output my.clean.xml my.xml
After that, the XML files are well-formed, so you can parse them using the XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's code is inside the element.
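Since the stated goal in the question is to add the missing CDATA section, a streaming pre-pass is another possibility. The following is only a sketch, written in Python for illustration rather than PHP, and it assumes that each g:material element opens and closes on a single line, as in the sample feed. The names `wrap_material` and `MATERIAL` are made up for this example. The same line-by-line idea should carry over to PHP with `fgets` and `preg_replace_callback`.

```python
import io
import re

# Assumption: every <g:material> element sits on one line, as in the sample feed.
MATERIAL = re.compile(r"(<g:material>)(.*?)(</g:material>)")


def _wrap(match):
    """Return the matched element with its content enclosed in CDATA."""
    body = match.group(2)
    if body.startswith("<![CDATA["):
        return match.group(0)  # already wrapped, leave it untouched
    # "]]>" may not appear inside a CDATA section, so split it across two.
    body = body.replace("]]>", "]]]]><![CDATA[>")
    return f"{match.group(1)}<![CDATA[{body}]]>{match.group(3)}"


def wrap_material(src, dst):
    """Copy src to dst one line at a time, so memory use stays small."""
    for line in src:
        dst.write(MATERIAL.sub(_wrap, line))


# Small demonstration on the broken line from the question.
sample = io.StringIO(
    "<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>\n"
)
result = io.StringIO()
wrap_material(sample, result)
print(result.getvalue(), end="")
```

For a real feed, `src` and `dst` would be files opened with `encoding="utf-8"`. Because only one line is held in memory at a time, the file size should not matter much, but the single-line assumption needs checking against the actual feeds first.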
#2
0
(copied from https://*.com/a/17903058/287948)
Summarizing as two steps:
- Use Tidy to transform "free HTML" into "good XHTML".
- Use the XML Parser to parse the XHTML as XML through its SAX API.
First use Tidy (!) to transform "free HTML" into XHTML (or whenever you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). If the file is very big, set the maximum execution time to a few minutes.
Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairfile method.
Once you have "trusted XHTML", use SAX. How do you use SAX with PHP?
Parse the XML with a standard SAX API. In PHP this is implemented by LibXML (see LibXML2 at xmlsoft.org), and its interface is PHP's XML Parser, which is close to the standard SAX API.
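To make the callback style of SAX concrete, here is a minimal sketch. It uses Python's standard `xml.sax` module purely as an illustration of the event model, not PHP's XML Parser, and the `TitleHandler` class and the inline `feed` are invented for this example (the feed is the one from the question with g:material already repaired). The parser calls the handler as it streams through the input, so the whole document is never held in memory.

```python
import io
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> element while the parser streams."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None means "not currently inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate the pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None


# Hypothetical repaired feed: g:material content is enclosed in CDATA.
feed = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><![CDATA[<ul><li>50 % Coton</li><li>50% Lyocell</li>]]></g:material>
</item>
</rss>"""

handler = TitleHandler()
xml.sax.parse(io.BytesIO(feed), handler)
print(handler.titles)
```

PHP's XML Parser follows the same pattern of start-element, end-element, and character-data callbacks, although the function names and registration mechanics differ.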
Another way to use the "SAX of LibXML2", through a different interface (a PHP iterator instead of the traditional SAX interface), is XMLReader. See this explanation about "XMLReader use SAX".
Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!). See this old but good introduction.