Summary: I am looking for a fast XML parser (most likely a wrapper around some standard SAX parser) which will produce per-record data structures 100% identical to those produced by XML::Simple.
Details:
We have a large code infrastructure which depends on processing records one by one and expects each record to be a data structure in the format produced by XML::Simple, which it has used since the early Jurassic era.
An example simple XML is:
<root>
<rec><f1>v1</f1><f2>v2</f2></rec>
<rec><f1>v1b</f1><f2>v2b</f2></rec>
<rec><f1>v1c</f1><f2>v2c</f2></rec>
</root>
And rough example code is:
sub process_record {
    my ($obj, $record_hash) = @_;
    # do_stuff
}
my $records = XML::Simple->XMLin(@args)->{rec};  # with default options XMLin drops the <root> wrapper
foreach my $record (@$records) { $obj->process_record($record) }
As everyone knows, XML::Simple is, well, simple. More importantly, it is very slow and a memory hog, because it is a DOM parser and needs to build and store 100% of the data in memory. So it's not the best tool for parsing, record by record, an XML file consisting of a large number of small records.
However, rewriting the entire code (which consists of a large number of "process_record"-like methods) to work with a standard SAX parser seems like a big task not worth the resources, even at the cost of living with XML::Simple.
I'm looking for an existing module, probably based on a SAX parser (or anything fast with a small memory footprint), that can produce $record hashrefs one by one from the XML shown above. Each hashref must be passable to $obj->process_record($record) and be 100% identical to what XML::Simple's hashrefs would have been.
I don't care much what the interface of the new module is, e.g. whether I need to call next_record() or give it a callback coderef accepting a record.
3 Solutions
#1
7
XML::Twig has a simplify method which you can call on an XML element and which, according to the docs, will:
Return a data structure suspiciously similar to XML::Simple's
Here is an example:
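use strict;
use warnings;
use feature 'say';    # needed for say()
use XML::Twig;
use Data::Dumper;

my $twig = XML::Twig->new(
    twig_handlers => {
        rec => \&rec,    # called once for each complete <rec> element
    }
)->parsefile( 'data.xml' );

sub rec {
    my ($twig, $rec) = @_;
    my $data = $rec->simplify;    # XML::Simple-like structure for this record
    say Dumper $data;
    $rec->purge;                  # release the memory held by this record
}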
NB. The $rec->purge cleans out the record immediately from memory.
Running this against your XML example produces this:
$VAR1 = {
'f1' => 'v1',
'f2' => 'v2'
};
$VAR1 = {
'f1' => 'v1b',
'f2' => 'v2b'
};
$VAR1 = {
'f1' => 'v1c',
'f2' => 'v2c'
};
Which I hope is suspiciously like what comes out of XML::Simple :)
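Since the question asks for hashrefs that are 100% identical, it may be worth checking that claim against your own data rather than trusting "suspiciously similar". Below is a minimal sketch of such a check. The each_record helper and its callback interface are hypothetical names made up for illustration, and the comparison assumes XML::Simple is called with default options (if your real code passes options to XMLin, simplify accepts the same options).

```perl
use strict;
use warnings;
use XML::Twig;
use XML::Simple ();
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order so dumps can be compared as strings

my $xml = <<'XML';
<root>
<rec><f1>v1</f1><f2>v2</f2></rec>
<rec><f1>v1b</f1><f2>v2b</f2></rec>
<rec><f1>v1c</f1><f2>v2c</f2></rec>
</root>
XML

# Hypothetical helper (not part of XML::Twig): stream <rec> elements
# and hand each simplified record to a callback, one at a time.
sub each_record {
    my ($source, $callback) = @_;
    XML::Twig->new(
        twig_handlers => {
            rec => sub {
                my ($twig, $rec) = @_;
                my $data = $rec->simplify;    # XML::Simple-like structure
                $callback->($data);
                $twig->purge;                 # drop what has been parsed so far
            },
        },
    )->parse($source);
    return;
}

# Collect the streamed records (real code would call process_record here)
my @streamed;
each_record( $xml, sub { push @streamed, $_[0] } );

# Reference result: what XML::Simple builds with default options
my $expected = XML::Simple::XMLin($xml)->{rec};

my $same = Dumper( \@streamed ) eq Dumper($expected);
print $same ? "identical\n" : "different\n";
```

In production you would replace the collecting callback with sub { $obj->process_record($_[0]) }. Run the comparison on a sample of real records first, since attributes, repeated elements and mixed content are where the two modules are most likely to differ.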
/I3az/
#2
6
As the author of XML::Simple, I'd just like to correct some misconceptions in your question.
XML::Simple isn't a DOM parser; in fact, it isn't a parser at all. It delegates all parsing duties to either a SAX parser or XML::Parser. The speed of parsing will depend on which parser module is the default on your system. When you run 'make test' for the XML::Simple distribution, the output will list the default parser.
If the default parser on your system is XML::SAX::PurePerl then it will be slow and, more importantly, buggy too. If that's the case then I'd recommend installing either XML::SAX::Expat or XML::SAX::ExpatXS for an immediate speed-up. (Whichever SAX parser is installed last will be the default from that point on.)
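If you'd rather check from a script than from 'make test' output, something like the following sketch should show which SAX parsers are registered and which one the factory hands out by default. It also pins XML::Simple to XML::Parser through the documented $XML::Simple::PREFERRED_PARSER variable, which assumes XML::Parser is installed on your system.

```perl
use strict;
use warnings;
use XML::SAX;
use XML::SAX::ParserFactory;
use XML::Simple ();

# List every SAX parser registered on this system
print "Registered SAX parser: $_->{Name}\n" for @{ XML::SAX->parsers };

# Class of the parser object the factory returns by default
my $default_class = ref( XML::SAX::ParserFactory->parser );
print "Default SAX parser: $default_class\n";

# Pin XML::Simple to XML::Parser instead of the SAX default
# (assumption: XML::Parser is installed)
$XML::Simple::PREFERRED_PARSER = 'XML::Parser';

my $data = XML::Simple::XMLin('<rec><f1>v1</f1><f2>v2</f2></rec>');
print "f1 = $data->{f1}\n";
```

If the default turns out to be XML::SAX::PurePerl, that alone may explain most of the slowness.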
Having said that, your requirements are a bit contradictory: you want something that returns your whole document as a hash, and yet you don't want a parser that slurps the whole document into memory.
I understand your short-term goals, but as a longer-term solution, I'd recommend migrating your code to XML::LibXML. It is a DOM parser but it's very fast because all the grunt work is done in C. Best of all, the built-in XPath support makes it even simpler to use than XML::Simple - see this article.
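To give a feel for the XPath style, here is a rough sketch against the sample XML from the question. It builds the record hashes by hand, so the field names f1 and f2 are hard-coded and the result is not guaranteed to match XML::Simple's output for more complex records.

```perl
use strict;
use warnings;
use XML::LibXML;

my $xml = <<'XML';
<root>
<rec><f1>v1</f1><f2>v2</f2></rec>
<rec><f1>v1b</f1><f2>v2b</f2></rec>
<rec><f1>v1c</f1><f2>v2c</f2></rec>
</root>
XML

# Parse the whole document into a DOM tree (done in C by libxml2)
my $doc = XML::LibXML->load_xml( string => $xml );

# Select every <rec> under <root> with XPath and pull out its fields
my @records;
for my $rec ( $doc->findnodes('/root/rec') ) {
    my $f1 = $rec->findvalue('f1');    # text content of the <f1> child
    my $f2 = $rec->findvalue('f2');    # text content of the <f2> child
    push @records, { f1 => $f1, f2 => $f2 };
}

print "$_->{f1} / $_->{f2}\n" for @records;
```

Note that this still holds the whole document in memory, so it addresses speed rather than memory footprint.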