I am using XML::Twig to parse through a very large XML document. I want to split it into chunks based on the <change></change> tags.
Right now I have:
my $xml = XML::Twig->new(twig_handlers => { 'change' => \&parseChange, });
$xml->parsefile($LOGFILE);

sub parseChange {
    my ($xml, $change) = @_;
    my $message = $change->first_child('message');
    my @lines = $message->children_text('line');
    foreach (@lines) {
        if ($_ =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/) {
            print outputData "$_\n";
        }
    }
    outputData->flush();
    $change->purge;
}
Right now this runs the parseChange method each time it pulls a <change> block from the XML, and it is extremely slow. I tested it against reading the XML from a file with $/ set to </change> and writing a function to return the contents of an XML tag, and that went much faster.
Is there something I'm missing, or am I using XML::Twig incorrectly? I'm new to Perl.
EDIT: Here is an example change from the changes file. The file consists of a lot of these one right after the other and there should not be anything in between them:
<change>
  <project>device_common</project>
  <commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
  <tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
  <parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
  <author_name>Jean-Baptiste Queru</author_name>
  <author_e-mail>jbq@google.com</author_e-mail>
  <author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
  <commiter_name>Jean-Baptiste Queru</commiter_name>
  <commiter_email>jbq@google.com</commiter_email>
  <committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
  <subject>chmod the output scripts</subject>
  <message>
    <line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
  </message>
  <target>
    <line>generate-blob-scripts.sh</line>
  </target>
</change>
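For comparison, the faster record-separator approach mentioned in the question might look roughly like the sketch below. The sample data, the `@hits` array and the in-memory filehandle are assumptions made for illustration; the original helper function was not shown. Note that this treats the XML as plain text, so it would not decode entities or cope with CDATA sections or attributes on the tags.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real log file: two tiny <change> records.
my $sample = <<'XML';
<change>
<message>
<line>Fix bug 1234 in the parser</line>
<line>Change-Id: Iae22c670</line>
</message>
</change>
<change>
<message>
<line>Debugging helpers only</line>
</message>
</change>
XML

# Read from an in-memory filehandle; a real run would open the file instead.
open my $in, '<', \$sample or die "Cannot open sample: $!";

my @hits;
{
    # Each read now returns everything up to and including "</change>".
    local $/ = '</change>';
    while (my $record = <$in>) {
        # Take the first <message> block, then test each <line> inside it.
        my ($message) = $record =~ m{<message>(.*?)</message>}s or next;
        for my $line ($message =~ m{<line>(.*?)</line>}gs) {
            push @hits, $line
                if $line =~ /[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/;
        }
    }
}
close $in;

print "$_\n" for @hits;
```

Only one chunk is held in memory at a time, which is probably why this approach feels fast: no tree is built for the fields that are never inspected.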
5 Answers
#1
3
As it stands, your program is processing all of the XML document, including the data outside the change elements that you aren't interested in.
If you change the twig_handlers parameter in your constructor to twig_roots, then tree structures will be built only for the elements of interest and the rest will be ignored.
my $xml = XML::Twig->new(twig_roots => { change => \&parseChange });
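A minimal sketch of how that might look end to end, assuming a `<changes>` root and a `<header>` element that the handler should never see. Both of those, along with the inline sample document and the `@seen` array, are made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature document; <changes> and <header> are assumptions.
my $xml = <<'XML';
<changes>
  <header><note>not interesting</note></header>
  <change><message><line>Fix bug 42</line></message></change>
  <change><message><line>Tidy whitespace</line></message></change>
</changes>
XML

my @seen;
my $twig = XML::Twig->new(
    twig_roots => {
        # Only <change> subtrees are built; the handler runs once per element.
        change => sub {
            my ($t, $change) = @_;
            push @seen, $change->first_child('message')->children_text('line');
            $t->purge;    # release what has been parsed so far
        },
    },
);
$twig->parse($xml);

print "$_\n" for @seen;
```

For a real file, `parsefile` would replace `parse`, as in the question.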
#2
1
XML::Twig includes a mechanism by which you can handle tags as they appear, then discard what you no longer need to free memory.
Here is an example taken from the documentation (which also has a lot more helpful information):
my $t= XML::Twig->new( twig_handlers =>
                        { section => \&section,
                          para    => sub { $_->set_tag( 'p'); }
                        },
                     );
$t->parsefile( 'doc.xml');

# the handler is called once a section is completely parsed, ie when
# the end tag for section is found, it receives the twig itself and
# the element (including all its sub-elements) as arguments
sub section
  { my( $t, $section)= @_;      # arguments for all twig_handlers
    $section->set_tag( 'div');  # change the tag name.4, my favourite method...
    # let's use the attribute nb as a prefix to the title
    my $title= $section->first_child( 'title'); # find the title
    my $nb= $title->att( 'nb'); # get the attribute
    $title->prefix( "$nb - ");  # easy isn't it?
    $section->flush;            # outputs the section and frees memory
  }
This will probably be essential when working with a multi-gigabyte file, because (again, according to the documentation) storing the entire thing in memory can take as much as 10 times the size of the file.
Edit: A couple of comments based on your edited question. Without knowing more about your file structure, it is not clear exactly what is slowing you down, but here are a few things to try:
- Flushing the output filehandle will slow you down if you are writing a lot of lines. Perl buffers file writes specifically for performance reasons, and you are bypassing that.
- Instead of using the (?i) mechanism, a rather advanced feature that probably has a performance penalty, why not make the whole match case-insensitive? /[^a-z0-9]bug[^a-z0-9]/i is equivalent. You also might be able to simplify it with /\bbug\b/i, which is nearly equivalent: the differences are that an underscore next to "bug" no longer counts as a separator, and that \b also matches at the very start or end of the line, where the original pattern requires an actual character on each side.
- There are a couple of other simplifications that can be made as well to remove intermediate steps.
How does this handler code compare to yours speed-wise?
sub parseChange
{
    my ($xml, $change) = @_;
    # children_text('line') returns one string per <line>, so each line is tested separately
    foreach (grep /[^a-z0-9]bug[^a-z0-9]/i, $change->first_child('message')->children_text('line'))
    {
        print outputData "$_\n";
    }
    $change->purge;
}
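To see how the three patterns discussed above behave, here is a small self-contained check. The sample strings and the `%patterns` and `%matched` names are invented for illustration; they are not taken from the real log.

```perl
use strict;
use warnings;

# Hypothetical sample lines, chosen to hit each edge case.
my @samples = (
    'Fix a BUG in parser',     # plain word, upper case
    'see _bug_ marker',        # underscores on both sides
    'bug at start of line',    # no character before "bug"
    'debugging output',        # "bug" inside a longer word
);

my %patterns = (
    original   => qr/[^a-zA-Z0-9](?i)bug(?-i)[^a-zA-Z0-9]/,
    simplified => qr/[^a-z0-9]bug[^a-z0-9]/i,
    boundary   => qr/\bbug\b/i,
);

# Record, for each pattern, which samples it matches.
my %matched;
for my $name (sort keys %patterns) {
    $matched{$name} = [ grep { $_ =~ $patterns{$name} } @samples ];
    print "$name:\n";
    print "  $_\n" for @{ $matched{$name} };
}
```

The first two patterns should pick the same lines, while the `\b` version drops the underscore case and gains the start-of-line case. Whether that difference matters depends on what the commit messages actually look like.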
#3
0
If your XML is really big, use XML::SAX. It doesn't have to load the entire data set into memory; instead, it reads the file sequentially and generates callback events for every tag. I have successfully used XML::SAX to parse XML files larger than 1 GB. Here is an example of an XML::SAX handler for your data:
#!/usr/bin/env perl
package Change::Extractor;
use 5.010;
use strict;
use warnings qw(all);
use base qw(XML::SAX::Base);

sub new {
    bless { data => '', path => [] }, shift;
}

# Reset the text buffer and remember where we are in the document
sub start_element {
    my ($self, $el) = @_;
    $self->{data} = '';
    push @{$self->{path}} => $el->{Name};
}

sub end_element {
    my ($self, $el) = @_;
    # Compare the element path as a string; this avoids the experimental smartmatch operator (~~)
    if (join('/', @{$self->{path}}) eq 'change/message/line') {
        say $self->{data};
    }
    pop @{$self->{path}};
}

# Text may arrive in several pieces, so accumulate it
sub characters {
    my ($self, $data) = @_;
    $self->{data} .= $data->{Data};
}

1;

package main;
use strict;
use warnings qw(all);
use XML::SAX::PurePerl;

my $handler = Change::Extractor->new;
my $parser = XML::SAX::PurePerl->new(Handler => $handler);
$parser->parse_file(\*DATA);
__DATA__
<?xml version="1.0"?>
<change>
<project>device_common</project>
<commit_hash>523e077fb8fe899680c33539155d935e0624e40a</commit_hash>
<tree_hash>598e7a1bd070f33b1f1f8c926047edde055094cf</tree_hash>
<parent_hashes>71b1f9be815b72f925e66e866cb7afe9c5cd3239</parent_hashes>
<author_name>Jean-Baptiste Queru</author_name>
<author_e-mail>jbq@google.com</author_e-mail>
<author_date>Fri Apr 22 08:32:04 2011 -0700</author_date>
<commiter_name>Jean-Baptiste Queru</commiter_name>
<commiter_email>jbq@google.com</commiter_email>
<committer_date>Fri Apr 22 08:32:04 2011 -0700</committer_date>
<subject>chmod the output scripts</subject>
<message>
<line>Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f</line>
</message>
<target>
<line>generate-blob-scripts.sh</line>
</target>
</change>
Outputs
Change-Id: Iae22c67066ba4160071aa2b30a5a1052b00a9d7f
#4
0
Not an XML::Twig answer, but ...
If you're going to extract stuff from XML files, you might want to consider XSLT. Using xsltproc and the following XSL stylesheet, I got the bug-containing change lines out of 1 GB of <change>s in about a minute. Lots of improvements possible, I'm sure.
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:template match="/">
    <xsl:apply-templates select="changes/change/message/line"/>
  </xsl:template>
  <xsl:template match="line">
    <xsl:variable name="lower" select="translate(.,$uppercase,$lowercase)"/>
    <xsl:if test="contains($lower,'bug')">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
If your XML processing can be done as
- extract to plain text
- wrangle flattened text
- profit
then XSLT may be the tool for the first step in that process.
#5
0
Mine's taking a horrifically long time.
my $twig=XML::Twig->new
(
    twig_handlers =>
    {
        SchoolInfo => \&schoolinfo,
    },
    pretty_print => 'indented',
);
$twig->parsefile( 'data/SchoolInfos.2018-04-17.xml');

sub schoolinfo {
    my( $twig, $l)= @_;
    my $rec = {
        name  => $l->field('SchoolName'),
        refid => $l->{'att'}->{RefId},
        phone => $l->field('SchoolPhoneNumber'),
    };
    for my $node ( $l->findnodes( '//Street' ) )    { $rec->{street}   = $node->text; }
    for my $node ( $l->findnodes( '//Town' ) )      { $rec->{city}     = $node->text; }
    for my $node ( $l->findnodes( '//PostCode' ) )  { $rec->{postcode} = $node->text; }
    for my $node ( $l->findnodes( '//Latitude' ) )  { $rec->{lat}      = $node->text; }
    for my $node ( $l->findnodes( '//Longitude' ) ) { $rec->{lng}      = $node->text; }
}
Is it the pretty_print, perchance? Otherwise it's pretty straightforward.
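One thing that may be worth trying here, going by the purge advice in the earlier answers: the handler never purges, so the tree keeps growing, and an XPath that begins with `//` may be searched from the top of that tree rather than from the current element. The sketch below is untested against the real file. The sample XML, its `<Address>` nesting and the `@records` array are assumptions for illustration only.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical miniature input; the real element nesting is an assumption.
my $xml = <<'XML';
<SchoolInfos>
  <SchoolInfo RefId="A1">
    <SchoolName>First School</SchoolName>
    <SchoolPhoneNumber>555-0100</SchoolPhoneNumber>
    <Address><Street>1 Main St</Street><Town>Springfield</Town></Address>
  </SchoolInfo>
  <SchoolInfo RefId="B2">
    <SchoolName>Second School</SchoolName>
    <SchoolPhoneNumber>555-0199</SchoolPhoneNumber>
    <Address><Street>2 High St</Street><Town>Shelbyville</Town></Address>
  </SchoolInfo>
</SchoolInfos>
XML

my @records;
my $twig = XML::Twig->new(
    twig_handlers => {
        SchoolInfo => sub {
            my ($t, $school) = @_;
            my %rec = (
                name  => $school->field('SchoolName'),
                refid => $school->att('RefId'),
                phone => $school->field('SchoolPhoneNumber'),
            );
            # Look only inside the current element, not the whole tree.
            for my $pair ([street => 'Street'], [city => 'Town']) {
                my ($key, $tag) = @$pair;
                my $node = $school->first_descendant($tag);
                $rec{$key} = $node->text if $node;
            }
            push @records, \%rec;
            $t->purge;    # discard the elements already handled
        },
    },
);
$twig->parse($xml);

printf "%s (%s): %s, %s\n", @{$_}{qw(name refid street city)} for @records;
```

The remaining fields (PostCode, Latitude, Longitude) would follow the same pattern. The pretty_print option only affects output, so it seems an unlikely cause unless the document is being printed.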