如何使用PERL脚本过滤XML中的节点

This question is all over the internet but all the examples I see do not take into account my apparently unique circumstances. Here is an excerpt from my XML:

这个问题在网上随处可见，但我看到的所有例子都没有考虑到我明显独特的环境。以下是我的XML摘录:

<message type="error" from="Realtime" timestamp="Mon Nov 24 19:28:55 2014"> Could not receive from Loader </message>
<message type="warning" from="Dcd_Mux" timestamp="Mon Dec  1 02:31:18 2014"> Could not connect to Dcd </message>

Instead of having several levels of nodes, I just have several attributes on a message node. I want to be able to filter out nodes based on an argument to my Perl script. For example: If I wanted to filter out all messages with type="error", and I was using an XML that only had the 2 lines from above, my output would only be the warning message from above. Output shown here:

不是有几个级别的节点，而是在消息节点上有几个属性。我希望能够根据我的Perl脚本的参数来过滤掉节点。例如:如果我想用type="error"过滤掉所有的消息，并且我使用的XML只有上面的两行，那么我的输出将仅仅是上面的警告消息。输出所示:

<message type="warning" from="Dcd_Mux" timestamp="Mon Dec  1 02:31:18 2014"> Could not connect to Dcd </message>

I need some direction on how to begin opening the XML, looping through the entire thing, and removing any nodes that have attributes that match my filter. I'm interested in using LibXML to get this done.

我需要一些关于如何开始打开XML的指导，如何在整个过程中循环，以及如何删除任何具有与我的过滤器匹配的属性的节点。我有兴趣使用LibXML来完成这项工作。

4 个解决方案

#1

I use XML::LibXML as my XML parser.

我使用XML::LibXML作为我的XML解析器。

use XML::LibXML qw( );

die "usage\n" if @ARGV != 2;

my ($type, $qfn) = @ARGV;
my $doc = XML::LibXML->new->parse_file($qfn);
for my $node ($doc->findnodes('//message') {
   my $type_addr = $node->getAttribute('type');
   next if !$type_addr || $type_addr ne $type;

   $node->parentNode->removeChild($node);
}

$doc->toFile($qfn);

#2

It could look something like this using XML::LibXML:

使用XML::LibXML:

use strict;
use warnings; 

use XML::LibXML;

my $filename = $ARGV[0] 
   or die "Missing XML filename to parse";
my $type = $ARGV[1] 
   or die "Missing type of node to exclude";

open(my $xml_file, '<', $filename) 
   or die "Cannot open XML file '$filename' for reading: $!";

my $dom = XML::LibXML->load_xml(IO => $xml_file);
NODE:
foreach my $message_node ( $dom->findnodes('/root/message') ) {
   next NODE 
      unless $message_node->hasAttribute('type');

   $message_node->unbindNode() 
      if $message_node->getAttribute('type') eq $type;
}
$dom->toFile($filename);

#3

There's two elements to your problem - first building a filter criteria, and the selecting or deleting elements based on it.

问题中有两个元素——首先构建筛选条件，然后基于筛选条件选择或删除元素。

In particular - mixing 'add' and 'remove' can be quite difficult, because deciding what to do if they don't apply or contradict can be rather annoying.

特别是，混合“添加”和“删除”可能会非常困难，因为如果它们不适用或相互矛盾，那么决定该怎么做可能会非常烦人。

Anyway, I'm offering XML::Twig despite that not being precisely what you've asked for - because I've used it a fair bit, and haven't really touched LibXML.

无论如何，我提供的是XML::Twig，尽管它并不是您所要求的那样——因为我已经使用了它很多次，而且还没有真正接触到LibXML。

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

#read these from ARGV, just here as example.
my @sample_filters = qw ( -type=error
                          -from=Not_Dcd_Mux );

my %exclude;
for (@sample_filters) {
    if (m/^-/) {
        my ( $att, $criteria ) = (
            m/^-     #starts with -
              (\w+)  #word
              =     
              (\w+)
              $      #end of string
              /x
        );
        next unless $att;
        $exclude{$att} = $criteria;
    }
}

#process_message is called for each 'message' element, and tests filters for exclusion.
sub process_message {
    my ( $twig, $message ) = @_;
    foreach my $att ( keys %exclude ) {
        if ( $message->att($att) eq $exclude{$att} ) {
            $message->delete();
            last;
        }
    }
}

my $twig = XML::Twig->new(
    pretty_print  => 'indented',
    twig_handlers => { 'message' => \&process_message }
);
$twig->parse( \*DATA ); #might use 'parsefile ( $filename )' or 'STDIN' instead
$twig->print;


__DATA__
<XML>
<message type="error" from="Realtime" timestamp="Mon Nov 24 19:28:55 2014"> Could not receive from Loader </message>
<message type="warning" from="Not_Dcd_Mux" timestamp="Mon Dec  1 02:31:18 2014"> Could not connect to Dcd </message>
<message type="warning" from="Dcd_Mux" timestamp="Mon Dec  1 02:31:18 2014"> Could not connect to Dcd </message>
</XML>

#4

This solution is a variation on the one from Hunter McMillen, and is here largely to illustrate what I meant by "looks like a Java program written in Perl".

这个解决方案是Hunter McMillen的一个变体，在这里主要是为了说明我所说的“看起来像用Perl编写的Java程序”。

The parameter validation is part of it and, while I have reduced it to a simple count check, I would not normally write anything at all. It is of doubtful worth as the question is about how to process the data, and any such trimmings depend on who will be using the program and how often.

参数验证是其中的一部分，虽然我将它简化为一个简单的计数检查，但我通常不会编写任何内容。值得怀疑的是，这个问题是关于如何处理数据的，而任何这样的修饰都取决于谁将使用这个程序以及频率。

I have chosen to serialize the output and print it to STDOUT, as it is often more useful to be able to redirect output as required on the command line.

我选择序列化输出并将其打印到STDOUT，因为能够根据命令行上的要求重定向输出通常更有用。

I recognized what I thought was a Java-style approach by the attention to verification and general "protecting me from myself". I don't believe that adding a label and using it in next is at all helpful, especially with such a short loop.

通过对验证的关注和对“保护我不受自己伤害”的概括，我认识到了我认为的java风格的方法。我不相信在next中添加标签并使用它会有任何帮助，特别是对于这样一个短的循环。

use strict;
use warnings; 

use XML::LibXML::PrettyPrint;

@ARGV == 2 or die <<END_USAGE;
Usage:
  $0 <XML file> <node type>
END_USAGE

my ($xml_file, $exclude_type) = @ARGV;

my $dom = XML::LibXML->load_xml(location => $xml_file);

for my $node ( $dom->findnodes('/root/message[@type]') ) {
  my $type = $node->getAttribute('type');
  $node->unbindNode if $type eq $exclude_type;
}

local $XML::LibXML::skipXMLDeclaration = 1;
my $pp = XML::LibXML::PrettyPrint->new;
print $pp->pretty_print($dom)->toString;

output

输出

<root>
  <message type="warning" from="Dcd_Mux" timestamp="Mon Dec  1 02:31:18 2014">
    Could not connect to Dcd
  </message>
</root>

#1