How to read the numbers from extracted lines (dropping repeated values)

Date: 2021-05-04 08:59:18

I need to convert parts of a .txt file that are in this format (found by first matching "SchDay"):

<SchDay>
  <Name>School Occup WD</Name>
  <Type>Fraction</Type>
  <Hr index="0">0</Hr>
  <Hr index="1">0</Hr>
  <Hr index="2">0</Hr>
  <Hr index="3">0</Hr>
  <Hr index="4">0</Hr>
  <Hr index="5">0</Hr>
  <Hr index="6">0</Hr>
  <Hr index="7">0.05</Hr>
  <Hr index="8">0.75</Hr>
  ....

to look like this (values come first, and “steps” need just the 2 ends defined):

0.00, 0.00,

0.00, 6.00,    <- end of step

0.05, 7.00,

0.75, 8.00,

...

Etc

This is what I have so far:

open (OUTFILE, ">C:/begperl/parts/all1.txt")|| die "Can't open it";

my @files = glob ("*.txt");

for (@files) {

    open (INFILE, $_) || die "can't open infile";
    @lines = <INFILE>;
    my %answer;
    $regex = '<SchDay';
    for my $idx (0..$#lines) {
    if ($lines[$idx] =~ /$regex/) {
        for $ii (($idx + 3)..($idx + 26)){
        {$answer{$ii} = ($lines[$ii]);}
        }
    }
    foreach $key (sort keys %answer) { print OUTFILE "$answer{$key}\n" }
    }
close (INFILE);}

So I have the lines I want. Now I need to extract just the numbers, including decimal points, and then delete consecutive hours with the same values.

1 Answer

#1


1  

Your document has an XML structure. You are much better off exploiting that by using a proper XML parser. XML::Twig lets you easily isolate the parts of an XML document you are interested in. In this case, all we want are the <Hr> elements that occur within <SchDay> elements:

my $parser = XML::Twig->new(
    twig_roots => { 'SchDay/Hr' => \&do_print },
);

This tells the parser to invoke the do_print sub for each <Hr> within a <SchDay>. do_print is called with two arguments: the parser instance we just created and the element. Use $element->att('index') to read the index attribute and $element->text to get the element's text content, then format and print them. Here is the script:

#!/usr/bin/env perl

use strict;
use warnings;

use XML::Twig;

my $parser = XML::Twig->new(
    twig_roots => { 'SchDay/Hr' => \&do_print },
);

$parser->parse(\*DATA);

sub do_print {
    my $parser = shift;
    my $element = shift;

    printf "%.2f, %.2f,\n",
        $element->text,
        $element->att('index'),
    ;
    $parser->purge;
    return;
}

__DATA__
<SchDay>
  <Name>School Occup WD</Name>
  <Type>Fraction</Type>
  <Hr index="0">0</Hr>
  <Hr index="1">0</Hr>
  <Hr index="2">0</Hr>
  <Hr index="3">0</Hr>
  <Hr index="4">0</Hr>
  <Hr index="5">0</Hr>
  <Hr index="6">0</Hr>
  <Hr index="7">0.05</Hr>
  <Hr index="8">0.75</Hr>
</SchDay>

Output:

0.00, 0.00,
0.00, 1.00,
0.00, 2.00,
0.00, 3.00,
0.00, 4.00,
0.00, 5.00,
0.00, 6.00,
0.05, 7.00,
0.75, 8.00,
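
The output above still lists every hour, while the question asks for only the two ends of each step. One way to get there is to drop any row that sits strictly inside a run of equal values. The following is a minimal sketch under that assumption; collapse_steps is a hypothetical helper name, not part of XML::Twig, and it expects the rows to be collected as [index, value] pairs first.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: reduce [index, value] pairs to the end points of each step.
sub collapse_steps {
    my @pairs = @_;
    my @kept;
    for my $i (0 .. $#pairs) {
        my $value        = $pairs[$i][1];
        my $same_as_prev = $i > 0       && $pairs[$i - 1][1] == $value;
        my $same_as_next = $i < $#pairs && $pairs[$i + 1][1] == $value;

        # Drop a row only when both neighbours carry the same value.
        next if $same_as_prev && $same_as_next;
        push @kept, $pairs[$i];
    }
    return @kept;
}

# Same data as the sample document: hours 0..6 are 0, then 0.05 and 0.75.
my @hours = map { [ $_, 0 ] } 0 .. 6;
push @hours, [ 7, 0.05 ], [ 8, 0.75 ];

printf "%.2f, %.2f,\n", $_->[1], $_->[0] for collapse_steps(@hours);
```

For the sample data this prints the rows for hours 0, 6, 7 and 8, which matches the layout requested in the question. In the script above, do_print would push each pair onto an array instead of printing, and the collapsing would run after parsing finishes.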

As for what needs to be fixed with your code … Here are some points I hope will help you write better Perl:

open (OUTFILE, ">C:/begperl/parts/all1.txt")|| die "Can't open it";
  • Don't use bareword filehandles such as OUTFILE. They are package variables, which means they are subject to action at a distance. Instead, declare a lexical variable in the smallest applicable scope, as in:

     my $filename = 'C:/begperl/parts/all1.txt';
    
     open my $outfile, '>', $filename
          or die "Failed to open '$filename': $!";
    
  • Do name the loop variable in for loops:

     for my $input_file (@files) {
          open my $input, '<', $input_file
              or die "Failed to open '$input_file': $!";
    
  • Don't slurp when line-by-line processing will do. That is, don't use @lines = <INFILE>; to read all of the lines of the file in one go.

  • Don't use magic constants such as the 3 and the 26 in your inner loop. Instead, give them names. For example:

           use Const::Fast;
           const my $HR_BEGIN => 3;
           const my $HR_END   => 26;
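
Put together, the points above might look like the sketch below. It makes a few assumptions of its own: it uses the core constant pragma as a stand-in for Const::Fast so it runs without extra modules, it reads a small made-up sample through an in-memory filehandle instead of a real file, and the regular expression assumes each <Hr> element sits on a single line.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Core-Perl stand-in for the Const::Fast constants above.
use constant {
    HR_BEGIN => 3,     # first <Hr> line, counted from the <SchDay> line
    HR_END   => 26,    # last <Hr> line
};

# Hypothetical sample input, read through an in-memory filehandle.
my $sample = <<'XML';
<SchDay>
  <Name>School Occup WD</Name>
  <Type>Fraction</Type>
  <Hr index="0">0</Hr>
  <Hr index="1">0.05</Hr>
</SchDay>
XML

open my $input, '<', \$sample
    or die "Failed to open in-memory sample: $!";

my @rows;
my $offset;    # lines since the most recent <SchDay>; undef outside a block
while (my $line = <$input>) {
    if ($line =~ /<SchDay>/) { $offset = 0; next; }
    next unless defined $offset;
    $offset++;
    if ($line =~ m{</SchDay>}) { undef $offset; next; }
    next if $offset < HR_BEGIN || $offset > HR_END;

    # Keep only the index attribute and the numeric text of the element.
    if ($line =~ m{<Hr index="(\d+)">([\d.]+)</Hr>}) {
        push @rows, sprintf '%.2f, %.2f,', $2, $1;
    }
}
close $input;

print "$_\n" for @rows;
```

Only one line is held in memory at a time, and the two offsets now have names, but the loop still depends on the exact line layout of the file.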
    

But that is still too fragile. What if the number of <Hr> lines changes? After all, this is an XML document, and the next batch could just as easily contain:

<Hr index="5">
   0.00
</Hr>

What do you do then?
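
With the parser-based approach, a layout change like this needs little or no extra work, because the handler sees elements rather than lines. The sketch below is one hedged way to show that: it feeds a small made-up document to XML::Twig as a string and uses the element's trimmed_text method so the whitespace around the value is discarded. The @rows array and the anonymous handler are illustrative choices, not part of the script above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;

my @rows;    # collected [value, index] pairs

my $parser = XML::Twig->new(
    twig_roots => {
        'SchDay/Hr' => sub {
            my ($twig, $element) = @_;

            # trimmed_text drops the whitespace surrounding the value,
            # so a value on its own line is read the same way.
            push @rows, [ $element->trimmed_text, $element->att('index') ];
            $twig->purge;
            return;
        },
    },
);

# Hypothetical input mixing the one-line and the multi-line layout.
$parser->parse(<<'XML');
<SchDay>
  <Hr index="4">0</Hr>
  <Hr index="5">
     0.00
  </Hr>
</SchDay>
XML

printf "%.2f, %.2f,\n", @$_ for @rows;
```

Both <Hr> elements end up as rows in the same format, regardless of how many lines each one spans in the file.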
