I need to convert parts of a .txt file in this format (first by matching "SchDay")
<SchDay>
<Name>School Occup WD</Name>
<Type>Fraction</Type>
<Hr index="0">0</Hr>
<Hr index="1">0</Hr>
<Hr index="2">0</Hr>
<Hr index="3">0</Hr>
<Hr index="4">0</Hr>
<Hr index="5">0</Hr>
<Hr index="6">0</Hr>
<Hr index="7">0.05</Hr>
<Hr index="8">0.75</Hr>
....
to look like this (values come first, and “steps” need just the 2 ends defined):
0.00, 0.00,
0.00, 6.00, <- end of step
0.05, 7.00,
0.75, 8.00,
...
Etc
This is what I have so far:
open (OUTFILE, ">C:/begperl/parts/all1.txt")|| die "Can't open it";
my @files = glob ("*.txt");
for (@files) {
    open (INFILE, $_) || die "can't open infile";
    @lines = <INFILE>;
    my %answer;
    $regex = '<SchDay';
    for my $idx (0..$#lines) {
        if ($lines[$idx] =~ /$regex/) {
            for $ii (($idx + 3)..($idx + 26)){
                {$answer{$ii} = ($lines[$ii]);}
            }
        }
        foreach $key (sort keys %answer) { print OUTFILE "$answer{$key}\n" }
    }
    close (INFILE);
}
So I have the lines I want. Now I need to extract just the numbers, including decimal points, and then delete consecutive hours with the same values.
1 solution
#1
Your document has an XML structure. You are much better off exploiting that by using a proper XML parser. XML::Twig allows you to easily isolate the parts of an XML document in which you are interested. In this case, all we want are <Hr> elements that occur within <SchDay> elements:
my $parser = XML::Twig->new(
    twig_roots => { 'SchDay/Hr' => \&do_print },
);
This just tells the parser to invoke the do_print sub for each <Hr> within a <SchDay>. do_print will be called with two arguments: the parser instance we just created and the element. Use $element->att('index') to access the value of the index attribute and $element->text to get the text of the element, then format and print them. Here is the script:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $parser = XML::Twig->new(
    twig_roots => { 'SchDay/Hr' => \&do_print },
);
$parser->parse(\*DATA);
sub do_print {
    my $parser  = shift;
    my $element = shift;
    # Value first, then hour index, in the "0.00, 0.00," layout the question asks for
    printf "%.02f, %.02f,\n",
        $element->text,
        $element->att('index'),
    ;
    $parser->purge;
    return;
}
__DATA__
<SchDay>
<Name>School Occup WD</Name>
<Type>Fraction</Type>
<Hr index="0">0</Hr>
<Hr index="1">0</Hr>
<Hr index="2">0</Hr>
<Hr index="3">0</Hr>
<Hr index="4">0</Hr>
<Hr index="5">0</Hr>
<Hr index="6">0</Hr>
<Hr index="7">0.05</Hr>
<Hr index="8">0.75</Hr>
</SchDay>
Output:
0.00, 0.00,
0.00, 1.00,
0.00, 2.00,
0.00, 3.00,
0.00, 4.00,
0.00, 5.00,
0.00, 6.00,
0.05, 7.00,
0.75, 8.00,
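The output above still lists every hour, while the question also asks for each step to keep only its two end points. A minimal sketch of that last step, assuming the values are already collected in hour order; collapse_steps is a hypothetical helper introduced here for illustration, not part of XML::Twig:

```perl
use strict;
use warnings;

# Hypothetical helper: takes hourly values (list position = hour) and
# returns [value, hour] pairs for the first and last hour of each run
# of consecutive equal values.
sub collapse_steps {
    my @values = @_;
    my @points;
    for my $hour (0 .. $#values) {
        # A run starts where the value differs from the previous hour
        my $starts_run = $hour == 0
            || $values[$hour] != $values[$hour - 1];
        # A run ends where the value differs from the next hour
        my $ends_run = $hour == $#values
            || $values[$hour] != $values[$hour + 1];
        push @points, [ $values[$hour], $hour ] if $starts_run || $ends_run;
    }
    return @points;
}

# The nine sample values from the question
printf "%.2f, %.2f,\n", @$_
    for collapse_steps(0, 0, 0, 0, 0, 0, 0, 0.05, 0.75);
```

With XML::Twig, one way to feed this would be to have do_print push $element->text onto an array instead of printing, then call collapse_steps on that array once parsing has finished.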
As for what needs to be fixed with your code … Here are some points I hope will help you write better Perl:
open (OUTFILE, ">C:/begperl/parts/all1.txt")|| die "Can't open it";
- Don't use bareword filehandles such as OUTFILE. They are package variables, which means they are subject to action at a distance. Instead, declare a lexical variable in the smallest applicable scope, as in:
my $filename = 'C:/begperl/parts/all1.txt';
open my $outfile, '>', $filename or die "Failed to open '$filename': $!";
- Do name the loop variable in for loops:
for my $input_file (@files) {
    open my $input, '<', $input_file or die "Failed to open '$input_file': $!";
    # process $input here
}
- Don't slurp when line-by-line processing will do. That is, don't use @lines = <INFILE>; to read all of the lines of the file in one go.
- Don't use magic constants such as the 3 and the 26 in your loop. Instead, give them names. For example:
use Const::Fast;
const my $HR_BEGIN => 3;
const my $HR_END => 26;
But that is still too fragile. What if the number of lines of <Hr> elements changes? After all, this is an XML document, and the next batch could just as easily contain:
<Hr index="5">
0.00
</Hr>
What do you do then?
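For input that really does keep each <Hr> on a single line, the first three points above (lexical filehandles, a named loop variable, no slurping) can be sketched together as follows. This is an illustration only: read_hr_values is a hypothetical helper, the file names are assumed to come from the command line, and the regex breaks on the multi-line <Hr> shown above, which is why XML::Twig remains the more robust choice.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [value, index] pairs for every single-line
# <Hr> element found between <SchDay> and </SchDay> in one file.
sub read_hr_values {
    my ($input_file) = @_;
    open my $input, '<', $input_file
        or die "Failed to open '$input_file': $!";

    my @pairs;
    my $in_schday = 0;
    while (my $line = <$input>) {    # one line at a time, no slurping
        $in_schday = 1 if $line =~ /<SchDay>/;
        $in_schday = 0 if $line =~ m{</SchDay>};
        next unless $in_schday;
        # Assumes the whole element sits on one line
        if ($line =~ m{<Hr index="(\d+)">([^<]*)</Hr>}) {
            push @pairs, [ $2, $1 ];
        }
    }
    close $input or die "Failed to close '$input_file': $!";
    return @pairs;
}

# Assumption: input files are named on the command line
for my $input_file (@ARGV) {
    printf "%.2f, %.2f,\n", @$_ for read_hr_values($input_file);
}
```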