Using <> and regex to search and replace elements in a text file

Date: 2022-03-15 16:51:20

I'm working my way through Learning Perl, Chapter 9, "Processing Text with Regular Expressions."

Here are two of the end-of-chapter exercises:

  1. Write a program to add a copyright line to all of your exercise answers so far, placing a line like ## Copyright (c) 20XX by Yours Truly in the file immediately after the 'shebang' line. Presume that the program will be invoked with the filenames to edit already on the command line.

  2. Modify the previous program so that it doesn't edit the files that already contain the copyright line. As a hint on that, you might need to know that the name of the file being read by the diamond operator is in $ARGV.

This was my attempted solution:

#!/usr/bin/env perl

use 5.014;
use warnings;

my $shebang     = '(#!/usr/bin/env perl|#!/usr/bin/perl)'; 
my $copyright   = '# Copyright (c) 20XX Yours Truly'; 

$^I = ".bak";

while (<>) {
    unless (/$copyright/mi) {
        s/($shebang)/$1\n$copyright/mig;
    }
    print;
}

Run on the command line with perl ch9.pl sample_perl_script.pl.

My goals were:

  • Keep the original shebang intact, regardless of path.
  • Loop through <> just once.
  • Check to see if the copyright notice existed.
  • If it didn't, add it (hence the attempt with unless { ... }).

This works for the first part of the problem (adding a copyright line) but not the second (check to make sure the copyright doesn't already exist).

My questions are: Why? And why is the unless totally ignored when I run the program?

I peeked at the appendix, and the book's proposed solution is to create a hash to track filenames from $ARGV and pass over the files twice: first to eliminate files that already have the copyright notice, then to perform the search/replace. Like so:

my %do_these;
foreach (@ARGV) {
    $do_these{$_} = 1;
}

while (<>) { 
    if (/\A## Copyright/) {
        delete $do_these{$ARGV};
    }
}

@ARGV = sort keys %do_these; 
$^I = ".bak";
while (<>) {
    if (/\A#!/) {
        $_ .= "## Copyright (c) 20XX by Yours Truly\n";
    }
    print;
}

This works, of course, but it seems like twice the work. I'm trying to see if there's a way to do this within a single while (<>) { ... } loop, with my approach, and come away with a better understanding of how the diamond operator works.

If my approach is totally off-base, please explain why and don't spare my feelings. I'm more interested in a full understanding than my ego.

2 solutions

#1


3  

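Your unless does not work because the copyright notice is not on the same line as the shebang. The diamond operator reads one record at a time, up to and including the first occurrence of $/ (the input record separator), which by default is a newline. The test therefore runs against each line separately, and your program attempts the substitution on every line that does not contain the copyright notice, including the shebang line.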
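
To make the record-by-record behaviour concrete, here is a minimal sketch. It reads a made-up three-line sample through an in-memory filehandle instead of <>, and notice_seen_per_record is a hypothetical helper name, not something from the book or the question. It reports, for each record read, whether the notice was found in that record.

```perl
use strict;
use warnings;

my $copyright = '# Copyright (c) 20XX Yours Truly';

# Hypothetical sample standing in for a file that already has the notice.
my $sample = "#!/usr/bin/perl\n$copyright\nprint 1;\n";

# Read $text record by record using the given separator and return one
# flag per record: 1 if the notice appears in that record, 0 otherwise.
sub notice_seen_per_record {
    my ($text, $separator) = @_;
    local $/ = $separator;                      # undef means "slurp"
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    my @seen;
    while (my $record = <$fh>) {
        push @seen, ($record =~ /\Q$copyright\E/ ? 1 : 0);
    }
    close $fh;
    return @seen;
}

my @by_line = notice_seen_per_record($sample, "\n");
my @slurped = notice_seen_per_record($sample, undef);

print "line by line: @by_line\n";   # prints "line by line: 0 1 0"
print "slurped: @slurped\n";        # prints "slurped: 1"
```

Read line by line, the shebang record never contains the notice, so a test made on that record alone cannot know what the next line holds. Slurped, the whole file is one record and a single test sees everything.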

Since this is Perl, there are many ways to fix it. The most straightforward is perhaps to undefine $/ and slurp the file (read it all into one string). That way you can check right away whether there is a copyright notice on the second line of the file. E.g.:

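local $/;                                     # undef $/: slurp each file as one string
while (<>) {
    # After the first line, insert the notice (with its own newline)
    # unless the notice is already there: negative lookahead assertion.
    s/^.*\n\K(?!\Q$copyright\E)/$copyright\n/;
    print;
}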

You can also check line number 2 in your files directly, without slurping the file:

while (<>) {
    if ($. == 2) {
        unless (/\Q$copyright/) {
            print "$copyright\n";
        }
    }
    print;
    close ARGV if eof;                # this will reset the line counter $.
}

Note that Nick ODell is correct that your copyright string contains meta characters (namely parentheses) which must be escaped. I used \Q ... \E escape sequences above.
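
A small sketch of what the escaping changes, using the copyright string from the question. The helper names contains_unescaped and contains_literal are made up for illustration. Without escaping, (c) is a capture group that matches a bare c, so the pattern looks for "# Copyright c 20XX Yours Truly" and never matches the real notice.

```perl
use strict;
use warnings;

my $copyright = '# Copyright (c) 20XX Yours Truly';
my $line      = "$copyright\n";

# Interpolate the text as a pattern: parentheses act as a group.
sub contains_unescaped {
    my ($haystack, $text) = @_;
    return $haystack =~ /$text/ ? 1 : 0;
}

# Interpolate the text literally: \Q...\E escapes the metacharacters.
sub contains_literal {
    my ($haystack, $text) = @_;
    return $haystack =~ /\Q$text\E/ ? 1 : 0;
}

print "unescaped: ", contains_unescaped($line, $copyright), "\n";   # prints "unescaped: 0"
print "literal: ",   contains_literal($line, $copyright),   "\n";   # prints "literal: 1"

# quotemeta is the function form of \Q...\E.
my $escaped = quotemeta $copyright;
print "quotemeta: ", ($line =~ /$escaped/ ? 1 : 0), "\n";           # prints "quotemeta: 1"
```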

Note also that you do not need to be very specific in checking for the shebang. A very specific pattern is more likely to trip you up on slightly varied lines.
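
Putting these points together, a single pass per file might look like the sketch below. This is one possible arrangement and not the book's solution. add_notice is a hypothetical helper name, the notice text is the one from the exercise, and the code assumes every line ends with a newline. It accepts any first line starting with #! as the shebang, decides on line 2 whether to print the notice, and also covers a file that consists of the shebang line only.

```perl
use strict;
use warnings;

my $copyright = '## Copyright (c) 20XX by Yours Truly';

# Edit the named files in place, reading each file once.
sub add_notice {
    local @ARGV = @_;
    local $^I   = '.bak';          # in-place edit, keep a .bak copy
    my $has_shebang = 0;
    while (my $line = <>) {
        # Any "#!" first line counts, whatever the interpreter path.
        $has_shebang = ($line =~ /\A#!/ ? 1 : 0) if $. == 1;

        # The notice belongs on line 2; add it only if it is missing.
        print "$copyright\n"
            if $has_shebang && $. == 2 && $line !~ /\A\Q$copyright\E/;
        print $line;

        if (eof) {
            # A shebang-only file never reaches line 2.
            print "$copyright\n" if $has_shebang && $. == 1;
            close ARGV;            # resets $. for the next file
        }
    }
    return;
}

# Usage: perl add_notice.pl file1.pl file2.pl
add_notice(@ARGV) if @ARGV;
```

Closing ARGV at the end of each file is what makes $. restart at 1, so the line-number tests apply to every file and not just the first one. Running the program twice on the same file leaves it unchanged the second time.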

#2


4  

Your book's approach is stupid. Actually, I think perl is barfing because your copyright notice has special characters like (.

What you want is the quotemeta function.

I'd change your program like so:

while (<>) {
    my $copyright2 = quotemeta $copyright;
    unless (/$copyright2/mi) {
        s/($shebang)/$1\n$copyright/mig;
    }
    print;
}

Apologies if that doesn't work. It's been a while since I wrote perl.
