How can I skip lines that are not whitespace or a number in Perl?

Time: 2022-08-25 23:49:45

I am reading data from a file like this

while (<$fh>)
{
        @tmp = split; # <-- ?
        push @AoA, [@tmp];
}

I have a couple of questions regarding this. What does the marked line do? Does it split the file by lines and store elements of each line into an array?? If so, is it possible to convert @tmp into a string or do a regex on @tmp?

Basically I want to stop pushing data onto the AoA if I find anything other than a space or an integer in the file. I have the regex for it already: /^[\s\d]*$/

10 answers

#1


The line @tmp = split; is shorthand for:

@tmp = split " ", $_, 0;

which is similar to

@tmp = split /\s+/, $_, 0;

but ignores any leading whitespace, so " foo bar baz" becomes ("foo", "bar", "baz") instead of ("", "foo", "bar", "baz").
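
The difference is easy to check for yourself. Here is a small sketch (the sample string is made up for illustration) that splits the same line both ways and counts the fields:

```perl
use strict;
use warnings;

my $line = "  foo bar   baz\n";

# split " " (what a bare split does): leading whitespace is skipped,
# so there is no empty first field.
my @awk_mode = split " ", $line;

# split /\s+/: the leading whitespace yields an empty first field.
my @regex_mode = split /\s+/, $line;

print scalar(@awk_mode),   " fields: ", join("|", @awk_mode),   "\n";
print scalar(@regex_mode), " fields: ", join("|", @regex_mode), "\n";
```

The first form should report 3 fields and the second 4, the extra one being the empty string in front of foo. Trailing empty fields are dropped in both cases because the limit is 0.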

It takes each line read from the filehandle $fh and splits it, using whitespace as the delimiter.

Regarding what you want to do, why don't you just run the regex on $_ to begin with? That's a string.

You could do:

while (<$fh>) {
    last unless  /^[\s\d]*$/; # break if a line containing something 
                              # other than whitespace or a number is found
    @tmp = split;
    push @AoA, [@tmp];
}
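
If you want to try that loop without a real file, here is a self-contained sketch. The sample data and the in-memory filehandle are assumptions for illustration only; a real script would open the actual file. It also uses a lexical $line instead of $_ and declares everything with my so that it runs under use strict:

```perl
use strict;
use warnings;

# Hypothetical sample data; the third line contains a letter.
my $data = "1 2 3\n 4  5\n6 x 7\n8 9\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my @AoA;
while (my $line = <$fh>) {
    # Stop at the first line holding anything but whitespace and digits.
    last unless $line =~ /^[\s\d]*$/;
    push @AoA, [ split " ", $line ];
}
close $fh;

# Print each stored row as comma-separated values.
print join(",", @$_), "\n" for @AoA;
```

Only the first two lines end up in @AoA, because last leaves the loop for good. Note that a completely blank line also passes the regex and is stored as an empty row.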

#2


When you wonder what a Perl built-in does, read its documentation. Most of the answers you are getting are merely restating the documentation. The key to using any language is learning how to use its documentation. If you've read the docs and don't understand them, mention that in your question :)

  • You can look in the perlfunc page to see all the built-ins.

  • At the command line, you can use the -f switch to perldoc to pull out just the documentation for a built-in: perldoc -f split

Good luck, :)

#3


while(<$fh>) {

This reads the file in line-by-line. The current line of the file is stored in $_. It's basically the same as while($_ = <$fh>) {. Technically it expands to while(defined($_ = <$fh>)) {, but they're very close to the same thing (and either way, it's automatic, so you don't need to worry about it).

  @tmp = split; 

"split" with no arguments is (mostly) equivalent to "split /\s+/, $_". It splits the current line into a list of items between whitespace. So it splits the current line into a list of words (more or less) and stores this list in an array. However, this line is bad. @tmp should be qualified with my. Perl would catch this if you have use strict; and use warnings; at the top.

  push @AoA, [@tmp];
}

This pushes a reference to an anonymous array containing the elements that were in @tmp into @AoA, which is an array of arrays (as you probably already knew).

So in the end, you have a list @AoA where each element in the list corresponds to a line of the file, and each element of the list is another list of the words on that line.
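
To make the "reference to an anonymous array" part concrete, here is a sketch that contrasts [@tmp] with \@tmp. The @shared array and the sample lines are made up for the comparison, and @tmp is deliberately declared outside the loop to show what goes wrong without the copy:

```perl
use strict;
use warnings;

my (@AoA, @shared);
my @tmp;    # deliberately declared outside the loop

for my $line ("1 2 3", "4 5 6") {
    @tmp = split " ", $line;
    push @AoA,    [@tmp];   # reference to a fresh copy of @tmp
    push @shared, \@tmp;    # reference to @tmp itself, the same array every time
}

print "AoA row 0, column 2: ", $AoA[0][2], "\n";
print "AoA row 0:    @{ $AoA[0] }\n";
print "shared row 0: @{ $shared[0] }\n";
```

Row 0 of @AoA still holds 1 2 3, while row 0 of @shared shows 4 5 6, because every entry in @shared points at the same @tmp. Declaring my @tmp inside the loop would also avoid the problem, since each iteration then gets a new array.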

In short, @tmp should really be declared using my, and you should use strict; and use warnings;. In fact, as has been said, you could do away with @tmp altogether:

while(<$fh>) { push @AoA, [split] }

But using a temporary array may be nicer on anyone who has to add to this code later.

EDIT: I missed the regex you wanted to add:

while(<$fh>) {
  last unless /^[\d\s]*$/;
  push @AoA, [split];
}

However, /^[\d\s]*$/ won't catch all integers - specifically, it won't match -1. If you want it to match negative numbers, use /^[\d\s-]*$/. Also, if you want to match non-integers (floating-point numbers), you could use /^[\d\s\.-]*$/, but I don't know if you want to match those. However, these regexes will match invalid entries like 1-3 and 5.5.5, which are NOT integers or numbers. If you want to be more strict about that, try this:

LOOP: while(<$fh>) {
  my @tmp = split;
  for(@tmp) {
    # use this line to allow floating-point numbers:
    last LOOP unless /^-?\d+(?:\.\d+)?$/;
    # or use this line instead to allow only integers:
    # last LOOP unless /^-?\d+$/;
  }
  push @AoA, [@tmp];
}
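
To see why the per-token check is stricter than the character-class regex, here is a sketch that runs both against a few made-up lines. The helper name all_numbers is hypothetical, and it uses [0-9] rather than \d so that only ASCII digits count:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every whitespace-separated token is a number.
sub all_numbers {
    my ($line) = @_;
    for my $token (split " ", $line) {
        return 0 unless $token =~ /^-?[0-9]+(?:\.[0-9]+)?$/;
    }
    return 1;
}

for my $line ("1 -2 3.5", "1-3 4", "5.5.5") {
    my $loose  = $line =~ /^[\d\s.-]*$/ ? "accepts" : "rejects";
    my $strict = all_numbers($line)     ? "accepts" : "rejects";
    printf "%-10s character class %s, per-token check %s\n",
        $line, $loose, $strict;
}
```

The character class accepts all three lines, while the per-token check accepts only the first one.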

#4


@tmp = split; splits each incoming line of the file on whitespace and stores the words, as an array, in @tmp. (The while() loop is iterating across each line in the file.) A reference to a new array holding a copy of @tmp is then pushed onto @AoA.

The best way to accomplish 'converting @tmp into a string', if you want to do something with it right there, is to never convert it out of being a string in the first place; the split is operating on $_, which is a string (the while loop is implicitly setting this). If you do regex operations like s/foo/bar/ within that loop, they'll automatically operate on $_.

So one way to accomplish what you say you want (with the code simplified somewhat) is:

while(<$fh>) {
    last
        if /[^\s\d]/;
    push @AoA, [split];
}

If you truly desired to reconvert @tmp to a string, you could do:

my $tmp = join ' ', @tmp;

#5


Actually, the while (<$fh>) line splits the file by lines; each iteration of the loop will have a new line stored in $_.

The marked line splits the line stored in $_ by whitespace. So, @tmp will be an array containing all of the words on the line: if the line contains foo bar baz, @tmp will be ('foo', 'bar', 'baz').

If you want to do a regexp match on the line in question, then you should do that before you split the line. A regular expression in perl matches against $_ by default, so the line is pretty simple:

while (<$fh>)
{
    last unless /^[\s\d]*$/;
    @tmp = split;
    push @AoA, [@tmp];
}

#6


Warning: \d doesn't mean [0-9] in Perl 5.8 and 5.10 (unless you use the bytes pragma). It means any Unicode character that has the digit property, such as MONGOLIAN DIGIT FIVE, U+1815 (᠕). If you want to restrict it to only whitespace and numbers you can do math with, you need to say /^[\s0-9]*$/.
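
A quick way to confirm this is to test a single non-ASCII digit against both patterns. This is only a sketch; the variable names are made up:

```perl
use strict;
use warnings;

my $mongolian_five = "\x{1815}";    # MONGOLIAN DIGIT FIVE

my $matches_d     = $mongolian_five =~ /^\d$/    ? 1 : 0;   # Unicode-aware \d
my $matches_ascii = $mongolian_five =~ /^[0-9]$/ ? 1 : 0;   # ASCII digits only

print "\\d matches: $matches_d, [0-9] matches: $matches_ascii\n";
```

Here \d matches and [0-9] does not. On Perl 5.14 and later there is also the /a regex modifier, which restricts \d and \s to ASCII, so /^[\s\d]*$/a would be another option there.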

#7


split takes the string it is given and turns it into an array by splitting on whitespace. Since no parameter is given, it splits the $_ variable (which is given each line from the file in $fh in turn).

It is not necessary to convert @tmp into a string, since that string is already in the $_ variable.

In order to stop the loop if you match any single character that is not whitespace or numeric:

last if /[^\s\d]/;

This is slightly different from your version, which matches any complete line that consists only of whitespace and/or digits; this one instead matches as soon as the line contains a single character that is neither.

#8


The first line is a while loop like any other, but its "condition" reads a line of input from the filehandle $fh into the default variable $_. If the read succeeds (i.e. we're not at the end of the file), the body executes. It's essentially a "for each line in the file $fh".

The next line splits the contents of $_ (the default variable, remember, so it's left out of the call to split) on whitespace (the default separator) and stores the result in @tmp. The last line adds a reference to a copy of @tmp to @AoA, an array of array references.

So, what you want to do is say (at the top of the loop)

last if $_ =~ <appropriate regex here>;

#9


The core questions have been pretty well covered already, but there's one aspect of the "turning @tmp back into a string" subquestion that hasn't been explicitly mentioned:

$_ and join ' ', @tmp are not equivalent. $_ will contain the line as originally read. join ' ', @tmp will contain the words found on the line, joined by single spaces. If the line contains non-space whitespace (e.g., tabs), words separated by multiple spaces, or leading whitespace, then the two versions of the "complete" line will be different.
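
Here is a small sketch of that difference, using a made-up line with a leading tab, a run of spaces, and a trailing newline:

```perl
use strict;
use warnings;

my $line = "\t1   2\t3 \n";    # tab, run of spaces, trailing space and newline

my @tmp     = split " ", $line;
my $rebuilt = join ' ', @tmp;

print "rebuilt: '", $rebuilt, "'\n";
print "same as original? ", ($rebuilt eq $line ? "yes" : "no"), "\n";
```

The rebuilt string is just 1 2 3 with single spaces, so it no longer compares equal to the line that was read.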

#10


ok cool!

shorthand explains a lot.

So I can do this..

while (<$fh>)
{
        if( /^[\s\d]*$/ ){
          # do something
        }else{
          # do something else
        }

        @tmp = split;
        push @AoA, [@tmp];
}
