Extracting data between square brackets “[]” using Perl

Date: 2021-07-09 21:43:52

I was using a regex to extract data from parentheses (round brackets), e.g. extracting a,b from (a,b), as shown below. Every line in my file looks like this:

this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a2,b2) and [b2|a2]
this is the range of values (a3,b3) and [b3|a3]

I'm using the following regex to extract a1,b1, a2,b2, etc.:

@numbers = $_ =~ /\((.*),(.*)\)/;

However, how can I extract the data from the square brackets [] instead? For example:

this is the range of values (a1,b1) and [b1|a1]
this is the range of values (a1,b1) and [b2|a2]

I need to extract/match only the data in the square brackets, not the data in the parentheses.

3 Answers

#1


23  

[Update] In the meantime, I've written a blog post about the specific issue with .* I describe below: Why Using .* in Regular Expressions Is Almost Never What You Actually Want

If your identifiers a1, b1 etc. never contain commas or square brackets themselves, you should use a pattern along the lines of the following to avoid backtracking hell:

/\[([^,\]]+),([^,\]]+)\]/

Here's a working example on Regex101.
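
The sample lines in the question separate the bracketed values with | rather than a comma. A minimal sketch of the same negated-character-class idea adapted to that separator, assuming the values never contain | or ] themselves (the sub name extract_pair is hypothetical and only used for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: returns the two values inside the first [x|y] group
# on a line, or an empty list if the line has no such group.
# Assumption: the values never contain '|' or ']'.
sub extract_pair {
    my ($line) = @_;
    return $line =~ /\[([^|\]]+)\|([^|\]]+)\]/;
}

my @pair = extract_pair('this is the range of values (a1,b1) and [b1|a1]');
print "@pair\n";    # prints "b1 a1"
```

The parenthesized part (a1,b1) is skipped because the pattern only starts matching at a literal [.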

The issue with greedy quantifiers like .* is that they will very likely consume too much at first, so the regex engine has to backtrack extensively. Even with non-greedy quantifiers, the engine makes more match attempts than necessary, because it consumes only one character at a time and then tries to advance in the pattern.
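
Besides the extra backtracking, a greedy .* can also capture the wrong text when a line holds more than one bracketed group. A small sketch of the difference, using a made-up input line that is not taken from the question:

```perl
use strict;
use warnings;

# Hypothetical input with two bracketed groups on one line.
my $line = '[a1,b1] and [a2,b2]';

# Greedy: the first .* runs to the end of the line and backtracks only as far
# as the last comma, so the first capture spans both groups.
my @greedy = $line =~ /\[(.*),(.*)\]/;
print join(' | ', @greedy), "\n";    # prints "a1,b1] and [a2 | b2"

# Negated character class: each capture stops at the first ',' or ']'.
my @strict = $line =~ /\[([^,\]]+),([^,\]]+)\]/g;
print join(' | ', @strict), "\n";    # prints "a1 | b1 | a2 | b2"
```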

(You could even use atomic groups to make the matching even more performant.)
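
As a rough sketch of what that could look like, assuming the comma-separated [a1,b1] form used in the pattern above: an atomic group (?>...) or a possessive quantifier such as ++ (available in Perl 5.10 and later) tells the engine not to backtrack into what the character class has already matched.

```perl
use strict;
use warnings;

my $line = 'this is the range of values [a1,b1]';

# (?>...) is an atomic group: once it has matched, the engine will not
# backtrack into it to try a shorter match.
my @atomic = $line =~ /\[((?>[^,\]]+)),((?>[^,\]]+))\]/;

# The possessive quantifier ++ expresses the same thing more compactly.
my @possessive = $line =~ /\[([^,\]]++),([^,\]]++)\]/;

print "@atomic\n";        # prints "a1 b1"
print "@possessive\n";    # prints "a1 b1"
```

Both forms capture the same values as the plain negated character class; they only change how the engine behaves when a match attempt fails.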

#2


2  

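#!/usr/bin/perl
use strict;
use warnings;

my @numbers;
# Test the assignment rather than chomp's return value, so a final line
# without a trailing newline is not skipped.
while (my $line = <DATA>) {
    chomp $line;
    # Capture the two comma-separated values inside [...]
    if ($line =~ m|\[(.*),(.*)\]|) {
        push @numbers, ($1, $2);
    }
}
print "@numbers\n";
__DATA__
this is the range of values [a1,b1]
this is the range of values [a2,b2]
this is the range of values [a3,b3]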

Demo

#3


1  

You can match it with the non-greedy quantifier *? instead:

my @numbers = $_ =~ /\[(.*?),(.*?)\]/g;

or

my @numbers = /\[(.*?),(.*?)\]/g;

for short.

UPDATE

my @numbers = /\[(.*?)\|(.*?)\]/g;
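
For context, a minimal sketch of how this pattern might be applied to the sample lines from the question. The lines are held in an array here purely for illustration; in practice they would be read from the file.

```perl
use strict;
use warnings;

# Sample lines copied from the question.
my @lines = (
    'this is the range of values (a1,b1) and [b1|a1]',
    'this is the range of values (a2,b2) and [b2|a2]',
    'this is the range of values (a3,b3) and [b3|a3]',
);

my @numbers;
for my $line (@lines) {
    # In list context with /g, the match returns every captured pair on the line.
    push @numbers, $line =~ /\[(.*?)\|(.*?)\]/g;
}

print "@numbers\n";    # prints "b1 a1 b2 a2 b3 a3"
```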
