How do I extract substrings from a string in Perl?

Date: 2022-05-30 07:33:00

Consider the following strings:

1) Scheme ID: abc-456-hu5t10 (High priority) *****

2) Scheme ID: frt-78f-hj542w (Balanced)

3) Scheme ID: 23f-f974-nm54w (super formula run) *****

and so on in the above format. The ID, the text in parentheses, and the trailing asterisks are the parts that change across the strings.

I have many strings in the format shown above, and I want to pick three substrings from each of them:

  • 1st substring: the alphanumeric value (in the example above, "abc-456-hu5t10")
  • 2nd substring: the words in parentheses (in the example above, "High priority")
  • 3rd substring: the * (if a * is present at the end of the string; otherwise leave it)

How do I pick these three substrings from each string shown above? I know it can be done using regular expressions in Perl. Can you help with this?

7 answers

#1


29  

You could do something like this:

my $data = <<END;
1) Scheme ID: abc-456-hu5t10 (High priority) *
2) Scheme ID: frt-78f-hj542w (Balanced)
3) Scheme ID: 23f-f974-nm54w (super formula run) *
END

foreach (split(/\n/,$data)) {
  $_ =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/ || next;
  my ($id,$word,$star) = ($1,$2,$3);
  print "$id $word $star\n";
}

The key thing is the Regular expression:

Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?

It breaks down as follows.

The fixed string "Scheme ID: ":

Scheme ID: 

Followed by one or more of the characters a-z, 0-9 or -. We use the brackets to capture it as $1:

([a-z0-9-]+)

Followed by one or more whitespace characters:

\s+

Followed by an opening parenthesis (which we escape), then one or more characters that aren't a closing parenthesis, and then a closing parenthesis (also escaped). We use unescaped parentheses to capture the words as $2:

\(([^)]+)\)

Followed by optional whitespace and maybe a *, captured as $3:

\s*(\*)?
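
When the trailing * is absent, $3 is undefined, which triggers an "uninitialized value" warning under `use warnings` as soon as it is printed. The following is a minimal sketch of one way to handle that, assuming Perl 5.10 or later for the `//` operator; `parse_scheme` is a hypothetical helper name, not part of the answer above.

```perl
use strict;
use warnings;

# parse_scheme is a hypothetical helper name (an assumption for this sketch).
# It returns (id, description, star), or an empty list if the line does not match.
sub parse_scheme {
    my ($line) = @_;
    my ($id, $word, $star) =
        $line =~ /Scheme ID: ([a-z0-9-]+)\s+\(([^)]+)\)\s*(\*)?/
        or return;
    # $3 is undef when there is no trailing *, so default it to ''.
    return ($id, $word, $star // '');
}

for my $line (
    '1) Scheme ID: abc-456-hu5t10 (High priority) *',
    '2) Scheme ID: frt-78f-hj542w (Balanced)',
) {
    my ($id, $word, $star) = parse_scheme($line) or next;
    print "$id|$word|$star\n";
}
```

Returning an empty list on failure lets the caller test the assignment itself, so stale values in $1, $2 and $3 from an earlier match are never used by accident.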

#2


3  

You could use a regular expression such as the following:

/([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/

So for example:

$s = "abc-456-hu5t10 (High priority) *";
$s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/;
print "$1\n$2\n$3\n";

prints

abc-456-hu5t10
High priority
*
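
The example above uses $1, $2 and $3 without checking whether the match succeeded. If it fails, those variables keep whatever an earlier successful match left in them. A small sketch of the same pattern used in list context with a check might look like this; the sample lines and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'abc-456-hu5t10 (High priority) *',
    'frt-78f-hj542w (Balanced)',
    'not a scheme line',
);

my @rows;
for my $s (@lines) {
    # In list context a successful match returns the captures directly;
    # a failed match returns an empty list, so the if-block is skipped.
    if (my ($id, $label, $star) = $s =~ /([-a-z0-9]+)\s*\((.*?)\)\s*(\*)?/) {
        push @rows, [ $id, $label, defined $star ? 1 : 0 ];
    }
}

printf "%s | %s | star=%d\n", @$_ for @rows;
```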

#3


2  

Well, here's a one-liner:

perl -lne 'm|Scheme ID:\s+(.*?)\s+\((.*?)\)\s?(\*)?|g&&print "$1:$2:$3"' file.txt

Expanded to a simple script to explain things a bit better:

#!/usr/bin/perl -ln

# -w : warnings (not enabled above; add it to the switches if wanted)
# -l : print newline after every print
# -n : apply script body to stdin or files listed on the command line, don't print $_

use strict; # always do this.

my $regex = qr{  # precompile regex
  Scheme\ ID:      # literal prefix (the space is escaped because of /x)
  \s+              # 1 or more whitespace
  (.*?)            # non-greedy match of all characters up to
  \s+              # 1 or more whitespace
  \(               # literal opening parenthesis
    (.*?)            # non-greedy match up to the next
  \)               # literal closing parenthesis
  \s*              # 0 or more whitespace
  (\*)?            # 0 or 1 literal * (the trailing * is optional)
}x;  # the x switch allows whitespace and comments in the regex for documentation.

# values are captured in $1 $2 $3, so do whatever you need to:
# Perl lets you use any characters as delimiters; I like pipes because
# they reduce the amount of escaping when using file paths
m|$regex| && print "$1 : $2 : $3";

# alternatively: if (m|$regex|) { doOne($1); doTwo($2) ... }

Though if it were for anything other than formatting, I would implement a main loop to handle files and flesh out the body of the script rather than relying on the command-line switches for the looping.
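
One possible shape for such a main loop is sketched below. It is only an illustration under stated assumptions: `extract_from_fh` is a hypothetical helper name, and the regex is the same one as in the script above written on a single line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $regex = qr{Scheme\ ID:\s+(.*?)\s+\((.*?)\)\s*(\*)?}x;

# extract_from_fh is a hypothetical helper: it reads one filehandle
# and returns an array ref of [id, description, star] rows.
sub extract_from_fh {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ $regex;
        # $3 is undef when the line has no trailing *.
        push @rows, [ $1, $2, defined $3 ? $3 : '' ];
    }
    return \@rows;
}

# Explicit main loop over the files named on the command line.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    print join(' : ', @$_), "\n" for @{ extract_from_fh($fh) };
    close $fh;
}
```

Keeping the parsing in a sub that takes a filehandle makes it easy to feed it a file, STDIN, or an in-memory string while testing.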

#4


2  

(\S*)\s*\((.*?)\)\s*(\*?)


(\S*)    picks up anything which is NOT whitespace
\s*      0 or more whitespace characters
\(       a literal open parenthesis
(.*?)    anything, non-greedy so stops on first occurrence of...
\)       a literal close parenthesis
\s*      0 or more whitespace characters
(\*?)    0 or 1 occurrences of literal *
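
Note that this pattern writes the optional star as (\*?) where some of the other answers write (\*)?. The two behave differently when no * is present, which the following sketch tries to show; the sample line and the array names are assumptions for illustration.

```perl
use strict;
use warnings;

my $line = '2) Scheme ID: frt-78f-hj542w (Balanced)';

# Star quantified inside the group: the group always participates,
# so the third capture is an empty string when no * is present.
my @inside  = $line =~ /(\S*)\s*\((.*?)\)\s*(\*?)/;

# Star quantified outside the group: the group may be skipped,
# so the third capture is undef when no * is present.
my @outside = $line =~ /(\S*)\s*\((.*?)\)\s*(\*)?/;

print "id:      $inside[0]\n";
print "label:   $inside[1]\n";
print "inside:  ", (defined $inside[2]  ? "'$inside[2]'"  : 'undef'), "\n";
print "outside: ", (defined $outside[2] ? "'$outside[2]'" : 'undef'), "\n";
```

With (\*?) the third value can be printed directly without a defined check, which may be convenient when running under `use warnings`.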

#5


1  

Long time no Perl

while(<STDIN>) {
    next unless /:\s*(\S+)\s+\(([^\)]+)\)\s*(\*?)/;
    print "|$1|$2|$3|\n";
}

#6


1  

This just requires a small change to my last answer:

my ($guid, $scheme, $star) = $line =~ m{
    The [ ] Scheme [ ] GUID: [ ]
    ([a-zA-Z0-9-]+)          #capture the guid
    [ ]
    \(  (.+)  \)             #capture the scheme 
    (?:
        [ ]
        ([*])                #capture the star 
    )?                       #if it exists
}x;
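
The pattern above expects the prefix "The Scheme GUID:", which appears to come from the earlier answer it refers to, so it would not match the "Scheme ID:" strings in this question as written. A sketch of the same /x layout adapted to those strings follows. This is an assumption about what was intended, and it also swaps (.+) for ([^)]+) so the capture stops at the first closing parenthesis.

```perl
use strict;
use warnings;

my $line = '3) Scheme ID: 23f-f974-nm54w (super formula run) *';

my ($id, $scheme, $star) = $line =~ m{
    Scheme [ ] ID: [ ]
    ([a-zA-Z0-9-]+)      # capture the ID
    [ ]
    \( ([^)]+) \)        # capture the text inside the parentheses
    (?:
        [ ]
        ([*])            # capture the star
    )?                   # if it exists
}x;

# $star is undef when the line has no trailing *, so default it to ''.
print defined $id ? "$id|$scheme|" . ($star // '') . "\n" : "no match\n";
```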

#7


0  

String 1:

$input =~ /^\S+/;   # assumes $input starts with the ID, e.g. "abc-456-hu5t10 (High priority) *"
$s1 = $&;

String 2:

$input =~ /\(.*\)/;
$s2 = $&;

String 3:

$input =~ /\*?$/;
$s3 = $&;
