I have a text like this:
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
The text is multilined and I need to extract from last occurrence of "*/" until "////RESULT". In this case, the result should be:
select this part on
ly
How can I achieve this in Perl?
I have attempted \\\*/(.|\n)*////RESULT, but that starts matching from the first "*/".
3 answers
#1
17
A useful trick in cases like this is to prefix the regexp with the greedy pattern .*, which will try to match as many characters as possible before the rest of the pattern matches. So:
my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);
Let's break this pattern into its components:
- ^.* starts at the beginning of the string and matches as many characters as it can. (The s modifier allows . to match even newlines.) The beginning-of-string anchor ^ is not strictly necessary, but it ensures that the regexp engine won't waste too much time backtracking if the match fails.
- \*/ just matches the literal string */.
- (.*?) matches and captures any number of characters; the ? makes it ungreedy, so it prefers to match as few characters as possible in case there's more than one position where the rest of the regexp can match.
- Finally, ////RESULT just matches itself.
Since the pattern contains a lot of slashes, and since I wanted to avoid leaning toothpick syndrome, I decided to use alternative regexp delimiters. Exclamation points (!) are a popular choice, since they rarely collide with regexp syntax (the main exception is a negative lookahead such as (?!...), which this pattern does not use).
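As a minimal sketch, here is the pattern above run against the sample text from the question. The heredoc and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Sample text from the question, including the trailing "rest string".
my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT rest string
END

# The greedy ^.* pushes the match to the last */ that is still followed
# by ////RESULT; the lazy (.*?) then captures everything up to ////RESULT.
my ($match) = ($string =~ m!^.*\*/(.*?)////RESULT!s);

print defined($match) ? $match : "no match\n";
```

Note that the captured text keeps the space after */ and the newline before ////RESULT, so you may want to trim it afterwards.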
Edit: Per discussion with ikegami below, I should note that if you want to use this regexp as a sub-pattern in a longer regexp, and you want to guarantee that the string matched by (.*?) will never contain ////RESULT, then you should wrap those parts of the regexp in an independent (?>...) subexpression, like this:
my $regexp = qr!\*/(?>(.*?)////RESULT)!s;
...
my ($match) = ($string =~ /^.*$regexp$some_other_regexp/s);  # list context, so $match receives the first capture
The (?>...) causes the pattern inside it to fail rather than accept a suboptimal match (i.e. one that extends beyond the first substring matching ////RESULT), even if that means that the rest of the regexp will fail to match.
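To illustrate the difference, here is a small sketch. The input string and the trailing " END" sub-pattern are made up for the example; they stand in for whatever longer regexp you embed this in.

```perl
use strict;
use warnings;

# Hypothetical input with two ////RESULT markers; only the second one
# is followed by " END".
my $string = "a */ X ////RESULT Y ////RESULT END";

# Without the atomic group, the lazy (.*?) is stretched past the first
# ////RESULT so that the trailing " END" can match.
my ($loose) = $string =~ m!^.*\*/(.*?)////RESULT END!s;

# With (?>...), the group commits to the first ////RESULT. Since " END"
# does not follow it, the whole match fails instead of stretching.
my ($strict) = $string =~ m!^.*\*/(?>(.*?)////RESULT) END!s;

print "loose:  ", (defined($loose)  ? "<$loose>"  : "no match"), "\n";
print "strict: ", (defined($strict) ? "<$strict>" : "no match"), "\n";
```

The first match captures " X ////RESULT Y ", which contains the marker; the second one reports no match.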
#2
4
(?:(?!STRING).)* matches any run of characters that doesn't contain STRING. It's like [^a]*, but for strings instead of characters.
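A minimal sketch of that construct in isolation, with STOP as a made-up stand-in for STRING:

```perl
use strict;
use warnings;

# Hypothetical input; STOP plays the role of STRING.
my $text = "abc STOP def STOP ghi";

# Each . is consumed only if STOP does not begin at that position,
# so the run ends just before the first STOP.
my ($prefix) = $text =~ m/^((?:(?!STOP).)*)/s;

print "<" . $prefix . ">\n";
```

Here $prefix ends up as "abc " (with the trailing space), because the lookahead refuses to let . step onto the S of the first STOP.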
You can take shortcuts if you know certain inputs won't be encountered (like Kenosis and Ilmari Karonen did), but this is what matches what you specified:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
(?: (?! \*/ ). )*
\z
}xs;
If you don't care if */ appears after ////RESULT, the following is the safest:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ ). )* )
////RESULT
}xs;
You didn't specify what should happen if there are two ////RESULT that follow the last */. The above matches until the last one. If you wanted to match until the first one, you'd use:
my ($segment) = $string =~ m{
\*/
( (?: (?! \*/ | ////RESULT ). )* )
////RESULT
}xs;
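The difference between the last two patterns can be seen on a made-up input with two ////RESULT markers after the last */ (the string and variable names are assumptions for the example):

```perl
use strict;
use warnings;

# Hypothetical input: two ////RESULT markers follow the only */.
my $string = "x */ keep ////RESULT more ////RESULT tail";

# Only */ is excluded from the capture, so the greedy run extends
# to the last ////RESULT.
my ($to_last) = $string =~ m{
    \*/
    ( (?: (?! \*/ ). )* )
    ////RESULT
}xs;

# Excluding ////RESULT as well stops the capture at the first marker.
my ($to_first) = $string =~ m{
    \*/
    ( (?: (?! \*/ | ////RESULT ). )* )
    ////RESULT
}xs;

print "to last:  <" . $to_last  . ">\n";
print "to first: <" . $to_first . ">\n";
```

The first capture is " keep ////RESULT more " and the second is " keep ".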
#3
2
Here's one option:
use strict;
use warnings;
my $string = <<'END';
hello world /* select a from table_b
*/ some other text with new line cha
racter and there are some blocks of
/* any string */ select this part on
ly
////RESULT
END
my ($segment) = $string =~ m!\*/([^/]+)////RESULT$!s;
print $segment;
Output:
select this part on
ly
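One thing to keep in mind with this shortcut: [^/]+ assumes the wanted segment contains no slash at all. The sketch below uses a made-up input where that assumption does not hold, and compares it with the lookahead-based pattern from answer #2.

```perl
use strict;
use warnings;

# Hypothetical input: the segment to extract contains slashes.
my $string = "a */ path/to/file\n////RESULT\n";

# The character-class shortcut cannot cross a slash, so it finds nothing.
my ($short) = $string =~ m!\*/([^/]+)////RESULT$!s;

# The lookahead-based pattern only refuses to cross a */ sequence.
my ($general) = $string =~ m{ \*/ ( (?: (?! \*/ ). )* ) ////RESULT }xs;

print "short:   ", (defined($short)   ? "matched" : "no match"), "\n";
print "general: ", (defined($general) ? "matched" : "no match"), "\n";
```

If your data never has a slash between */ and ////RESULT, the shorter pattern is fine.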