I wish to read in a text, use regex to find all instances of a pattern, then print the matching strings. If I use the re.search() method, I can successfully grab and print the first instance of the desired pattern:
import re
text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."
match = re.search(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match.group())
Unfortunately, the re.search() method only finds the first instance of the desired pattern, so I substituted re.findall():
import re
text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."
match = re.findall(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match)
This version finds both instances of the target pattern in the sample text, but I can't find a way to print the sentences in which the patterns occur. The print statement in this second snippet yields [('Cello', ' with', 'Lillian'), ('Cello', ' yellow', 'Lillian')] instead of the output I want: "Cello is a yellow parakeet who sings with Lillian. Cello is a yellow Lillian."
Is there a way to modify the second bit of code so as to obtain this desired output? I would be most grateful for any advice anyone can lend on this question.
2 Solutions
#1
1
I would just make a big capturing group around the two endpoints:
import re
text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."
for match in re.findall(r'(Cello(?:\W{1,80}\w{1,60}){0,9}\W{0,20}Lillian)', text, flags=re.I):
    print(match)
Now, you get the two sentences:
Cello is a yellow parakeet who sings with Lillian
Cello is a yellow Lillian
Some tips:
- flags=re.I makes the regex case-insensitive, so Cello matches both cello and Cello.
- (?:foo) is just like (foo), except that the group is non-capturing: the text it matches isn't returned as a separate group. It's useful for grouping things without capturing them.
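To see why the pattern from the question printed tuples, it may help to compare the two behaviours side by side. The following is only an illustrative sketch, written for Python 3; the variable names (grouped, tuples, whole) are made up for the example. When a pattern has several capturing groups, re.findall() returns one tuple of group contents per match, whereas re.finditer() with group(0) gives the whole matched text without changing the groups at all.

```python
import re

text = ("Cello is a yellow parakeet who sings with Lillian. Toby is a clown "
        "who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.")

# Original pattern from the question: three capturing groups.
grouped = r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)'

# findall() returns one tuple of group contents per match.
tuples = re.findall(grouped, text)

# finditer() yields match objects; group(0) is the whole matched text,
# no matter how many capturing groups the pattern contains.
whole = [m.group(0) for m in re.finditer(grouped, text)]

for sentence in whole:
    print(sentence)
```

Either approach (one outer capturing group, or finditer() with group(0)) should yield the same two strings; which one reads better is mostly a matter of taste.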
#2
3
Description
Use lookaheads, as in this regex, which captures complete sentences that contain both Cello and Lillian:
(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))
The expression breaks down into these functional components:
- (?:(?<=\.)\s+|^) starts matching the sentence either after a . followed by one or more whitespace characters, or at the start of the string
- ( starts capture group 1, which captures the entire sentence
- (?= starts the first lookahead
  - (?:(?!\.(?:\s|$)).)*? keeps the regex engine from leaving this sentence: it consumes characters lazily but cannot move past a . that is followed by either whitespace or the end of the string
  - \b matches a word boundary
  - [Cc]ello matches the desired text, either all lower case or with a capital initial
  - (?=\s|\.|$) looks ahead to ensure the word is followed by whitespace, a ., or the end of the string
  - ) ends the lookahead
- (?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)) does essentially the same, but for Lillian
- .*?\.(?=\s|$) captures the rest of the sentence up to and including the period, and makes sure the period is followed by either whitespace or the end of the string
- ) ends capture group 1
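The least obvious piece of the breakdown above is probably the component that keeps the engine inside one sentence. As a rough illustration, here it is isolated in Python 3. The sample string and the names within_sentence and reach are hypothetical, and the quantifier is made greedy here purely so the reach of the component is visible (the full regex uses the lazy *? form).

```python
import re

# Consume any character, as long as it is not a period that is followed
# by whitespace or by the end of the string (a sentence-ending period).
within_sentence = re.compile(r'(?:(?!\.(?:\s|$)).)*')

# Hypothetical sample: the period inside 3.14 is not a sentence boundary.
sample = "Pi is 3.14 roughly. Toby is a clown."

# match() anchors at position 0, so this is everything the component can
# reach when starting from the beginning of the first sentence.
reach = within_sentence.match(sample).group()
print(reach)
```

The match stops just before the first sentence-ending period, which is why a lookahead built on this component can only see words in the current sentence.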
Code example
I don't know Python well enough, so I offer a PHP example. Note that in the match statement I'm using the s option, which allows the . expression to match newline characters.
Input text
Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs. Cello has no friends. And Lillian also hasn't met anyone.
Code
<?php
$sourcestring="your source string";
preg_match_all('/(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))/s',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
Matches
$matches Array:
(
[0] => Array
(
[0] => Cello is a yellow parakeet who sings with Lillian.
[1] => Cello is a yellow Lillian.
[2] =>
Cello likes Lillian and kittens.
[3] =>
Lillian likes Cello and dogs.
)
[1] => Array
(
[0] => Cello is a yellow parakeet who sings with Lillian.
[1] => Cello is a yellow Lillian.
[2] => Cello likes Lillian and kittens.
[3] => Lillian likes Cello and dogs.
)
)
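Since the question was about Python, here is a sketch of how the same expression might be used there. It assumes Python 3 and the three-line input shown above; re.S is Python's counterpart to the PHP s option, and the names pattern and sentences are only illustrative. The regex is split across several string literals for readability and is otherwise unchanged.

```python
import re

# Same three-line input as in the PHP example.
text = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs. Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

pattern = re.compile(
    r'(?:(?<=\.)\s+|^)'                                   # sentence start
    r'('                                                  # capture group 1
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))'     # sentence has Cello
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))'   # sentence has Lillian
    r'.*?\.(?=\s|$)'                                      # rest of the sentence
    r')',
    re.S,  # let . match newlines, like the PHP s option
)

# With exactly one capturing group, findall() returns a list of strings.
sentences = pattern.findall(text)
for sentence in sentences:
    print(sentence)
```

Because the pattern has a single capturing group, findall() returns the contents of group 1 directly, which corresponds to the [1] array in the PHP output.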
If you absolutely need to match only sentences where Cello appears before Lillian, use an expression like this one. Here I've simply moved a single closing parenthesis, so the Lillian lookahead is nested inside the Cello lookahead.
(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))).*?\.(?=\s|$))
Input text
Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs. Cello has no friends. And Lillian also hasn't met anyone.
Output for capture group 1
[1] => Array
(
[0] => Cello is a yellow parakeet who sings with Lillian.
[1] => Cello is a yellow Lillian.
[2] => Cello likes Lillian and kittens.
)
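For completeness, the order-sensitive expression could be tried from Python along the following lines. This is again a Python 3 sketch under the same assumptions (same three-line input, re.S standing in for the PHP s option); the names ordered and in_order are hypothetical.

```python
import re

text = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs. Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

ordered = re.compile(
    r'(?:(?<=\.)\s+|^)'                                   # sentence start
    r'('                                                  # capture group 1
    r'(?='                                                # outer lookahead
    r'(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)'         # find Cello first
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))'   # then Lillian after it
    r')'
    r'.*?\.(?=\s|$)'                                      # rest of the sentence
    r')',
    re.S,
)

in_order = ordered.findall(text)
for sentence in in_order:
    print(sentence)
```

The sentence "Lillian likes Cello and dogs." is no longer returned, because the nested lookahead only searches for Lillian in the text that follows Cello.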