Convert a tuple from re.findall to a string?

Time: 2021-06-18 22:32:49

I wish to read in a text, use regex to find all instances of a pattern, then print the matching strings. If I use the re.search() method, I can successfully grab and print the first instance of the desired pattern:

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.search(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match.group())

Unfortunately, the re.search() method only finds the first instance of the desired pattern, so I substituted re.findall():

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

match = re.findall(r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)', text)
print(match)

This routine finds both instances of the target pattern in the sample text, but I can't find a way to print the sentences in which the patterns occur. Printing the result of this second snippet yields a list of tuples, [('Cello', ' with', 'Lillian'), ('Cello', ' yellow', 'Lillian')], instead of the output I desire: "Cello is a yellow parakeet who sings with Lillian. Cello is a yellow Lillian."

Is there a way to modify the second bit of code so as to obtain this desired output? I would be most grateful for any advice anyone can lend on this question.

2 Solutions

#1


1  

I would just make a big capturing group around the two endpoints:

import re

text = "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian."

for match in re.findall(r'(Cello(?:\W{1,80}\w{1,60}){0,9}\W{0,20}Lillian)', text, flags=re.I):
    print(match)

Now, you get the two sentences:

Cello is a yellow parakeet who sings with Lillian
Cello is a yellow Lillian

Some tips:

  • flags=re.I makes the regex case-insensitive, so Cello matches both cello and Cello.
  • (?:foo) is a non-capturing group: it groups like (foo), but its text is not returned as a separate captured item. That is useful for applying a quantifier to a sub-pattern without adding an entry to the findall results.
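
The tuples in the question come from how re.findall reports groups: when a pattern has more than one capturing group, it returns one tuple of group contents per match. As a minimal sketch of an alternative (not part of the original answer), re.finditer yields match objects, and group(0) is always the whole matched text, so the question's three-group pattern can stay unchanged. The variable names tuples and phrases are only illustrative.

```python
import re

text = ("Cello is a yellow parakeet who sings with Lillian. Toby is a clown "
        "who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.")

# The question's original pattern, with its three capturing groups left as-is.
pattern = r'(cello|Cello)(\W{1,80}\w{1,60}){0,9}\W{0,20}(lillian|Lillian)'

# findall returns one tuple of group contents per match.
tuples = re.findall(pattern, text)

# finditer returns match objects; group(0) is the full matched text.
phrases = [m.group(0) for m in re.finditer(pattern, text)]

for phrase in phrases:
    print(phrase)
```

This keeps the individual groups available through m.group(1) and m.group(3) if they are still needed, at the cost of a slightly longer loop than the single-group findall above.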

#2


3  

Description

Use lookaheads, as in this regex, to capture complete sentences that contain both Cello and Lillian.

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))

The expression breaks down into these functional components:

  • (?:(?<=\.)\s+|^) start matching the sentence either after a . followed by one or more whitespace characters, or at the start of the string
  • ( start capture group 1, which will capture the entire sentence
  • (?= start the lookahead
    • (?:(?!\.(?:\s|$)).)*? advance one character at a time, refusing to step over a . that is followed by whitespace or the end of the string, so the regex engine cannot leave this sentence
    • \b match a word boundary
    • [Cc]ello match the desired text, either all lower case or with a capital initial
    • (?=\s|\.|$) look ahead to ensure the word is followed by whitespace, a ., or the end of the string
    • ) end of the lookahead
  • (?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)) this does essentially the same, but for Lillian
  • .*?\.(?=\s|$) capture the rest of the sentence up to and including the period, and make sure the period is followed by either whitespace or the end of the string
  • ) end of capture group 1
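
The breakdown above can also be written directly into the pattern. The following is a sketch in Python rather than the answer's PHP, assuming Python's re module treats this expression the same way; it uses re.VERBOSE so each component carries its own comment. The names SENTENCE_RE and sample are made up for illustration.

```python
import re

# Same expression as above, laid out with re.VERBOSE so each part can be commented.
SENTENCE_RE = re.compile(r"""
    (?:(?<=\.)\s+|^)              # start after a period plus whitespace, or at string start
    (                             # group 1: the whole sentence
      (?=                         # lookahead 1: the sentence contains Cello
        (?:(?!\.(?:\s|$)).)*?     #   advance, but never past a sentence-ending period
        \b[Cc]ello(?=\s|\.|$)
      )
      (?=                         # lookahead 2: the sentence contains Lillian
        (?:(?!\.(?:\s|$)).)*?
        \b[Ll]illian(?=\s|\.|$)
      )
      .*?\.(?=\s|$)               # consume through the closing period
    )
""", re.VERBOSE | re.DOTALL)

sample = "Lillian met Cello. Toby sings. Cello is a yellow Lillian."
sentences = SENTENCE_RE.findall(sample)
print(sentences)
```

Verbose mode ignores whitespace in the pattern, which is safe here only because the expression contains no literal spaces.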

Code example

I don't know Python well enough, so I offer a PHP example. Note that the preg_match_all call uses the s modifier, which allows . to match newline characters.

Input text

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Code

<?php
$sourcestring="your source string";
preg_match_all('/(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$)).*?\.(?=\s|$))/s',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>

Matches

$matches Array:
(
    [0] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] =>  Cello is a yellow Lillian.
            [2] => 
Cello likes Lillian and kittens.
            [3] => 
Lillian likes Cello and dogs.
        )

    [1] => Array
        (
            [0] => Cello is a yellow parakeet who sings with Lillian.
            [1] => Cello is a yellow Lillian.
            [2] => Cello likes Lillian and kittens.
            [3] => Lillian likes Cello and dogs.
        )

)
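
Since the question is about Python, here is a rough Python counterpart of the PHP snippet. It is a sketch, not the answerer's code: it assumes re.DOTALL plays the role of PHP's s modifier, and the variable names source, pattern and sentences are illustrative. Because the pattern has exactly one capturing group, re.findall returns plain strings (the contents of group 1) rather than tuples.

```python
import re

source = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

# The same expression as in the PHP call, split across lines for readability.
pattern = re.compile(
    r'(?:(?<=\.)\s+|^)'
    r'((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$))'
    r'(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))'
    r'.*?\.(?=\s|$))',
    re.DOTALL,  # lets . match newlines, like PHP's /s
)

# One capturing group, so findall yields a list of strings.
sentences = pattern.findall(source)
for sentence in sentences:
    print(sentence)
```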

If you absolutely need to match only sentences where Cello appears before Lillian, use an expression like this one. Here I've simply moved a single closing parenthesis, so the Lillian lookahead is nested inside the Cello lookahead.

(?:(?<=\.)\s+|^)((?=(?:(?!\.(?:\s|$)).)*?\b[Cc]ello(?=\s|\.|$)(?=(?:(?!\.(?:\s|$)).)*?\b[Ll]illian(?=\s|\.|$))).*?\.(?=\s|$))

Input text

Cello is a yellow parakeet who sings with Lillian. Toby is a clown who doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.
Cello likes Lillian and kittens.
Lillian likes Cello and dogs.  Cello has no friends. And Lillian also hasn't met anyone.

Output for capture group 1

[1] => Array
    (
        [0] => Cello is a yellow parakeet who sings with Lillian.
        [1] => Cello is a yellow Lillian.
        [2] => Cello likes Lillian and kittens.
    )
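
To make the effect of the moved parenthesis easier to see, the sketch below builds both expressions in Python from shared fragments and lists which sentence the ordered variant drops. The fragment names (START, WITHIN, CELLO, LILLIAN, REST) and the variables any_order, cello_first and dropped are hypothetical helpers introduced only for this illustration.

```python
import re

source = (
    "Cello is a yellow parakeet who sings with Lillian. Toby is a clown who "
    "doesn't sing. Willy is a Wonka. Cello is a yellow Lillian.\n"
    "Cello likes Lillian and kittens.\n"
    "Lillian likes Cello and dogs.  Cello has no friends. "
    "And Lillian also hasn't met anyone."
)

START = r'(?:(?<=\.)\s+|^)'            # sentence start
WITHIN = r'(?:(?!\.(?:\s|$)).)*?'      # advance without leaving the sentence
CELLO = r'\b[Cc]ello(?=\s|\.|$)'
LILLIAN = r'\b[Ll]illian(?=\s|\.|$)'
REST = r'.*?\.(?=\s|$)'                # consume through the closing period

# Two sibling lookaheads: both names must appear, in any order.
any_order = re.compile(
    START + '((?=' + WITHIN + CELLO + ')(?=' + WITHIN + LILLIAN + ')' + REST + ')',
    re.DOTALL,
)

# Lillian lookahead nested inside the Cello lookahead: Lillian must follow Cello.
cello_first = re.compile(
    START + '((?=' + WITHIN + CELLO + '(?=' + WITHIN + LILLIAN + '))' + REST + ')',
    re.DOTALL,
)

ordered = cello_first.findall(source)
dropped = [s for s in any_order.findall(source) if s not in ordered]
print(ordered)
print(dropped)
```

The nested lookahead starts searching for Lillian from the position just after Cello, which is why a sentence that names Lillian first no longer qualifies.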
