Getting consecutive capitalized words using regex

I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

Here is the regex I am using:

re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']

3 Solutions

#1

Use a positive look-ahead:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

The look-ahead asserts that the first word is accepted only if it is followed by whitespace and another word that starts with a capital letter. Broken down:

(                    # begin capture
  [A-Z]              # one uppercase letter   \ First word
  [a-z]+             # 1+ lowercase letters   /
  (?=\s[A-Z])        # must be followed by whitespace and an uppercase letter
  (?:                # non-capturing group
    \s               # whitespace
    [A-Z]            # one uppercase letter   \ Additional word(s)
    [a-z]+           # 1+ lowercase letters   /
  )+                 # group can repeat (more words)
)                    # end capture
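
To try this pattern from Python, here is a minimal sketch that compiles it with re.VERBOSE so the comments can stay inline. The names CAPITALIZED_RUN, sample and matches are made up for illustration; the sample sentence is the one from the question.

```python
import re

# Pattern from the answer above, compiled with re.VERBOSE so that
# whitespace and comments inside the pattern are ignored.
# CAPITALIZED_RUN is a hypothetical name chosen for this sketch.
CAPITALIZED_RUN = re.compile(r"""
    (                      # begin capture
      [A-Z][a-z]+          # first word
      (?=\s[A-Z])          # must be followed by whitespace + uppercase letter
      (?:\s[A-Z][a-z]+)+   # one or more additional words
    )                      # end capture
""", re.VERBOSE)

sample = "said Polly Pocket and the toys"  # sentence from the question
matches = CAPITALIZED_RUN.findall(sample)
print(matches)  # ['Polly Pocket']
```

Since the pattern has a single capturing group, findall returns a flat list of strings. The non-capturing group that follows the look-ahead already requires whitespace plus an uppercase letter, so the look-ahead does not change which strings match here; it mainly documents the intent.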

#2

It's because findall returns the capturing groups instead of the whole match when the pattern contains groups, and you have two of them (the outer one that gets all the matching text, and the inner one for subsequent words). With more than one group, each result is a tuple.

You can just make your second capturing group into a non-capturing one by using (?:regex) instead of (regex):

re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
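
A short sketch comparing the two behaviours side by side may help. It uses the sentence from the question as article; the variable names two_groups, one_group and whole_matches are only illustrative.

```python
import re

article = "said Polly Pocket and the toys"  # sentence from the question

# Two capturing groups: findall yields one tuple per match.
two_groups = re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

# Inner group made non-capturing: findall yields plain strings.
one_group = re.findall(r'said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)

# Alternative: keep the groups and read the whole match from finditer.
whole_matches = [
    m.group(0)
    for m in re.finditer(r'[A-Z][\w-]*(\s+[A-Z][\w-]*)+', article)
]

print(two_groups)     # [('Polly Pocket', ' Pocket')]
print(one_group)      # ['Polly Pocket']
print(whole_matches)  # ['Polly Pocket']
```

The finditer variant is useful when the inner group has to stay capturing for some other reason, because group(0) is always the full match.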

#3

$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";

@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;

print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";

OUTPUT:

$ ./try_me.pl

United States, America, New York, Los Angeles, Atlanta

# phrases = 5
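
Since the question uses Python, here is a rough Python equivalent of the Perl snippet above, as a sketch rather than a drop-in replacement. The names text, PHRASE and phrases are chosen for illustration. Note that the trailing * (instead of +) means single capitalized words such as "America" also match.

```python
import re

# Same sample sentence as the Perl snippet.
text = ("the United States of America has many big cities like "
        "New York and Los Angeles, and others like Atlanta")

# No capturing groups, so findall returns the full matches.
# The final * allows zero additional words, so lone capitalized
# words are included as well.
PHRASE = re.compile(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*")

phrases = PHRASE.findall(text)
print(", ".join(phrases))
print("# phrases =", len(phrases))
```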
