Short question:
I have a string:
title="Announcing Elasticsearch.js For Node.js And The Browser"
I want to find all pairs of words where each word is properly capitalized.
So the expected output should be:
['Announcing Elasticsearch.js', 'Elasticsearch.js For', 'For Node.js', 'Node.js And', 'And The', 'The Browser']
What I have right now is this:
'[A-Z][a-z]+[\s-][A-Z][a-z.]*'
This gives me the output:
['Announcing Elasticsearch.js', 'For Node.js', 'And The']
How can I change my regex to give the desired output?
3 solutions
#1
2
You can use this:
#!/usr/bin/python
import re
title="Announcing Elasticsearch.js For Node.js And The Browser TEst"
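# The trailing (?![A-Za-z]) stops a malformed word such as "TEst" from yielding the partial pair "Browser T".
pattern = r'(?=((?<![A-Za-z.])[A-Z][a-z.]*[\s-][A-Z][a-z.]*(?![A-Za-z])))'
print(re.findall(pattern, title))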
A "normal" pattern can't match overlapping substrings, because each character is consumed by at most one match. A lookahead (?=...) (i.e. "followed by"), however, is only a check and consumes nothing, so the regex engine can examine the same characters several times. Thus, if you put a capturing group inside the lookahead, you can obtain overlapping substrings.
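To make the difference concrete, here is a minimal sketch that runs the same pair pattern twice: once as an ordinary pattern and once wrapped in a lookahead with a capturing group. The names PAIR, consumed and overlapping are illustrative only. The trailing (?![A-Za-z]) used here is a guard that keeps a malformed word such as "TEst" from contributing a partial pair.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser TEst"

# Two capitalized words joined by whitespace or a hyphen. The trailing
# (?![A-Za-z]) rejects a second word that continues with more letters.
PAIR = r"[A-Z][a-z.]*[\s-][A-Z][a-z.]*(?![A-Za-z])"

# Ordinary search: each match consumes both words, so the next search
# starts after the second word and every other pair is lost.
consumed = re.findall(PAIR, title)

# Lookahead search: the match itself is empty, so the engine moves on by
# one character only and the capturing group can overlap the previous one.
# The lookbehind keeps a match from starting in the middle of a word.
overlapping = re.findall(r"(?=((?<![A-Za-z.])" + PAIR + r"))", title)

print(consumed)
print(overlapping)
```

The first list should contain three pairs and the second all six, since only the lookahead version revisits the second word of each pair.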
#2
0
There's probably a more efficient way to do this, but you could use a regex like this:
(\b[A-Z][a-z.-]+\b)
Then iterate over the matches, testing each one against this regex: (^[A-Z][a-z.-]+$) to ensure that both the current match and the next one are capitalized words.
Working example:
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
matchlist = []
m = re.findall(r"(\b[A-Z][a-z.-]+\b)", title)
# Start at 1 so that m[i - 1] never wraps around to the last word.
for i in range(1, len(m)):
    if re.match(r"(^[A-Z][a-z.-]+$)", m[i - 1]) and re.match(r"(^[A-Z][a-z.-]+$)", m[i]):
        matchlist.append([m[i - 1], m[i]])
print(matchlist)
Output:
[
['Announcing', 'Elasticsearch.js'],
['Elasticsearch.js', 'For'],
['For', 'Node.js'],
['Node.js', 'And'],
['And', 'The'],
['The', 'Browser']
]
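A variant of this idea, sketched below, splits the title on whitespace and pairs each word with its neighbour using zip. This is a swapped-in technique rather than the code from this answer: it assumes words are separated by whitespace only, and the names CAPITALIZED and capitalized_pairs are made up for illustration. Its one advantage is that only words that are actually adjacent in the title can form a pair.

```python
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# A capitalized word: one uppercase letter, then lowercase letters, dots or hyphens.
CAPITALIZED = re.compile(r"[A-Z][a-z.-]*")


def capitalized_pairs(text):
    """Return adjacent word pairs in which both words are capitalized."""
    words = text.split()
    return [
        f"{left} {right}"
        for left, right in zip(words, words[1:])
        if CAPITALIZED.fullmatch(left) and CAPITALIZED.fullmatch(right)
    ]


print(capitalized_pairs(title))
```

Because fullmatch has to cover the whole word, a malformed word such as "TEst" or a lowercase word in between simply drops out of every pair it would have been part of.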
#3
0
If your Python code at the moment is this
import re

title = "Announcing Elasticsearch.js For Node.js And The Browser"
results = re.findall(r"[A-Z][a-z]+[\s-][A-Z][a-z.]*", title)
then your program is skipping every second pair, because findall consumes both words of a match before it searches again. An easy solution would be to search for the pattern again after skipping the first word, like this:
m = re.match(r"[A-Z][a-z]+[\s-]", title)
title_without_first_word = title[m.end():]
# The first word must also allow '.', otherwise "Elasticsearch.js For" is never matched.
results2 = re.findall(r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*", title_without_first_word)
Now just combine results and results2.
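One possible way to combine the two lists while keeping the pairs in title order is to interleave them. The sketch below is self-contained and uses a single pattern, named PAIR here, that allows '.' in both words. It assumes every word in the title is capitalized, so that the two lists alternate strictly; with mixed-case titles the ordering would need a different approach.

```python
import re
from itertools import chain, zip_longest

title = "Announcing Elasticsearch.js For Node.js And The Browser"

# Both words may contain dots, so "Elasticsearch.js" can start a pair too.
PAIR = r"[A-Z][a-z.]+[\s-][A-Z][a-z.]*"

# Pairs found when the search starts at the first word.
results = re.findall(PAIR, title)

# Pairs found when the search starts at the second word.
# Assumption: the title begins with a capitalized word, so first_word is not None.
first_word = re.match(r"[A-Z][a-z.]+[\s-]", title)
results2 = re.findall(PAIR, title[first_word.end():])

# Interleave the two lists; zip_longest pads the shorter one with None.
combined = [
    pair
    for pair in chain.from_iterable(zip_longest(results, results2))
    if pair is not None
]
print(combined)
```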