I've just started on regex.
I'm trying to search through a short list of 'phrases' to find UK mobile numbers (starting with +44 or 07, sometimes with the number broken up by one space). I'm having trouble getting it to return numbers starting +44.
This is what I've written:
for snippet in phrases:
    match = re.search("\\b(\+44|07)\\d+\\s?\\d+\\b", snippet)
    if match:
        numbers.append(match)
        print(match)
which prints
<_sre.SRE_Match object; span=(19, 31), match='07700 900432'>
<_sre.SRE_Match object; span=(20, 31), match='07700930710'>
and misses the number +447700900999, which is in 'phrases'.
I tried with and without the brackets. Without the brackets it would also print the +44 in sums like '10+44=54.' Is the backslash before the +44 necessary? Any ideas on what I'm missing?
Thanks to all!
EDIT: Some of my input:
phrases = ["You can call me on 07700 900432.",
           "My mobile number is 07700930710",
           "My date of birth is 07.08.92",
           "Why not phone me on 202-555-0136?",
           "There are around 7600000000 people on Earth",
           "If you're from overseas, call +44 7700 900190",
           "Try calling +447700900999 now!",
           "56+44=100."]
4 Solutions
#1
1
In your regex, the word boundary \b does not match between a whitespace character and a plus sign.
What you could do is match either 07 or +44, then match a digit or a space one or more times with [\d ]+, followed by a digit \d so that the match cannot end in whitespace, and add a word boundary \b at the end.
(?:07|\+44)[\d ]+\d\b
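As a rough check, the pattern above can be run against the sample input from the question. This is only a sketch: the `pattern` and `numbers` names are mine, and it collects the matched text rather than the match objects.

```python
import re

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

# 07 or a literal +44, then digits/spaces, ending on a digit at a word boundary.
pattern = re.compile(r"(?:07|\+44)[\d ]+\d\b")

numbers = []
for snippet in phrases:
    match = pattern.search(snippet)
    if match:
        numbers.append(match.group())

print(numbers)
# ['07700 900432', '07700930710', '+44 7700 900190', '+447700900999']
```

Note that there is no boundary check before 07 here, so a 07 in the middle of a longer run of digits would also start a match.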
#2
1
The problem with your regex is that the first \b matches the word boundary between the + and the 4. The boundary between a space and a + is not a word boundary. This means the regex can't find +44 after the \b, because the + is to the left of the \b and only 44 is to its right.
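A small sketch of that behaviour, using made-up test strings: \b only matches where exactly one side is a word character (letter, digit or underscore).

```python
import re

# Space and + are both non-word characters, so there is no boundary before the +.
after_space = re.search(r"\b\+44", "call +447700900999")

# In a sum, the digit before the + is a word character, so \b matches there.
in_sum = re.search(r"\b\+44", "10+44=54")

# The boundary sits between the + and the first 4.
after_plus = re.search(r"\b44", "call +447700900999")

print(after_space)         # None
print(in_sum.group())      # +44
print(after_plus.span())   # (6, 8)
```

This is also why, without the brackets, the original pattern picked up the +44 in sums like '10+44=54'.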
To fix this, you can use a negative lookbehind to make sure there is no word character immediately before +44. Remember to put it inside the capturing group, because it should only apply when the +44 option is chosen. You still want to match a word boundary if the number starts with 07.
((?<!\w)\+44|\b07)\d+\s?\d+\b
You can put the regex in an r"" (raw) string. This way you don't have to write as many backslashes:
r"((?<!\w)\+44|\b07)\d+\s?\d+\b"
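Here is a minimal sketch of the lookbehind version, written as (?<!\w), tried on a few of the question's phrases. The last sample string is a hypothetical input of mine, added to show the lookbehind rejecting a + that directly follows a word character.

```python
import re

pattern = re.compile(r"((?<!\w)\+44|\b07)\d+\s?\d+\b")

samples = [
    "You can call me on 07700 900432.",
    "Try calling +447700900999 now!",
    "If you're from overseas, call +44 7700 900190",
    "56+44=100.",
    "x+447700900999",  # hypothetical: + directly after a word character
]

results = []
for sample in samples:
    match = pattern.search(sample)
    results.append(match.group() if match else None)

print(results)
# ['07700 900432', '+447700900999', None, None, None]
```

The third sample is not matched because the pattern expects a digit straight after +44 and allows only one optional space later on. If numbers written as "+44 7700 900190" need to be caught, the digit part would have to be loosened.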
#3
0
This should help.
import re
phrases = ["Hello +4407700 900432 World", "Hello +44770090999 World"]
for snippet in phrases:
    match = re.search(r"(?P<num>(\+44|07)\d+\s?\d+)", snippet)
    if match:
        print(match.group('num'))
Output:
+4407700 900432
+44770090999
#4
0
You should be able to cover all cases by removing expected "noisy characters" from the string and simplifying your regex to just "(07|\D44)\d{9}", where:
(07|\D44) searches for a number starting with 07, or with 44 preceded by a non-digit character.
\d{9} matches the 9 digits that follow.
Your code should look like this:
cleansnippet = snippet.replace("-", "").replace(" ", "").replace("(0)", "")  # ...and so on for any other separators
re.search(r"(07|\D44)\d{9}", cleansnippet)
Applying this to your input retrieves this:
<_sre.SRE_Match object; span=(14, 25), match='07700900432'>
<_sre.SRE_Match object; span=(16, 27), match='07700930710'>
<_sre.SRE_Match object; span=(25, 37), match='+44770090019'>
<_sre.SRE_Match object; span=(10, 22), match='+44770090099'>
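The last two matches above stop one digit short of the numbers in the input (+447700900190 and +447700900999), because ten digits follow the 44 in the international form. One possible variant, assuming a UK mobile number is always a 7 followed by nine more digits, is sketched below. The `clean` helper and the adjusted pattern are my own illustration, not part of the original suggestion.

```python
import re

def clean(snippet):
    # Strip the separators mentioned above; extend as needed.
    return snippet.replace("-", "").replace(" ", "").replace("(0)", "")

# Either a leading 0 or a non-digit followed by 44, then 7 and nine more digits.
pattern = re.compile(r"(?:0|\D44)7\d{9}")

phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

found = []
for snippet in phrases:
    match = pattern.search(clean(snippet))
    if match:
        found.append(match.group())

print(found)
# ['07700900432', '07700930710', '+447700900190', '+447700900999']
```

As with \D44 in general, the match includes whatever non-digit character sits in front of the 44, which is a + here but could be a letter once the spaces are removed.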
Hope that helps.
P.S.: The \ before the + means that you are specifically looking for a literal + sign, instead of "1 or more" of the previous element.
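A quick sketch of the difference, using a made-up test string:

```python
import re

text = "56+44=100 and 4444"

# Escaped: \+ matches a literal plus sign.
literal = re.findall(r"\+44", text)

# Unescaped after a character: + means "one or more of the preceding 4".
quantified = re.findall(r"4+", text)

# Unescaped at the very start there is nothing to repeat, so compiling fails.
try:
    re.compile("+44")
    compiled = True
except re.error:
    compiled = False

print(literal)     # ['+44']
print(quantified)  # ['44', '4444']
print(compiled)    # False
```

So in the original pattern the backslash before +44 is needed.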
The only reason I propose \D44 instead of \+44 is that it could be safer for you, as people might forget to type the + before their number. :)