Regex: how do I make my code match a '+' character or a digit

Date: 2021-03-11 14:58:51

I've just started on regex.

I'm trying to search through a short list of 'phrases' to find UK mobile numbers (starting with +44 or 07, sometimes with the number broken up by one space). I'm having trouble getting it to return numbers starting +44.

This is what I've written:

for snippet in phrases:
    match = re.search("\\b(\+44|07)\\d+\\s?\\d+\\b", snippet)
    if match:
        numbers.append(match)
        print(match)

which prints

    <_sre.SRE_Match object; span=(19, 31), match='07700 900432'>
    <_sre.SRE_Match object; span=(20, 31), match='07700930710'>

and misses out the number +44770090999 which is in 'phrases.'

I tried with and without the brackets. Without the brackets it would also print the +44 in sums like '10+44=54.' Is the backslash before the +44 necessary? Any ideas on what I'm missing?

Thanks to all!

EDIT: Some of my input:

phrases = ["You can call me on 07700 900432.",
           "My mobile number is 07700930710",
           "My date of birth is 07.08.92",
           "Why not phone me on 202-555-0136?",
           "There are around 7600000000 people on Earth",
           "If you're from overseas, call +44 7700 900190",
           "Try calling +447700900999 now!",
           "56+44=100."]

4 answers

#1


1  

In your regex the word boundary \b does not match between a whitespace and a plus sign.

Instead, you could match either 07 or +44, then one or more digits or spaces with [\d ]+, then a single digit \d so that the match cannot end on a space, and finish with a word boundary \b:

(?:07|\+44)[\d ]+\d\b
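
As a rough check, the pattern above can be run against the sample input from the question. This is only a sketch: the `phrases` list is copied from the question (with a comma after the fourth item, which is an assumption about what was intended), and `numbers` collects the matched text rather than the match objects.

```python
import re

# Sample input copied from the question.
phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

# Prefix 07 or +44, then digits/spaces, ending on a digit at a word boundary.
pattern = re.compile(r"(?:07|\+44)[\d ]+\d\b")

numbers = []
for snippet in phrases:
    match = pattern.search(snippet)
    if match:
        # group(0) is the matched text itself, not the match object.
        numbers.append(match.group(0))

print(numbers)
```

With this input, the date, the US-style number, the population figure and the sum are skipped, because none of them has 07 or +44 followed directly by a digit or a space.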

#2


1  

The problem with your regex is that the first \b can only match the word boundary between the + and the 4. The boundary between a space and a + is not a word boundary. This means the regex can't find +44 after the \b, because the + is on the left of the \b and only 44 is on its right.
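
The word-boundary behaviour described here can be checked directly. A minimal sketch; the test strings are made up for illustration and the variable names are arbitrary:

```python
import re

# \b only exists between a word character (\w) and a non-word character.
# A space and '+' are both non-word characters, so there is no boundary
# between them and \b\+44 cannot match here.
space_then_plus = re.search(r"\b\+44", "call +447700900999")

# In a sum, the digit before '+' is a word character, so \b does sit
# between '6' and '+', and the same pattern matches.
digit_then_plus = re.search(r"\b\+44", "56+44=100")

# Before 07 the transition is space -> digit, which is a word boundary.
space_then_zero = re.search(r"\b07", "call 07700 900432")

print(space_then_plus)               # None
print(digit_then_plus is not None)   # True
print(space_then_zero is not None)   # True
```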

To fix this, you can use a negative lookbehind to make sure there is no word character immediately before +44. Remember to put it inside the capturing group, because it should only apply when the +44 alternative is chosen. You still want to match a word boundary when the number starts with 07.

((?<!\w)\+44|\b07)\d+\s?\d+\b

You can put the regex in a raw string (r""). That way you don't have to double every backslash:

r"((?<!\w)\+44|\b07)\d+\s?\d+\b"
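
A small sketch of the lookbehind idea, assuming the negative lookbehind form `(?<!\w)` that the text describes and keeping `\b` in front of 07. The helper `find_number` is a hypothetical name used only for this illustration:

```python
import re

# Assumption: '+44' must not be preceded by a word character (lookbehind),
# while '07' must start at a word boundary.
pattern = re.compile(r"(?:(?<!\w)\+44|\b07)\d+\s?\d+\b")


def find_number(text):
    """Return the first matched number in text, or None (illustrative helper)."""
    match = pattern.search(text)
    return match.group(0) if match else None


print(find_number("Try calling +447700900999 now!"))
print(find_number("You can call me on 07700 900432."))
print(find_number("56+44=100."))  # the '6' before '+' blocks the lookbehind

# A raw string and a doubled-backslash string spell the same pattern.
print("\\b07" == r"\b07")
```

Note that this pattern allows at most one space, and only after at least one digit following the prefix, so a number written as "+44 7700 900190" would need a looser pattern.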

#3


0  

This should help.

import re
phrases = ["Hello +4407700 900432 World", "Hello +44770090999 World"]
for snippet in phrases:
    match = re.search(r"(?P<num>(\+44|07)\d+\s?\d+)", snippet)
    if match:
        print(match.group('num'))

Output:

+4407700 900432
+44770090999

#4


0  

You should be able to cover all cases by first removing the expected "noisy characters" from the string and then simplifying your regex to just "(07|\D44)\d{9}", where:

(07|\D44) matches either 07, or 44 preceded by a single non-digit character. \d{9} matches the remaining 9 digits.

Your code should look like this:

cleansnippet = snippet.replace("-","").replace(" ","").replace("(0)","")  # ... further replace() calls as needed
re.search(r"(07|\D44)\d{9}", cleansnippet)

Applying this to your input retrieves this:

<_sre.SRE_Match object; span=(14, 25), match='07700900432'>
<_sre.SRE_Match object; span=(16, 27), match='07700930710'>
<_sre.SRE_Match object; span=(25, 37), match='+44770090019'>  
<_sre.SRE_Match object; span=(10, 22), match='+44770090099'>
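
For reference, here is a self-contained sketch of this clean-then-search approach. It assumes only the three replace() calls shown above (whatever was elided is unknown and is not guessed at), and it reuses the sample input from the question, with a comma assumed after the fourth item. The name `found` is just for illustration.

```python
import re

# Sample input copied from the question.
phrases = [
    "You can call me on 07700 900432.",
    "My mobile number is 07700930710",
    "My date of birth is 07.08.92",
    "Why not phone me on 202-555-0136?",
    "There are around 7600000000 people on Earth",
    "If you're from overseas, call +44 7700 900190",
    "Try calling +447700900999 now!",
    "56+44=100.",
]

pattern = re.compile(r"(07|\D44)\d{9}")

found = []
for snippet in phrases:
    # Strip hyphens, spaces and "(0)" before searching.
    cleansnippet = snippet.replace("-", "").replace(" ", "").replace("(0)", "")
    match = pattern.search(cleansnippet)
    if match:
        found.append(match.group(0))

print(found)
```

As in the output above, \d{9} stops after nine digits, so the +44 matches end one digit short of the number as written in the input.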

Hope that helps.

P.S.: The \ before the + means that you are looking for a literal + sign, rather than applying the "1 or more" quantifier to the previous element.
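
The difference between an escaped and an unescaped + can be seen in a short sketch. The sample string is taken from the question's sums; the variable names are arbitrary:

```python
import re

text = "56+44=100"

# Escaped: \+ is a literal plus sign, so this finds the text "+44".
literal_plus = re.findall(r"\+44", text)

# Unescaped: + is a quantifier meaning "one or more of the previous element",
# so 4+ finds runs of the digit 4.
quantified = re.findall(r"4+", text)

# A pattern that starts with a bare + has nothing for it to repeat.
try:
    re.compile("+44")
    bare_plus_error = None
except re.error as exc:
    bare_plus_error = exc

# re.escape adds the backslash for you when the text comes from elsewhere.
escaped = re.escape("+44")

print(literal_plus, quantified, escaped, bare_plus_error is not None)
```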

The only reason I propose \D44 instead of \+44 is that it could be safer for you, since people may forget to type the + before their number. :)
