I am reading through http://docs.python.org/2/library/re.html. According to it, the "r" in Python's re.compile(r' pattern flags') refers to raw string notation:
The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
Would it be fair to say then that:
re.compile(r'pattern') means that "pattern" is a regex, while re.compile('pattern') means that "pattern" is an exact match?
3 Answers
#1
26
As @PauloBu stated, the r string prefix is not specifically related to regexes, but to strings in Python generally.
Normal strings use the backslash character as an escape character for special characters (like newlines):
>>> print 'this is \n a test'
this is
a test
The r prefix tells the interpreter not to do this:
>>> print r'this is \n a test'
this is \n a test
>>>
This is important in regular expressions, because the backslash needs to reach the re module intact. In particular, \b matches the empty string, but only at the start or end of a word. re expects the two-character string \b; however, in a normal string literal '\b' is converted to the ASCII backspace character, so you need to either explicitly escape the backslash ('\\b') or tell Python it is a raw string (r'\b').
>>> import re
>>> re.findall('\b', 'test') # the backslash gets consumed by the python string interpreter
[]
>>> re.findall('\\b', 'test') # backslash is explicitly escaped and is passed through to re module
['', '']
>>> re.findall(r'\b', 'test') # often this syntax is easier
['', '']
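To make the difference concrete, here is a small sketch that compares the three spellings as plain strings, before re ever sees them. It is written for Python 3 (where print is a function), and the variable names and the sample sentence are made up for illustration:

```python
import re

plain = '\b'      # one character: the ASCII backspace (0x08)
escaped = '\\b'   # two characters: a backslash followed by 'b'
raw = r'\b'       # the same two characters, written as a raw string

print(len(plain), len(escaped), len(raw))
print(escaped == raw)  # both literals produce the identical string object value

# Only the two-character form reaches re as a word-boundary assertion.
words = re.findall(r'\bcat\b', 'cat concat cat.')
print(words)
```

The 'cat' inside 'concat' is skipped because there is no word boundary in front of it.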
#2
6
No. As the documentation you pasted explains, the r prefix on a string indicates that the string is a raw string.
Because Python's string escaping and regex escaping both use the backslash character \, the two collide; raw strings provide a way to tell Python that you want the backslashes left unprocessed.
Examine the following:
>>> "\n"
'\n'
>>> r"\n"
'\\n'
>>> print "\n"
>>> print r"\n"
\n
Prefixing a string literal with r merely tells Python that backslashes \ should be treated literally and not as escape characters.
This is helpful when, for example, you are searching on a word boundary. The regex for this is \b; however, to express it in a normal Python string I'd need to use "\\b" as the pattern. Instead, I can use the raw string r"\b" as the pattern.
This becomes especially handy when trying to find a literal backslash with a regex. To match a backslash, the regex itself must be \\. Writing that in a normal Python string means escaping each backslash, so the pattern becomes "\\\\", or the much simpler r"\\".
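As a rough check of the claim above, the following sketch (Python 3; the sample path is a hypothetical string chosen only because it contains two backslashes) shows that both spellings yield the same two-character pattern and that it matches each literal backslash:

```python
import re

path = r'C:\temp\new'   # hypothetical sample text with two backslashes

doubled = '\\\\'        # normal literal: four typed, two in the resulting string
raw = r'\\'             # raw literal: two typed, two in the resulting string

print(doubled == raw)
print(len(raw))

# The two-character regex \\ matches one literal backslash per hit.
hits = re.findall(raw, path)
print(len(hits))
```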
As you can guess, in longer and more complex regexes the extra slashes can get confusing, so raw strings are generally considered the way to go.
#3
2
No. Not everything in regex syntax needs to be preceded by \, so ., *, +, etc. still have special meaning in a pattern.
The r'' prefix is often used as a convenience for regexes that need a lot of \, as it prevents the clutter of doubling up each \.
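A short sketch of this point, in Python 3 with made-up sample text: the pattern is interpreted as a regex whether or not the literal has an r prefix. If an exact, literal match is what you are after, one option (an addition here, not something the answers above mention) is the standard library's re.escape:

```python
import re

text = 'abc a.c'   # made-up sample text

# With no backslashes in the literal, the r prefix changes nothing at all.
print(r'a.c' == 'a.c')

# '.' is still a regex wildcard, raw string or not.
wildcard = re.findall(r'a.c', text)
print(wildcard)

# re.escape backslash-escapes the metacharacters, giving an exact match.
literal = re.findall(re.escape('a.c'), text)
print(literal)
```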