I tried to compare re.match and re.search using the timeit module, and I found that match was faster than search when the string I wanted to find was at the beginning of the input string.
>>> s1 = '''
... import re
... re.search(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s1,number=10000)
32.12064480781555
>>> s = '''
... import re
... re.match(r'hello','helloab'*100000)
... '''
>>> timeit.timeit(stmt=s,number=10000)
30.9136700630188
Now, I am aware that match looks for the pattern at the beginning of the string and returns a match object if it is found, but what I am wondering is how search operates.
Does search perform any extra matching after the string is found at the beginning, which slows it down?
Update
After using @David Robinson's code, I got results similar to his.
>>> print timeit.timeit(stmt="r.match('hello')",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
49.9567620754
>>> print timeit.timeit(stmt="r.search('hello')",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
35.6694438457
So, the updated question is now: why is search outperforming match?
2 Solutions
#1
11
"So, the updated question is now why search is out-performing match?"
In this particular instance, where the pattern is a literal string rather than one containing regex metacharacters, re.search is indeed slightly faster than re.match in the default CPython implementation (I have not tested this in other Python implementations).
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
3.29107403755
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
... number = 10000000)
2.39184308052
Looking into the C code behind the re module, it appears that the search code has a built-in optimisation to quickly match patterns that start with a string literal. In the example above, the whole pattern is a literal string with no regex metacharacters, so this optimised routine is used to match the whole pattern.
Notice how the performance degrades once we introduce regex symbols and as the literal string prefix gets shorter:
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hell.')",
... number = 10000000)
3.20765399933
>>>
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('hel.o')",
... number = 10000000)
3.31512498856
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('he.lo')",
... number = 10000000)
3.31983995438
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('h.llo')",
... number = 10000000)
3.39261603355
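The series of measurements above can be reproduced with a small loop. This is only a sketch: it is written for Python 3 (the sessions above are Python 2), the helper name time_search and the reduced iteration count are my own choices, and the absolute numbers, and possibly the ordering, will differ between machines and Python versions.

```python
import re
import timeit

# Same subject string as in the measurements above.
s = 'helloab' * 100000

# Patterns ordered from the longest literal prefix to none at all.
patterns = ['hello', 'hell.', 'hel.o', 'he.lo', 'h.llo', '.ello']


def time_search(pattern, number=100000):
    """Return the total seconds taken by `number` calls of search(s).

    Hypothetical helper for illustration; `number` is far smaller than
    the 10000000 iterations used above so that it finishes quickly.
    """
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.search(s), number=number)


timings = {p: time_search(p) for p in patterns}

for pattern, seconds in timings.items():
    print('%-6s %.4f s' % (pattern, seconds))
```

Calling through a lambda adds a constant overhead to every measurement, so only the differences between patterns are meaningful here.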
For the portions of the pattern that contain regex metacharacters, SRE_MATCH is used to determine matches. That is essentially the same code that is behind re.match.
Note how the results are close (with re.match marginally faster) if the pattern starts with a regex metacharacter instead of a literal string.
>>> print timeit.timeit(stmt="r.match(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
... number = 10000000)
3.22782492638
>>> print timeit.timeit(stmt="r.search(s)",
... setup="import re; s = 'helloab'*100000; r = re.compile('.ello')",
... number = 10000000)
3.31773591042
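Timing aside, it may help to see what the two calls actually return. The following is a minimal sketch (Python 3, with a shortened subject string and variable names of my own choosing). It only illustrates the documented behaviour of match and search, not the C internals discussed above.

```python
import re

s = 'helloab' * 3
r = re.compile('hello')

# The pattern occurs at position 0: both methods report the same span.
m = r.match(s)
f = r.search(s)
print(m.span(), f.span())

# The pattern does not occur at position 0: match() only tries the
# start of the string, while search() keeps scanning forward.
t = 'ab' + s
print(r.match(t))
print(r.search(t).span())
```

In the benchmarks above the pattern always matches at position 0, so both methods find the same result and the timing difference comes purely from how each one gets there.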
In other words, ignoring the fact that search and match have different purposes, re.search is faster than re.match only when the pattern is a literal string.
Of course, if you're working with literal strings, you're likely to be better off using string operations instead.
>>> # Detecting exact matches
>>> print timeit.timeit(stmt="s == r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
0.339027881622
>>> # Determine if string contains another string
>>> print timeit.timeit(stmt="s in r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
0.479326963425
>>> # detecting prefix
>>> print timeit.timeit(stmt="s.startswith(r)",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
1.49393510818
>>> print timeit.timeit(stmt="s[:len(r)] == r",
... setup="s = 'helloab'*100000; r = 'hello'",
... number = 10000000)
1.21005606651
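One caveat about the timings above: with s as the long string and r as 'hello', the statement "s in r" asks whether the long string occurs inside the short one, which is always false here. The containment test that corresponds to re.search is "r in s". A small sketch of the four checks with the operands in that order (Python 3, variable names on the left are my own):

```python
s = 'helloab' * 100000
r = 'hello'

# Exact equality: False here, because s is much longer than r.
is_exact = (s == r)

# Substring test corresponding to re.search: is r somewhere in s?
contains = (r in s)

# Prefix tests corresponding to re.match: does s start with r?
has_prefix = s.startswith(r)
has_prefix_slice = (s[:len(r)] == r)

print(is_exact, contains, has_prefix, has_prefix_slice)
```

These checks are not timed here; the sketch is only meant to show which operand goes on which side.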
#2
6
On my machine (Python 2.7.3 on Mac OS 10.7.3, 1.7 GHz Intel Core i5), when I put the string construction, the import of re, and the regex compilation in setup, and perform 10000000 iterations instead of 10, I find the opposite:
import timeit
print timeit.timeit(stmt="r.match(s)",
setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
number = 10000000)
# 6.43165612221
print timeit.timeit(stmt="r.search(s)",
setup="import re; s = 'helloab'*100000; r = re.compile('hello')",
number = 10000000)
# 3.85176897049
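For anyone repeating this on Python 3, a rough equivalent is sketched below. It uses timeit.repeat and keeps the minimum of several runs to reduce noise. The helper name best_of and the smaller iteration count are assumptions of mine, and the results on a current interpreter may not show the same gap as the Python 2.7 figures above.

```python
import timeit

# Same setup string as above: the import, the string construction and
# the compilation all happen outside the timed statement.
setup = "import re; s = 'helloab' * 100000; r = re.compile('hello')"


def best_of(stmt, repeat=5, number=100000):
    """Return the fastest of `repeat` runs of `number` executions."""
    return min(timeit.repeat(stmt=stmt, setup=setup,
                             repeat=repeat, number=number))


match_time = best_of("r.match(s)")
search_time = best_of("r.search(s)")

print("match:  %.4f s" % match_time)
print("search: %.4f s" % search_time)
```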