Length of a Python regular expression source string

Time: 2020-12-23 21:37:36

In Python regular expressions,

re.compile("x"*50000)

gives me OverflowError: regular expression code size limit exceeded

but the following one does not raise any error; instead it pegs the CPU at 100% and took 1 minute on my PC:

>>> re.compile(".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000)
<_sre.SRE_Pattern object at 0x03FB0020>

Is that normal?

Should I assume that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 is shorter than "x"*50000?

Tested on Python 2.6, Win32.

UPDATE 1:

It looks like ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 could be reduced to ".*?".

So, how about this one?

re.compile(".*?x"*50000)

It does compile, and if that one can also be reduced to ".*?x", it should match the string "abcx" or "x" alone, but it does not match.
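A scaled-down check may help here. This is only a sketch: it uses 3 repetitions as a stand-in for 50000, and the names repeated, single and triple are made up for illustration. It suggests that ".*?x"*n is not the same pattern as ".*?x", because every repetition has to consume its own "x", so the subject needs at least n of them.

```python
import re

def repeated(piece, count):
    """Build the pattern the same way the question does: piece * count."""
    return re.compile(piece * count)

single = re.compile(".*?x")
triple = repeated(".*?x", 3)   # small stand-in for ".*?x" * 50000

print(bool(single.match("abcx")))    # True: one "x" is enough
print(bool(triple.match("abcx")))    # False: three "x" characters are required
print(bool(triple.match("axbxcx")))  # True: the subject contains three "x"
```

By the same reasoning, ".*?x"*50000 would only match a string containing at least 50000 "x" characters, which "abcx" and "x" do not.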

So, am I missing something?

UPDATE 2:

My point is not to find the maximum length of a regex source string. I would like to know the reason why "x"*50000 is caught by the overflow handler but ".*?x"*50000 is not.

It does not make sense to me, that's why.

Is something missing in the overflow check, is this just fine, or is it really overflowing something?

Any hints/opinions will be appreciated.

2 Answers

#1


6  

The difference is that ".*?.*?.*?.*?.*?.*?.*?.*?.*?.*?"*50000 can be reduced to ".*?", while "x"*50000 has to generate 50000 nodes in the FSM (or a similar structure used by the regex engine).

EDIT: OK, I was wrong. It's not that smart. The reason why "x"*50000 fails but ".*?x"*50000 doesn't is that there is a limit on the size of one "code item". "x"*50000 generates one long item, while ".*?x"*50000 generates many small items. If you could somehow split the string literal without changing the meaning of the regex, it would work, but I can't think of a way to do that.
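If you want to see how your own interpreter treats the two shapes of pattern, something like the following could be used. It is a sketch, not a statement about the engine internals: try_compile and COUNT are names invented here, the limits differ between Python versions and builds, and COUNT is kept small so the run stays fast (the question used 50000).

```python
import re

def try_compile(pattern):
    """Return (True, None) if re.compile accepts the pattern,
    otherwise (False, name of the exception class that was raised)."""
    try:
        re.compile(pattern)
    except (re.error, OverflowError, RuntimeError) as exc:
        return False, type(exc).__name__
    return True, None

# Hypothetical scale knob: raise it towards 50000 to mirror the question.
COUNT = 1000

for piece in ("x", ".*?x"):
    ok, err = try_compile(piece * COUNT)
    print("%-6r * %d -> %s" % (piece, COUNT, "compiled" if ok else err))
```

Catching the exception instead of letting it propagate makes it easy to compare one long literal against many short items across several sizes in a single run.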

#2


1  

You want to match 50000 "x"s, correct? If so, here is an alternative without regex:

if "x"*50000 in mystring:
    print "found"

If you want to match 50000 "x"s using a regex, you can use a repetition count:

>>> pat=re.compile("x{50000}")
>>> pat.search(s)
<_sre.SRE_Match object at 0xb8057a30>

On my system the count can be at most 65535:

>>> pat=re.compile("x{65536}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.6/re.py", line 241, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python2.6/sre_compile.py", line 529, in compile
    groupindex, indexgroup
RuntimeError: invalid SRE code
>>> pat=re.compile("x{65535}")
>>>
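To compare the two approaches side by side, here is a small sketch. The helper names has_run_in, has_run_re and max_repeat_ok are made up for this example, and the sample string uses a run of 5 rather than 50000. Whether a count above 65535 compiles depends on the Python version and build, so the last loop only reports what your interpreter does.

```python
import re

def has_run_in(text, char, count):
    """Substring test: does text contain `count` consecutive copies of char?"""
    return char * count in text

def has_run_re(text, char, count):
    """Same test expressed with a {count} quantifier."""
    pattern = "%s{%d}" % (re.escape(char), count)
    return re.search(pattern, text) is not None

def max_repeat_ok(count):
    """Report whether this interpreter accepts x{count} at compile time."""
    try:
        re.compile("x{%d}" % count)
    except (re.error, OverflowError, RuntimeError):
        return False
    return True

sample = "ab" + "x" * 5 + "cd"
for n in (3, 5, 6):
    # Both columns agree: True for 3 and 5, False for 6.
    print("%d %s %s" % (n, has_run_in(sample, "x", n), has_run_re(sample, "x", n)))

for n in (65535, 65536):
    print("x{%d} compiles: %s" % (n, max_repeat_ok(n)))
```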

I don't know whether there are tweaks in Python we can use to raise that limit, though.
