I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis, though, and I don't understand how a regex compares to, say, a basic linear search. I'm asking because on the surface it looks like a linear array search (if the regex is simple).
Where could I go to learn more about implementing a regex engine?
3 answers
#1
41
This is one of the most popular overviews: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string of length n is indeed O(n), but building the DFA can require up to O(2^m) time and space (where m is the size of the regular expression).
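To make the O(2^m) bound concrete, here is a minimal Python sketch (not taken from the linked article) of the subset construction applied to a textbook worst case: the language "the k-th symbol from the end is a", i.e. (a|b)*a followed by k-1 copies of (a|b). The function names and the dictionary encoding of the NFA are hypothetical, chosen only for illustration. The NFA has k+1 states, while the DFA built from it has 2^k reachable states.

```python
from collections import deque


def nth_from_end_nfa(k):
    """NFA for '(a|b)*a' followed by k-1 copies of '(a|b)'.

    States are 0..k and state k is accepting. There are no epsilon
    edges, so no closure step is needed (an assumption of this sketch).
    """
    delta = {(0, "a"): {0, 1}, (0, "b"): {0}}
    for i in range(1, k):
        delta[(i, "a")] = {i + 1}
        delta[(i, "b")] = {i + 1}
    return delta, 0, {k}


def subset_construction(delta, start, accepting, alphabet="ab"):
    """Build the reachable part of a DFA whose states are sets of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                t for s in current for t in delta.get((s, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, start_set, dfa_accepting, seen


def dfa_state_count(k):
    """Number of reachable DFA states for the k-th-from-end language."""
    delta, start, accepting = nth_from_end_nfa(k)
    return len(subset_construction(delta, start, accepting)[3])


if __name__ == "__main__":
    for k in (1, 3, 10):
        print(k, dfa_state_count(k))
```

Since the regex grows linearly with k while the DFA doubles with each increment, the construction cost is exponential in the size of the expression, even though matching afterwards is a single pass over the input.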
#2
6
Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?
Real regular expressions (by "real" I mean regexes that recognize regular languages, not the regexes with backreferences and similar extensions that almost every programming language includes) can be converted into an NFA or a DFA, and both can be implemented mechanically in a programming language (an NFA can always be converted into a DFA).
What you have to do is:
- Find a way to convert a regex into an automaton
- Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether it matches a given text.
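The two steps above can be sketched as follows. This is a minimal Python illustration, not a definitive implementation: it uses a Thompson-style construction (regex to NFA with epsilon edges) and then simulates the NFA over sets of states instead of building the DFA up front. It assumes a tiny syntax of literals, `|`, `*` and parentheses only; `compile_regex`, `closure` and `matches` are hypothetical names, and error handling is minimal.

```python
def compile_regex(pattern):
    """Thompson-style construction for literals, '|', '*' and parentheses.

    Returns (eps, step, start, accept) where eps[s] lists the epsilon
    successors of state s and step[s] maps a character to the next state.
    """
    eps, step = [], []
    pos = 0

    def new_state():
        eps.append([])
        step.append({})
        return len(eps) - 1

    def alternation():
        nonlocal pos
        start, end = concatenation()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            start2, end2 = concatenation()
            s, e = new_state(), new_state()
            eps[s] += [start, start2]      # branch into either alternative
            eps[end].append(e)
            eps[end2].append(e)
            start, end = s, e
        return start, end

    def concatenation():
        start = end = new_state()          # fragment matching the empty string
        while pos < len(pattern) and pattern[pos] not in "|)":
            start2, end2 = repetition()
            eps[end].append(start2)        # chain fragments together
            end = end2
        return start, end

    def repetition():
        nonlocal pos
        start, end = atom()
        while pos < len(pattern) and pattern[pos] == "*":
            pos += 1
            s, e = new_state(), new_state()
            eps[s] += [start, e]           # enter the loop or skip it
            eps[end] += [start, e]         # repeat the loop or leave it
            start, end = s, e
        return start, end

    def atom():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == "(":
            start, end = alternation()
            if pos >= len(pattern) or pattern[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return start, end
        s, e = new_state(), new_state()
        step[s][ch] = e                    # single labelled edge for a literal
        return s, e

    start, accept = alternation()
    if pos != len(pattern):
        raise ValueError("unexpected ')'")
    return eps, step, start, accept


def closure(eps, states):
    """All states reachable from `states` through epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps[s]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def matches(pattern, text):
    """True if the whole of `text` is matched by `pattern`."""
    eps, step, start, accept = compile_regex(pattern)
    current = closure(eps, {start})
    for ch in text:
        current = closure(eps, {step[s][ch] for s in current if ch in step[s]})
    return accept in current
```

For example, `matches("a(b|c)*d", "abcbd")` follows the set of possible NFA states one input character at a time. Each set of states corresponds to one DFA state, so this is the same recognition a precomputed DFA would perform, with the states computed on demand.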
This can be implemented in O(n), because a DFA never moves backward over its input (unlike a Turing machine): it reads each character once and then either accepts the string or not. That assumes you don't need to report overlapping matches; otherwise you will have to go back and start matching again.
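As a small illustration of why the run is O(n), here is a hand-built DFA for (a|b)*abb, written as a Python transition table. The table and the name `dfa_accepts` are assumptions made for this sketch; a real engine would generate the table from the regex. The loop does exactly one table lookup per input character and never revisits earlier input.

```python
# Transition table for (a|b)*abb. State 3 is the only accepting state.
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START, ACCEPTING = 0, {3}


def dfa_accepts(text):
    """Run the DFA over `text`, one lookup per character."""
    state = START
    for ch in text:
        state = DFA[state].get(ch)
        if state is None:          # character outside the alphabet: reject
            return False
    return state in ACCEPTING
```

For example, `dfa_accepts("aabb")` visits states 1, 1, 2, 3 and accepts, while `dfa_accepts("abba")` ends in state 1 and rejects.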
#3
4
The classic regular expression can be implemented in a way that is fast in practice but has really bad worst-case behaviour (the standard backtracking implementation), or in a way that has guaranteed reasonable worst-case behaviour (keeping it as an NFA and simulating all of its states in parallel). The standard implementation can be extended to support lots of extra matching features and flags, which make use of the fact that it is basically a backtracking search.
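A rough way to see the difference is to count the work done by both strategies on a pathological pattern. The Python sketch below is only an illustration under simplifying assumptions: a pattern is a list of hypothetical `(char, optional)` tokens, so `a?a?a?aaa` becomes three optional tokens followed by three required ones, and both matchers are toy versions written for this comparison, not the code of any real engine.

```python
def backtracking_match(tokens, text):
    """Greedy backtracking search; returns (matched, number_of_calls)."""
    calls = 0

    def go(i, j):
        nonlocal calls
        calls += 1
        if i == len(tokens):
            return j == len(text)
        ch, optional = tokens[i]
        # Try to consume a character first, then fall back to skipping.
        if j < len(text) and text[j] == ch and go(i + 1, j + 1):
            return True
        return optional and go(i + 1, j)

    return go(0, 0), calls


def state_set_match(tokens, text):
    """Track every reachable pattern position at once; returns (matched, work)."""

    def close(states):
        # A position in front of an optional token can also skip over it.
        out = set()
        for s in states:
            out.add(s)
            while s < len(tokens) and tokens[s][1]:
                s += 1
                out.add(s)
        return out

    current = close({0})
    work = 0
    for c in text:
        work += len(current)
        current = close(
            {s + 1 for s in current if s < len(tokens) and tokens[s][0] == c}
        )
    return len(tokens) in current, work


def pathological(n):
    """Tokens for 'a?' * n + 'a' * n, and the input 'a' * n."""
    return [("a", True)] * n + [("a", False)] * n, "a" * n


if __name__ == "__main__":
    tokens, text = pathological(12)
    print(backtracking_match(tokens, text))
    print(state_set_match(tokens, text))
```

Both functions agree on the answer, but the number of backtracking calls grows exponentially with n, while the state-set version does at most (number of tokens + 1) units of work per input character.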
Examples of the standard approach are everywhere (e.g. built into Perl). There is an implementation that claims good worst-case behaviour at http://code.google.com/p/re2/ - in fact it does even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or are worried about writing programs that can be made to lock up solid by pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.