I was reading this article today on two different regular expression algorithms.
According to the article, old Unix tools like ed, sed, grep, egrep, awk, and lex all use what's called the Thompson NFA algorithm in their regular expressions...
However, newer tools like Java, Perl, PHP, and Python all use a different algorithm for their regular expressions, one that is much, much slower.
This article makes no mention at all of JavaScript's regex algorithm (and yes, I know there are various JS engines out there), but I was wondering if anybody knew which of those algorithms they use, and whether those algorithms should perhaps be swapped out for Thompson NFA.
3 answers
#1
6
The ECMAScript language specification doesn't require any particular implementation of regular expressions, so that part of the question isn't well-formed. You're really asking about the particular implementation in a particular browser.
The reason Perl, Python, etc. use a slower algorithm, though, is that the regex language they define isn't really regular expressions. A real regular expression can be expressed as a finite state machine, but the language of regex is context free. That's why the fashion is to just call it "regex" instead of talking about regular expressions.
Update
Yes, in fact JavaScript regex isn't regular. Consider the syntax using `{n,m}', that is, matching from n to m repetitions of an accepted regex. Let d be the difference d=|n-m|. The syntax means there exists a string u x^d w that is acceptable, but a string u x^k w with k>d that is not. It follows via the pumping lemma for regular languages that this is not a regular language.
(augh. Thinko corrected.)
#2
6
Though the ECMA standard does not specify the algorithm an ECMAScript implementation should use, the fact that the standard mandates that ECMAScript regular expressions must support backreferences (\1, \2, etc.) rules out the DFA and "Thompson NFA" implementations.
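To make the backreference point concrete, here is a minimal sketch in Python, whose `re` module also supports `\1`. The pattern and the helper name `is_mirrored` are my own illustrative assumptions, not something from the standard. It accepts strings of the form a^n b a^n, which a finite automaton cannot recognize because it would have to remember an unbounded n.

```python
import re

# (a*) captures a run of a's; \1 demands exactly the same run again after the b.
# Hypothetical pattern, chosen only to illustrate a backreference.
MIRRORED = re.compile(r"(a*)b\1")


def is_mirrored(s: str) -> bool:
    """Return True if s has the form a^n b a^n for some n >= 0."""
    return MIRRORED.fullmatch(s) is not None


if __name__ == "__main__":
    for candidate in ["b", "aba", "aabaa", "aaba", "abaa"]:
        print(candidate, is_mirrored(candidate))
```

The first three candidates have equal runs of a's on both sides of the b and are accepted; the last two are rejected because the runs differ in length.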
#3
3
Perl uses a memoized recursive backtracking search and, as of some improvements in 5.10, no longer blows up on perl -e '("a" x 100000) =~ /^(ab?)*$/;'. In recent tests I performed on an OS X box, Perl 5.10 outperformed awk, even in the cases where awk's algorithm was supposed to be better.
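For comparison, the state-set style of matching that the question attributes to Thompson NFA can be sketched in a few lines of Python. This is a hand-written automaton for the pattern ^(ab?)*$ used above, not a regex compiler, and the names `TRANSITIONS`, `ACCEPTING`, and `matches` are made up for illustration. Each input character is examined once, so there is nothing to backtrack over.

```python
# Hand-built automaton for ^(ab?)*$ (an assumption for illustration; a real
# Thompson construction would derive this from the pattern automatically).
# State 0: at the start, or just consumed a "b".
# State 1: just consumed an "a".
TRANSITIONS = {
    (0, "a"): {1},
    (1, "a"): {1},
    (1, "b"): {0},
}
ACCEPTING = {0, 1}


def matches(text: str) -> bool:
    """Simulate the automaton by tracking the set of live states."""
    current = {0}
    for ch in text:
        # Advance every live state in lockstep; no state is ever retried.
        current = {
            nxt
            for state in current
            for nxt in TRANSITIONS.get((state, ch), ())
        }
        if not current:  # no live states left, so no match is possible
            return False
    return bool(current & ACCEPTING)


if __name__ == "__main__":
    print(matches("a" * 100000))
    print(matches("abaab"))
    print(matches("abb"))
```

Tracking a set of states is overkill for an automaton this small, but it is the part that generalizes: with a nondeterministic machine the set can hold several states at once, and the work per character stays bounded by the number of states.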