Regular expressions perform worse in Ruby than in JavaScript

Recently, I started using the email validation regular expression from the jQuery Validation plugin in my Rails models.

EMAIL_REGEXP=/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))$/i

"aaaaa.bbbbbb.ccccc.ddddd@gmail.com".match EMAIL_REGEXP  # returns immediately
"aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # takes a long time

The regular expression takes a long time when an invalid email address has many dot-separated tokens (e.g. first.middle.last@gmail). The same expression runs without any noticeable delay in JavaScript.
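
One way to put numbers on the delay is to time each match attempt with Benchmark from the standard library. This is only a sketch: time_match is a hypothetical helper, and SIMPLE_EMAIL is a short stand-in pattern used for illustration, not the jQuery expression, so substitute EMAIL_REGEXP to reproduce the slow case.

```ruby
require 'benchmark'

# Hypothetical helper: returns [matched?, elapsed seconds] for one match attempt.
def time_match(pattern, input)
  matched = nil
  elapsed = Benchmark.realtime { matched = !pattern.match(input).nil? }
  [matched, elapsed]
end

# Illustrative stand-in pattern (an assumption, not the jQuery expression).
SIMPLE_EMAIL = /\A[^@\s]+@[^@\s.]+(\.[^@\s.]+)+\z/

addresses = [
  "aaaaa.bbbbbb.ccccc.ddddd@gmail.com",
  "aaaaa.bbbbbb.ccccc.ddddd@gmail"
]

addresses.each do |address|
  matched, elapsed = time_match(SIMPLE_EMAIL, address)
  puts "#{address}: matched=#{matched}, #{'%.6f' % elapsed}s"
end
```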

Why is there such a difference in performance between Ruby and JavaScript regular expression parsers? Is there anything I can do to improve the response time?

I am on Ruby 1.8.7. I don't see the same issue on Ruby 1.9.2.

Note

I know the regexp is long. Since jQuery uses it, I thought I would use it too. I can always change it back to a simpler regexp as shown here. My question is mainly about why the same regular expression is so much faster in JS.

Reference:

JQuery Validation Plugin Source

Sample form with jQuery email validation

2 Answers

#1

I don't know why the regex engine in 1.8.7 is so much slower than the one in JS or Oniguruma in 1.9.2, but this particular regex may benefit from wrapping its prefix, up to and including the @ symbol, in an atomic group, like this:

EMAIL_REGEXP = /
  ^
  (?>(( # atomic group start
    ([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+
    (\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*
   )
   |
   (
     (\x22)
     (
       (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
       (
         ([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
         |
         (\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))
       )
     )*
     (((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?
     (\x22)
   )
  )
  @) # atomic group end
  (
    (
      ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      |
      (
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
        ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
        ([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      )
    )
    \.
  )+
  (
    ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    |
    (
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
      ([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*
      ([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])
    )
  )
  $
  /xi

puts "aaaaa.bbbbbb.ccccc.ddddd@gmail.com".match EMAIL_REGEXP  # returns immediately
puts "aaaaa.bbbbbb.ccccc.ddddd@gmail".match EMAIL_REGEXP  # no match; returns much faster with the atomic group

In this case, the atomic group should prevent the engine from returning to the first part of the string when matching the part after the @ symbol fails, and it gives a significant speed-up. However, I'm not 100% sure that it doesn't break the regexp's logic, so I'd appreciate any comments.
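
To make the concern about breaking the logic concrete, here is a minimal sketch with toy patterns (illustrative only, not taken from the email expression). An atomic group changes the result when the following part needs a character the group has already consumed, and leaves it unchanged when the group ends on a delimiter that the repeated token cannot match. Whether the second situation holds for every branch of the full email expression is not verified here.

```ruby
# Toy patterns for illustration; none of them come from the email regexp.
BACKTRACKING  = /\Aa+ab\z/        # a+ may give back characters it consumed
ATOMIC_PREFIX = /\A(?>a+)ab\z/    # (?>...) keeps whatever a+ consumed
ATOMIC_AT     = /\A(?>a+@)b\z/    # group ends on "@", which a+ can never match

p "aaab" =~ BACKTRACKING    # => 0   (a+ backs off one "a" so "ab" can match)
p "aaab" =~ ATOMIC_PREFIX   # => nil (a+ took every "a" and cannot back off)
p "aaa@b" =~ ATOMIC_AT      # => 0   (nothing needed from inside the group)
```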

Another option is to use non-capturing groups, which should generally be faster when you don't need backreferences to the groups, but they don't give any noticeable improvement in this case.
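
For reference, a non-capturing group is written (?:...). The short sketch below uses two made-up patterns (assumptions for illustration, much simpler than the email expression) to show that both forms accept the same input and differ only in whether the groups are recorded.

```ruby
# Made-up patterns for illustration only.
CAPTURING     = /\A(\w+)@(\w+)\.(\w+)\z/
NON_CAPTURING = /\A(?:\w+)@(?:\w+)\.(?:\w+)\z/

p "user@example.com".match(CAPTURING).captures      # => ["user", "example", "com"]
p "user@example.com".match(NON_CAPTURING).captures  # => []
```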

#2

The problem might be that your regexp contains greedy quantifiers, which make Ruby try every possible combination before giving up. The solution might be to use possessive quantifiers, which would make the lookup much faster, but they change the regexp so that some strings will no longer match. A short example (from Wikipedia):

'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a+a+)/   # => 0 (match)
'aaaaaaaaaaaaaaaaaaaaaaaaa' =~ /(a++a+)/  # => nil (no match)

The difference is in the matching process: with greedy quantifiers the engine backtracks when there is no match, whereas with possessive quantifiers it never backtracks.
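
A possessive quantifier behaves like the same quantifier wrapped in an atomic group, which ties this answer to the first one. The sketch below assumes Ruby 1.9 or later, where a++ is parsed as a possessive quantifier. The constant names are made up for illustration.

```ruby
# Assumes Ruby 1.9+ (possessive quantifier support); names are illustrative.
SUBJECT = "a" * 25

GREEDY            = /a+a+/      # first a+ can give back one "a" to the second
POSSESSIVE        = /a++a+/     # a++ keeps every "a", so the second a+ finds nothing
ATOMIC_EQUIVALENT = /(?>a+)a+/  # the same idea written as an atomic group

p SUBJECT =~ GREEDY             # => 0
p SUBJECT =~ POSSESSIVE         # => nil
p SUBJECT =~ ATOMIC_EQUIVALENT  # => nil
```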
