What is the fastest way to check if a string matches a regular expression in Ruby?

My problem is that I have to "egrep" through a huge list of strings to find the ones that match a regexp given at runtime. I only care about whether the string matches the regexp, not where it matches, nor what the matching groups contain. I hope this assumption can be used to reduce the time my code spends matching regexps.

I load the regexp with

pattern = Regexp.new(ptx).freeze

I have found that string =~ pattern is slightly faster than string.match(pattern).

Are there other tricks or shortcuts that can be used to make this test even faster?

7 answers

#1

Starting with Ruby 2.4.0, you may use Regexp#match?:

pattern.match?(string)

Regexp#match? is explicitly listed as a performance enhancement in the release notes for 2.4.0, as it avoids object allocations performed by other methods such as Regexp#match and =~:

Regexp#match?
Added Regexp#match?, which executes a regexp match without creating a back reference object and changing $~ to reduce object allocation.
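
For the "egrep through a list" case in the question, this could look roughly like the sketch below. It assumes Ruby 2.4 or later; matching_lines, SAMPLE and DIGIT are made-up names for illustration and do not come from the question.

```ruby
# Hypothetical helper: keep only the strings that match the pattern.
# Regexp#match? returns true/false and does not set $~.
def matching_lines(strings, pattern)
  strings.select { |s| pattern.match?(s) }
end

# Illustrative data; in the question the pattern text arrives at runtime.
SAMPLE = ["foo123", "bar", "baz9"].freeze
DIGIT  = Regexp.new('\d').freeze

p matching_lines(SAMPLE, DIGIT)   # => ["foo123", "baz9"]
```

Building the Regexp once, outside the loop, matters more than which matching method is used if the pattern would otherwise be recompiled for every string.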

#2

This is a simple benchmark:

require 'benchmark'

"test123" =~ /1/
=> 4
Benchmark.measure{ 1000000.times { "test123" =~ /1/ } }
=>   0.610000   0.000000   0.610000 (  0.578133)

"test123"[/1/]
=> "1"
Benchmark.measure{ 1000000.times { "test123"[/1/] } }
=>   0.718000   0.000000   0.718000 (  0.750010)

"test123".match(/1/)
=> #<MatchData "1">
Benchmark.measure{ 1000000.times { "test123".match(/1/) } }
=>   1.703000   0.000000   1.703000 (  1.578146)

So =~ is faster, but it depends on what you want as the return value. If you just want to check whether the text matches a regex, use =~.

#3

This is the benchmark I have run after finding some articles around the net.

With 2.4.0 the winner is re.match?(str) (as suggested by @wiktor-stribiżew). On previous versions, re =~ str seems to be the fastest, although str =~ re is almost as fast.

#!/usr/bin/env ruby
require 'benchmark'

str = "aacaabc"
re = Regexp.new('a+b').freeze

N = 4_000_000

Benchmark.bm do |b|
    b.report("str.match re\t") { N.times { str.match re } }
    b.report("str =~ re\t")    { N.times { str =~ re } }
    b.report("str[re]  \t")    { N.times { str[re] } }
    b.report("re =~ str\t")    { N.times { re =~ str } }
    b.report("re.match str\t") { N.times { re.match str } }
    if re.respond_to?(:match?)
        b.report("re.match? str\t") { N.times { re.match? str } }
    end
end

Results MRI 1.9.3-p551:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         2.390000   0.000000   2.390000 (  2.397331)
str =~ re         2.450000   0.000000   2.450000 (  2.446893)
str[re]           2.940000   0.010000   2.950000 (  2.941666)
re.match str      3.620000   0.000000   3.620000 (  3.619922)
str.match re      4.180000   0.000000   4.180000 (  4.180083)

Results MRI 2.1.5:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         1.150000   0.000000   1.150000 (  1.144880)
str =~ re         1.160000   0.000000   1.160000 (  1.150691)
str[re]           1.330000   0.000000   1.330000 (  1.337064)
re.match str      2.250000   0.000000   2.250000 (  2.255142)
str.match re      2.270000   0.000000   2.270000 (  2.270948)

Results MRI 2.3.3 (there is a regression in regex matching, it seems):

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re =~ str         3.540000   0.000000   3.540000 (  3.535881)
str =~ re         3.560000   0.000000   3.560000 (  3.560657)
str[re]           4.300000   0.000000   4.300000 (  4.299403)
re.match str      5.210000   0.010000   5.220000 (  5.213041)
str.match re      6.000000   0.000000   6.000000 (  6.000465)

Results MRI 2.4.0:

$ ./bench-re.rb  | sort -t $'\t' -k 2
       user     system      total        real
re.match? str     0.690000   0.010000   0.700000 (  0.682934)
re =~ str         1.040000   0.000000   1.040000 (  1.035863)
str =~ re         1.040000   0.000000   1.040000 (  1.042963)
str[re]           1.340000   0.000000   1.340000 (  1.339704)
re.match str      2.040000   0.000000   2.040000 (  2.046464)
str.match re      2.180000   0.000000   2.180000 (  2.174691)

#4

What about re === str (case compare)?

Since it evaluates to true or false and has no need for storing matches, returning match index and that stuff, I wonder if it would be an even faster way of matching than =~.

OK, I tested this. =~ is still faster, even if you have multiple capture groups; however, === is faster than the other options.
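
If you want to repeat that comparison on your own Ruby, a minimal sketch could look like this. The pattern, string and iteration count are arbitrary choices, time_checks is a hypothetical helper, and the match? line assumes Ruby 2.4 or later. Timings will differ between versions and machines, so no expected numbers are given.

```ruby
require 'benchmark'

# Arbitrary illustrative inputs, not taken from the answer above.
CASE_RE    = Regexp.new('a+b').freeze
CASE_STR   = "aacaabc"
ITERATIONS = 200_000

# Hypothetical helper: wall-clock seconds for each way of testing a match.
def time_checks(re, str, n)
  {
    "re === str"     => Benchmark.realtime { n.times { re === str } },
    "re =~ str"      => Benchmark.realtime { n.times { re =~ str } },
    "re.match?(str)" => Benchmark.realtime { n.times { re.match?(str) } },
  }
end

# Print the candidates from fastest to slowest for this run.
time_checks(CASE_RE, CASE_STR, ITERATIONS).sort_by { |_, secs| secs }.each do |label, secs|
  puts format("%-16s %.4fs", label, secs)
end
```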

BTW, what good is freeze? I couldn't measure any performance boost from it.

#5

Depending on how complicated your regular expression is, you could possibly just use simple string slicing. I'm not sure about the practicality of this for your application or whether or not it would actually offer any speed improvements.

'testsentence'['stsen']
=> 'stsen' # evaluates to true
'testsentence'['koala']
=> nil # evaluates to false
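
If only a yes/no answer is needed for a fixed substring, String#include? expresses the same check and returns a boolean directly. A minimal sketch, where contains_literal? is a made-up wrapper name; this only applies when the runtime pattern is known to be a plain literal rather than a real regexp:

```ruby
# Hypothetical wrapper around String#include? for plain substring checks.
def contains_literal?(haystack, needle)
  haystack.include?(needle)
end

p contains_literal?('testsentence', 'stsen')   # => true
p contains_literal?('testsentence', 'koala')   # => false
```

Whether this is actually faster than a regexp for your data is something to benchmark rather than assume.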

#6

What I am wondering is if there is any strange way to make this check even faster, maybe exploiting some strange method in Regexp or some weird construct.

Regexp engines vary in how they implement searches, but, in general, anchor your patterns for speed, and avoid greedy matches, especially when searching long strings.

The best thing to do, until you're familiar with how a particular engine works, is to do benchmarks and add/remove anchors, try limiting searches, use wildcards vs. explicit matches, etc.
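
As a starting point for that kind of experiment, here is a small sketch comparing an unanchored pattern with a \A-anchored one on a long string. The input, both patterns and the seconds_for helper are made up for illustration. Note that anchoring changes what matches, so it is only an option when you know where the match has to start. No expected timings are given; measure on your own setup.

```ruby
require 'benchmark'

# Illustrative input: a long run of filler followed by the interesting part.
LONG_LINE  = (("x" * 10_000) + "CVE-2018-1589").freeze
UNANCHORED = /CVE-[0-9]{4}-[0-9]{4,}/
ANCHORED   = /\ACVE-[0-9]{4}-[0-9]{4,}/   # must match at the very start
ROUNDS     = 2_000

# Hypothetical helper: wall-clock seconds for repeated match? calls.
def seconds_for(re, str, rounds)
  Benchmark.realtime { rounds.times { re.match?(str) } }
end

puts format("unanchored: %.4fs", seconds_for(UNANCHORED, LONG_LINE, ROUNDS))
puts format("anchored:   %.4fs", seconds_for(ANCHORED, LONG_LINE, ROUNDS))
```

Here the unanchored pattern finds the identifier at the end of the string, while the anchored one rejects the string because it does not start with "CVE-".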

The Fruity gem is very useful for quickly benchmarking things, because it's smart. Ruby's built-in Benchmark code is also useful, though you can write tests that fool you if you are not careful.
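
One way to reduce the risk of being fooled when using the built-in library is Benchmark.bmbm, which runs every report once as a rehearsal before the measured pass, so that warm-up and garbage-collection effects weigh less on whichever report happens to run first. A minimal sketch with arbitrary inputs (the constant names are illustrative, and match? assumes Ruby 2.4 or later):

```ruby
require 'benchmark'

# Arbitrary illustrative inputs.
BM_RE  = Regexp.new('a+b').freeze
BM_STR = "aacaabc"
BM_N   = 100_000

# bmbm prints a rehearsal table and a final table, and returns the final
# measurements as an array of Benchmark::Tms objects, one per report.
BM_RESULTS = Benchmark.bmbm(14) do |x|
  x.report("re =~ str")      { BM_N.times { BM_RE =~ BM_STR } }
  x.report("re.match?(str)") { BM_N.times { BM_RE.match?(BM_STR) } }
end
```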

I've used both in many answers here on Stack Overflow, so you can search through my answers and will see lots of little tricks and results to give you ideas of how to write faster code.

The biggest thing to remember is, it's bad to prematurely optimize your code before you know where the slowdowns occur.

#7

To complement Wiktor Stribiżew's and Dougui's answers, I would add that /regex/.match?("string") is slightly faster than "string".match?(/regex/).

2.4.0 > require 'benchmark'
 => true 
2.4.0 > Benchmark.measure{ 10000000.times { /^CVE-[0-9]{4}-[0-9]{4,}$/.match?("CVE-2018-1589") } }
 => #<Benchmark::Tms:0x005563da1b1c80 @label="", @real=2.2060338060000504, @cstime=0.0, @cutime=0.0, @stime=0.04000000000000001, @utime=2.17, @total=2.21> 
2.4.0 > Benchmark.measure{ 10000000.times { "CVE-2018-1589".match?(/^CVE-[0-9]{4}-[0-9]{4,}$/) } }
 => #<Benchmark::Tms:0x005563da139eb0 @label="", @real=2.260814556000696, @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, @utime=2.2500000000000004, @total=2.2600000000000007>
