Why are String#split("\n") and Array#join(' ') faster than String#gsub(/\n/, ' ')?

I have to remove all newlines from a large number of strings. In benchmarking string.split("\n").join(' ') vs string.gsub(/\n/, ' '), I found that the split and join approach is much quicker, but I have a hard time understanding why. I do not understand how splitting the string into array elements each time it encounters a \n, then joining the array into a new string, could possibly be quicker than just scanning and replacing each \n with a ' '.

sentence = %q[
  Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium,
  totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae
  dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit,
  sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam
  est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius
  modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
  veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea]

Check to verify the output is indeed the same for both methods:

puts sentence.split("\n").join(' ') == sentence.gsub(/\n/, ' ')
#=> true
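
One caveat to the check above: the equality holds for this sample, but String#split drops trailing empty fields by default, so the two approaches diverge for a string that ends in a newline. A minimal sketch (the helper names and sample strings are assumptions for illustration):

```ruby
# Hypothetical helper names, for illustration only.
def via_split_join(str)
  str.split("\n").join(' ')
end

def via_gsub(str)
  str.gsub(/\n/, ' ')
end

inner    = "one\ntwo"    # newline in the middle only
trailing = "one\ntwo\n"  # newline at the very end

puts via_split_join(inner) == via_gsub(inner)        # true
puts via_split_join(trailing) == via_gsub(trailing)  # false: split drops the trailing empty field
p via_split_join(trailing)  # "one two"
p via_gsub(trailing)        # "one two "
```

If the input may end in a newline, the two methods are not interchangeable without extra handling.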

Script used to benchmark:

def split_join_method(string)
  start = Time.now
  1_000_000.times { string.split("\n").join(' ') }
  puts "split_join: #{Time.now - start} s"
end

def gsub_method(string)
  start = Time.now
  1_000_000.times { string.gsub(/\n/, ' ') }
  puts "gsub: #{Time.now - start} s"
end

5.times do
  split_join_method(sentence)
  gsub_method(sentence)
end

Results:

#=> split_join: 6.753057 s
#=> gsub: 14.938358 s
#=> split_join: 6.16101 s
#=> gsub: 14.166971 s
#=> split_join: 5.946168 s
#=> gsub: 13.490355 s
#=> split_join: 5.781062 s
#=> gsub: 13.436135 s
#=> split_join: 5.903052 s
#=> gsub: 15.670774 s

3 Answers

#1

I think gsub takes more time for two reasons:

The first is that using a regex engine has an initial cost, at least to parse the pattern.

The second, and probably the more important here, is that the regex engine walks the string character by character and tests the pattern at each position, whereas split (with a literal string here) uses a fast string search algorithm (probably the Boyer-Moore algorithm).
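
To make the distinction concrete, here is a rough sketch of what a literal replacement looks like when it jumps from hit to hit with String#index instead of running a pattern at every position. The helper name `replace_literal` is hypothetical, and this is not a claim about how Ruby implements split or gsub internally:

```ruby
# Hypothetical helper: replace every occurrence of a literal needle
# by jumping between hits with String#index (no regex involved).
def replace_literal(str, needle, replacement)
  raise ArgumentError, 'needle must not be empty' if needle.empty?

  result = +''   # unfrozen, empty accumulator
  pos = 0
  while (hit = str.index(needle, pos))
    result << str[pos...hit] << replacement   # copy the text before the hit, then the replacement
    pos = hit + needle.length                 # continue searching after the hit
  end
  result << str[pos..-1]                      # copy whatever follows the last hit
end

text = "a\nb\nc"
puts replace_literal(text, "\n", ' ') == text.gsub(/\n/, ' ')  # true
```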

Note that even if the split/join approach is faster, it probably uses more memory, since it has to build an intermediate array of new strings.

Note2: some regex engines are able to use this kind of fast string search before the walk to find candidate positions, but I have no information about whether the Ruby regex engine does so.

Note3: To get a better idea of what happens, it may be interesting to include tests with few repetitions but larger strings. [edit] After several tests with @spickermann's code, it seems that this doesn't change anything (or nothing very significant), even with very few repetitions. So the initial cost may not be that important.
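
A test of that shape could look like the sketch below, which times both approaches on one large string with only a handful of repetitions. The helper name `time_both`, the sample line, and the sizes are all assumptions chosen for illustration; no timings are claimed here.

```ruby
require 'benchmark'

# Hypothetical helper: time both approaches on one string
# for a given number of repetitions; returns seconds per approach.
def time_both(string, reps)
  {
    split_join: Benchmark.realtime { reps.times { string.split("\n").join(' ') } },
    gsub:       Benchmark.realtime { reps.times { string.gsub(/\n/, ' ') } }
  }
end

line  = "lorem ipsum dolor sit amet\n"  # arbitrary sample line
large = line * 10_000                   # one large string; the size is an assumption

time_both(large, 10).each do |name, secs|
  puts format('%-12s %.4f s', "#{name}:", secs)
end
```

Varying the multiplier and the repetition count shows whether the gap comes from per-call setup or from the scan itself.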

#2

Your question is comparing apples and oranges, because you compare a regexp method with a string search operation.

My benchmark cannot reproduce your observation that split combined with join is in general faster than a simple gsub: when both use the same kind of pattern (string or regexp), the gsub version is always faster. The only thing I can confirm is that regexp matches are slower than string searches, which is not very surprising.

By the way, tr is the fastest solution for this kind of problem:

Rehearsal ---------------------------------------------------
string_split:     5.390000   0.100000   5.490000 (  5.480459)
regexp_split:    14.220000   0.160000  14.380000 ( 14.413509)
string_gsub :     3.750000   0.090000   3.840000 (  3.832316)
regexp_gsub :    12.890000   0.130000  13.020000 ( 13.045899)
string_tr   :     2.480000   0.050000   2.530000 (  2.525891)
----------------------------------------- total: 39.260000sec

                      user     system      total        real
string_split:     5.450000   0.090000   5.540000 (  5.543735)
regexp_split:    14.340000   0.190000  14.530000 ( 14.552214)
string_gsub :     4.160000   0.120000   4.280000 (  4.543941)
regexp_gsub :    13.570000   0.200000  13.770000 ( 14.356955)
string_tr   :     2.390000   0.040000   2.430000 (  2.431676)

Code that I used for this benchmark:

require 'benchmark'

@string = %q[
  Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium,
  totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae
  dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit,
  sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam
  est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius
  modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
  veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea]

def string_split
  @string.split("\n").join(' ')
end

def regexp_split
  @string.split(/\n/).join(' ')
end

def string_gsub
  @string.gsub("\n", ' ')
end

def regexp_gsub
  @string.gsub(/\n/, ' ')
end

def string_tr
  @string.tr("\n", ' ')
end

n = 1_000_000
Benchmark.bmbm(15) do |x|
  x.report("string_split:")   { n.times do; string_split; end }
  x.report("regexp_split:")   { n.times do; regexp_split; end }
  x.report("string_gsub :")   { n.times do; string_gsub ; end }
  x.report("regexp_gsub :")   { n.times do; regexp_gsub ; end }
  x.report("string_tr   :")   { n.times do; string_tr   ; end }
end
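
Before comparing timings, it may be worth confirming that all five variants return the same string. The sketch below does this on a short stand-in sample (the sample and the lambda table are assumptions, not part of the benchmark script):

```ruby
sample = "alpha\nbeta\ngamma"  # stand-in for the benchmark string (an assumption)

# The same five expressions as in the benchmark, wrapped in lambdas.
variants = {
  string_split: ->(s) { s.split("\n").join(' ') },
  regexp_split: ->(s) { s.split(/\n/).join(' ') },
  string_gsub:  ->(s) { s.gsub("\n", ' ') },
  regexp_gsub:  ->(s) { s.gsub(/\n/, ' ') },
  string_tr:    ->(s) { s.tr("\n", ' ') }
}

results = variants.transform_values { |fn| fn.call(sample) }
puts results.values.uniq.length == 1  # true when all variants agree
```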

#3

It's because in your gsub code, you are using regular expressions, which are slow for the reasons pointed out in Casimir's answer. Here is proof: if you change

string.gsub(/\n/, ' ')

to

string.gsub("\n", ' ')

then the gsub code is actually faster than the split/join code.
