数据结构:迭代两个数组vs转换为集合并在ruby中执行交叉操作。

Lets say I have a1 and a2:

假设有a1和a2:

a1 = [1,2,3]
a2 = [4,2,5]

To see if a1 shares any elements with a2, I can loop over each and compare each element:

为了查看a1是否与a2共享任何元素，我可以对每个元素进行循环，并对每个元素进行比较:

def intersect?(x,y)
  a1.each do |x|
    a2.each do |y|
      if x == y return true
    end
  end
  false
end

But even easier, (a1.to_set & a2.to_set).present? gives me the same answer.

但更简单,(a1。to_set & a2.to_set .present吗?给我同样的答案。

I'm assuming that the set operation is quicker and more efficient? If this is true, is it still true taking into account to overhead (if any) of the .to_set operation on each array?

我假设集合运算更快更有效?如果这是正确的，那么在每个阵列上的.to_set操作的开销(如果有的话)是否仍然是正确的?

tia

蒂雅

5 个解决方案

#1

Surprisingly the & method of Array is faster than that of Set for quite large collections:

令人惊讶的是，数组的&方法比集合的方法要快得多:

require 'set'
require 'benchmark'
f = 10_000
ar1 = (1..(10*f)).to_a # 100_000 elements
ar2 = ((5*f)..(15*f)).to_a # also 100_000 elements
set1 = ar1.to_set
set2 = ar2.to_set
n = 10

Benchmark.bm(10) do |testcase|
  testcase.report('Array'){ n.times{ ar1 & ar2 } }
  testcase.report('Set'){ n.times{ set1 & set2 } }
end

Result:

结果:

                 user     system      total        real
Array        1.380000   0.030000   1.410000 (  1.414634)
Set          2.310000   0.020000   2.330000 (  2.359317)

#2

steenslag's answer had an interesting observation that array & array was faster that set & set. It looks like most of that penalty appears to be the expense of obtaining keys from the underlying hash of the first set to enumerate on. A hybrid approach using the array for the left side of the operation and set for the right hand is faster yet. If you only want to know if there's any intersection, the same approach with #any? is even quicker:

steenslag的回答有一个有趣的发现，数组和数组比set和set要快。看起来大部分的代价似乎是从第一个要枚举的集合的底层散列中获取键。将数组用于操作的左侧并设置为右侧的混合方法更快。如果你只想知道是否有交集，#any的方法也是一样的?甚至更快:

#!/usr/bin/env ruby

require 'set'
require 'benchmark'

f = 10_000
ar1 = (1..(10*f)).to_a # 100_000 elements
ar2 = ((5*f)..(15*f)).to_a # also 100_000 elements
set1 = ar1.to_set
set2 = ar2.to_set
n = 10

Benchmark.bm(10) do |testcase|
  testcase.report('Array'){ n.times{ ar1 & ar2 } }
  testcase.report('Set'){ n.times{ set1 & set2 } }
  testcase.report('Set2'){ n.times{ ar1.select{ |element| set2.include? element } } }
  testcase.report('Set2present'){ n.times{ ar1.any?{ |element| set2.include? element } } }
end


$ ruby -v => ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin10.8.0]

                user     system      total        real
Array       0.680000   0.030000   0.710000 (  0.720882)
Set         1.130000   0.020000   1.150000 (  1.150571)
Set2        0.430000   0.000000   0.430000 (  0.434957)
Set2present  0.210000   0.010000   0.220000 (  0.220990)

#3

It should be faster for large arrays. Your method runs in O(m*n) time because it has to loop over both arrays. For tables of 3 elements each this is basically negligible, but for larger tables it can be very expensive.

对于大型数组，它应该更快。方法在O(m*n)时间内运行，因为它必须对两个数组进行循环。对于包含3个元素的表，这基本上可以忽略不计，但对于较大的表，这可能非常昂贵。

The second method will use hash lookups which are much faster, but first the arrays have to be put in sets.

第二个方法将使用哈希查找，这要快得多，但是首先必须将数组放在集合中。

What you should do is try both methods using arrays of sizes you expect to see in your application and see which is faster. If they're about the same size you can just pick whichever one you think is clearer.

您应该做的是使用您希望在应用程序中看到的大小的数组来尝试这两种方法，看看哪一种更快。如果它们大小相同，你可以选择你认为更清楚的。

#4

I just want to elaborate upon the excellent answers by steenslag and dbenhur. Specifically, I wanted to know if SortedSet would perform any better. It actually surprised me initially that the Ruby Set type was not implemented as a sorted set, since I come from C++; the STL by default uses an ordered set, and you generally have to specify unordered_set if you don't want ordering.

我只是想详细阐述一下steenslag和dbenhur的优秀答案。具体来说，我想知道SortedSet是否会有更好的表现。实际上，我最初很惊讶，因为我来自c++，所以Ruby Set类型没有被实现为一个排序集。默认情况下，STL使用一个有序集，如果不想排序，通常需要指定unordered_set。

I also wanted to know if the size of the set made a difference, as suggested in some other answers.

我还想知道集合的大小是否会产生影响，就像其他一些答案所暗示的那样。

require 'set'
require 'benchmark'

f = 20 # 10_000
ar1 = (1..(10*f)).to_a # 100_000 elements
ar2 = ((5*f)..(15*f)).to_a # also 100_000 elements
set1 = ar1.to_set
set2 = ar2.to_set
sset1 = SortedSet.new(ar1)
sset2 = SortedSet.new(ar2)
n = 20000 # 10

Benchmark.bm(10) do |testcase|
  testcase.report('Array'){ n.times{ ar1 & ar2 } }
  testcase.report('Set'){ n.times{ set1 & set2 } }
  testcase.report('SortedSet') { n.times{ sset1 & sset2 } }
  testcase.report('Set2'){ n.times{ ar1.select{ |element| set2.include? element } } }
  testcase.report('Set2present'){ n.times{ ar1.any?{ |element| set2.include? element } } }

  testcase.report('SortedSet2'){ n.times{ ar1.select{ |element| sset2.include? element } } }
  testcase.report('SortedSet2present'){ n.times{ ar1.any?{ |element| sset2.include? element } } }
end

Here are the results for f=20; n=20000:

这是f=20的结果;n = 20000:

$ ruby set.rb
                 user     system      total        real
Array        1.950000   0.010000   1.960000 (  1.963030)
Set          3.330000   0.040000   3.370000 (  3.374105)
SortedSet    3.810000   0.040000   3.850000 (  3.860340)
Set2         1.410000   0.010000   1.420000 (  1.427221)
Set2present  0.760000   0.000000   0.760000 (  0.759447)
SortedSet2   1.420000   0.000000   1.420000 (  1.446559)
SortedSet2present  0.770000   0.010000   0.780000 (  0.770939)

And here are the results for f=10000; n=10:

这是f=10000的结果;n = 10:

$ ruby set.rb
                 user     system      total        real
Array        0.910000   0.020000   0.930000 (  0.939325)
Set          1.270000   0.010000   1.280000 (  1.293581)
SortedSet    1.220000   0.010000   1.230000 (  1.229650)
Set2         0.550000   0.000000   0.550000 (  0.552708)
Set2present  0.290000   0.010000   0.300000 (  0.291845)
SortedSet2   0.550000   0.000000   0.550000 (  0.561049)
SortedSet2present  0.330000   0.000000   0.330000 (  0.339950)

So, for large sets, looks like Set does better than SortedSet; and for smaller sets, SortedSet does better than Set. When using the & notation, Array is faster than either. Looks like SortedSet2present performs significantly more efficiently with large sets, whereas Set2present performs more efficiently with small sets.

因此，对于大集合，看起来像集合比SortedSet好;对于较小的集合，SortedSet比Set做得更好。当使用&符号时，数组比这两个都快。看起来SortedSet2present在大集上的执行效率明显更高，而Set2present在小集上的执行效率更高。

Whereas Set is implemented using Hash, SortedSet is an RBTree (implemented in C). In both cases, & is implemented in Ruby rather than C.

而Set是使用散列实现的，SortedSet是一个RBTree(用C实现)。

#5

-1

The truth is with arrays this small, they either will be essentially the same or the lists will be faster than the sets.

事实是，对于这么小的数组，它们要么本质上是相同的，要么列表比集合要快。

A decent set implementation will do set operations faster than you can do list operations BUT there's some overhead. If you want to know what your implementation will do, use big sets/lists and test.

一个好的集合实现会比列表操作更快地完成集合操作，但是会有一些开销。如果您想知道实现要做什么，请使用大集合/列表和测试。

#1