Algorithm to find whether two sets intersect

Time: 2021-08-20 08:22:29

Let's say I have two arrays:

int ArrayA[] = {5, 17, 150, 230, 285};

int ArrayB[] = {7, 11, 57, 110, 230, 250};

Both arrays are sorted and can be any size. I am looking for an efficient algorithm to find out whether the two arrays have any element in common. I just want a true/false answer; I don't care which element is shared or how many.

The naive solution is to loop through each item in ArrayA and do a binary search for it in ArrayB. I believe its complexity is O(m * log n), where m and n are the lengths of ArrayA and ArrayB.

Because both arrays are sorted, it seems like there should be a more efficient algorithm.

I would also like a generic solution that doesn't assume that the arrays hold numbers (i.e. the solution should also work for strings). However, the comparison operators are well defined and both arrays are sorted from least to greatest.

7 solutions

#1


38  

Pretend that you are doing the merge step of a mergesort, but don't send the results anywhere. Each time you compare the next element of each array, if they are equal, there is an intersection. If you get to the end of either source without finding equal elements, there is no intersection.

For example:

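int counterA = 0;
int counterB = 0;
for(;;) {
    // Reached the end of either array without a match: no shared element.
    if(counterA == ArrayA.length || counterB == ArrayB.length)
        return false;
    else if(ArrayA[counterA] == ArrayB[counterB])
        return true;    // found a shared element
    else if(ArrayA[counterA] < ArrayB[counterB])
        counterA++;     // advance past the smaller element
    else if(ArrayA[counterA] > ArrayB[counterB])
        counterB++;
    else
        halt_and_catch_fire();  // unreachable if the comparison operators are well defined
}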

#2


7  

Since someone wondered about the STL: out of the box, the set_intersection algorithm would do more than you want, because it would find all the common values.

#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;
// ...
    int ArrayA[] = {5, 17, 150, 230, 285};
    int ArrayB[] = {7, 11, 57, 110, 230, 250};
    vector<int> intersection;
    // Copy every value common to both arrays into `intersection`.
    set_intersection(ArrayA, ArrayA + sizeof(ArrayA)/sizeof(int),
                     ArrayB, ArrayB + sizeof(ArrayB)/sizeof(int),
                     back_insert_iterator<vector<int> >(intersection));

    return !intersection.empty();

This runs in O(m+n) time, but it requires storing all the common values and doesn't stop when it finds the first one.

Now, by modifying the code from the GNU implementation of the STL, we can get more precisely what you want.

template<typename InputIterator1, typename InputIterator2>
bool
has_intersection(InputIterator1 first1, InputIterator1 last1,
                 InputIterator2 first2, InputIterator2 last2)
{
    while (first1 != last1 && first2 != last2)
    {
        if (*first1 < *first2)
            ++first1;
        else if (*first2 < *first1)
            ++first2;
        else
            return true;
    }
    return false;
}
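
Since the question also asks for something that works for strings, here is a minimal sketch of the same loop with the ordering passed in explicitly. The names has_intersection_by and example_strings_share_element are made up for this example and are not part of the STL. It assumes both ranges are already sorted by the same ordering that comp implements.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper (not part of the STL): the same two-pointer walk as
// has_intersection, but with the ordering supplied by the caller.
// Both ranges must be sorted according to `comp`.
template <typename InputIterator1, typename InputIterator2, typename Compare>
bool has_intersection_by(InputIterator1 first1, InputIterator1 last1,
                         InputIterator2 first2, InputIterator2 last2,
                         Compare comp)
{
    while (first1 != last1 && first2 != last2)
    {
        if (comp(*first1, *first2))
            ++first1;          // *first1 orders before *first2
        else if (comp(*first2, *first1))
            ++first2;          // *first2 orders before *first1
        else
            return true;       // neither orders before the other: equivalent
    }
    return false;
}

// Example use with two string lists sorted in ascending order.
inline bool example_strings_share_element()
{
    const std::vector<std::string> a = {"apple", "kiwi", "pear"};
    const std::vector<std::string> b = {"banana", "kiwi", "plum"};
    return has_intersection_by(a.begin(), a.end(), b.begin(), b.end(),
                               std::less<std::string>());
}
```

Only comp is ever applied to the elements, so any type with a strict weak ordering should work, including arrays sorted in descending order when std::greater is passed.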

#3


4  

If one list is much shorter than the other, binary search is the way to go. If the lists are of similar length and you're happy with O(m+n), a standard "merge" would work. There are fancier algorithms that are more flexible. One paper I've come across in my own searches is:

http://www.cs.uwaterloo.ca/~ajsaling/papers/paper-spire.pdf
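
For the case where one list is much shorter, the binary-search option could look roughly like the sketch below. The function name has_intersection_bsearch and the use of std::vector are assumptions made for this example. It is not the algorithm from the paper.

```cpp
#include <algorithm>
#include <vector>

// Sketch: for each element of the short sorted vector, binary-search the long
// sorted vector. Cost is O(m log n), with m = short_list.size() and
// n = long_list.size(). Only operator< is required of T.
template <typename T>
bool has_intersection_bsearch(const std::vector<T>& short_list,
                              const std::vector<T>& long_list)
{
    // short_list is sorted too, so a match for the current element can never
    // lie before the lower bound found for the previous one: the search
    // window only shrinks.
    auto from = long_list.begin();
    for (const T& value : short_list)
    {
        from = std::lower_bound(from, long_list.end(), value);
        if (from == long_list.end())
            return false;      // everything left in long_list is smaller
        if (!(value < *from))
            return true;       // *from is not less than value, and vice versa
    }
    return false;
}
```

Carrying the lower bound forward is optional. Searching the whole long list each time gives the same answer, just with slightly more work.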

#4


3  

If you don't care about memory consumption, you can achieve good performance by using a hash: build a hash table whose keys are the values of one array, then test each value of the second array against it.
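
A minimal sketch of that idea, assuming C++11 and std::unordered_set (the function name has_intersection_hash is made up for this example):

```cpp
#include <unordered_set>
#include <vector>

// Sketch: put one array's values in a hash set, then probe it with the other's.
// Expected O(m + n) time and O(m) extra memory. It does not need sorted input,
// but T must be hashable (std::hash<T>) and equality-comparable.
template <typename T>
bool has_intersection_hash(const std::vector<T>& a, const std::vector<T>& b)
{
    const std::unordered_set<T> seen(a.begin(), a.end());
    for (const T& value : b)
    {
        if (seen.count(value) != 0)
            return true;       // stop at the first shared value
    }
    return false;
}
```

Building the set from the smaller of the two arrays keeps the extra memory down.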

#5


1  

If you are using C# 3.0 then why not take advantage of LINQ here?

ArrayA.Intersect(ArrayB).Any()

Not only is this generic (it works for any comparable type), but the implementation under the hood is also pretty efficient (it uses a hashing algorithm).

#6


0  

If the range of values is small, you could build a lookup table (one bit per possible value) for one of the lists (time cost = O(N)) and then check, for each value in the other list, whether its bit is set (time cost = O(N)). If the range is large, you could do something similar with a hash table.
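
The small-range case could be sketched as follows. The function name has_intersection_table and the range parameter are assumptions for this example, and the values are assumed to be unsigned integers smaller than range.

```cpp
#include <cstddef>
#include <vector>

// Sketch: values are assumed to be integers in [0, range), with `range`
// small enough that a table of `range` bits fits comfortably in memory.
// Values outside that range are ignored, so the caller must pick a `range`
// that covers every value in both lists.
bool has_intersection_table(const std::vector<unsigned>& a,
                            const std::vector<unsigned>& b,
                            std::size_t range)
{
    std::vector<bool> present(range, false);   // one bit per possible value
    for (unsigned value : a)
        if (value < range)
            present[value] = true;             // mark values seen in a
    for (unsigned value : b)
        if (value < range && present[value])
            return true;                       // b holds a value marked from a
    return false;
}
```

Clearing the table costs time proportional to range as well, which is why this only pays off when the range is small.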

The mergesort trick from Glomek is an even better idea.

#7


0  

Glomek is on the right track, but kinda glossed over the algorithm.

Start by comparing ArrayA[0] to ArrayB[0]. If they are equal, you're done. If ArrayA[0] is less than ArrayB[0], then move to ArrayA[1]. If ArrayA[0] is more than ArrayB[0], then move to ArrayB[1].

Keep stepping through like this until you reach the end of one array or find a match.
