Finding the intersection of two sorted arrays, in some cases with fewer than O(m+n) comparisons

Here is one way of doing this in O(m+n), where m and n are the lengths of the two arrays:

import random

def comm_seq(arr_1, arr_2):
    if len(arr_1) == 0 or len(arr_2) == 0:
        return []

    m = len(arr_1) - 1
    n = len(arr_2) - 1

    if arr_1[m] == arr_2[n]:
        return comm_seq(arr_1[:-1], arr_2[:-1]) + [arr_1[m]]

    elif arr_1[m] < arr_2[n]:
        return comm_seq(arr_1, arr_2[:-1])

    elif arr_1[m] > arr_2[n]:
        return comm_seq(arr_1[:-1], arr_2)


if __name__ == "__main__":
    # Python 3: range() and print() replace xrange() and the print statement.
    arr_1 = [random.randrange(0, 5) for _ in range(10)]
    arr_2 = [random.randrange(0, 5) for _ in range(10)]
    arr_1.sort()
    arr_2.sort()
    print(comm_seq(arr_1, arr_2))

Is there a technique that in some cases uses fewer than O(m+n) comparisons? For example: arr_1=[1,2,2,2,2,2,2,2,2,2,2,100] and arr_2=[1,3,100]

(Not looking for the hash table implementation)

4 Solutions

#1

Binary search takes O(log m) time to find a number in a sorted array of length m. So if we search for each element of an array of length n in an array of length m, the overall time complexity is O(n log m). If m is much greater than n, O(n log m) is actually less than O(m+n), so in that situation we can implement a better solution based on binary search.

However, this does not necessarily mean that binary search always beats the O(m+n) approach. In fact, the binary search approach is only better when n << m (n is very small compared to m).
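
A minimal sketch of the binary search approach described above, assuming both inputs are sorted lists and that duplicates should be paired up the same way as in the question's code. The function name `intersect_by_binary_search` is made up for illustration; `bisect_left` comes from Python's standard `bisect` module.

```python
from bisect import bisect_left

def intersect_by_binary_search(arr_1, arr_2):
    """Intersect two sorted lists by binary-searching the larger one."""
    if len(arr_1) <= len(arr_2):
        small, large = arr_1, arr_2
    else:
        small, large = arr_2, arr_1

    result = []
    lo = 0  # everything in large[:lo] is already matched or too small
    for x in small:
        # First index >= lo whose value is >= x.
        lo = bisect_left(large, x, lo)
        if lo == len(large):
            break  # nothing left in the larger array
        if large[lo] == x:
            result.append(x)
            lo += 1  # consume this occurrence so duplicates pair one-to-one
    return result
```

Each element of the smaller list costs one binary search, so this does about n log m comparisons. That only pays off when n is much smaller than m, as noted above.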

#2

As far as I know, there are a few different ways to solve this problem, but none of them are better than O(m + n). I don't know how you can have an algorithm faster than that (barring weird quantum computing answers), because you have to compare all the elements in both arrays or you might miss a duplicate.

Brute Force: Use two nested for loops. Take every element from the first array and linear search for it in the second array. O(M*N) time, O(1) space.

Map Lookup: Use a lookup structure like a hashtable or a binary search tree. Put all of the first array into the map structure, then loop through all of the second array and look up each element in the map to see if it exists. This works whether the arrays are sorted or not. O(M*log(M) + N*log(M)) time for a binary search tree or O(M + N) time for a hashtable; both use O(M) space.

Binary Search: Like brute force, but take every element from the first array and binary search for it in the second array. O(M*log(N)) time, O(1) space.

Parallel Walk: Like the merge part of merge sort. Have two pointers start at the front of each of the arrays. Compare the two elements: if they're equal, store the duplicate and advance both pointers; otherwise advance the pointer at the smaller value by one spot. Repeat until you hit the end of one of the arrays. O(M + N) time, O(1) space.
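
The parallel walk can be sketched in a few lines. This assumes both lists are already sorted; the name `intersect_parallel_walk` is hypothetical. The O(1) space figure above does not count the output list.

```python
def intersect_parallel_walk(arr_1, arr_2):
    """Intersect two sorted lists with two pointers, like a merge step."""
    i = j = 0
    result = []
    while i < len(arr_1) and j < len(arr_2):
        if arr_1[i] == arr_2[j]:
            # Equal values: record the match and advance both pointers.
            result.append(arr_1[i])
            i += 1
            j += 1
        elif arr_1[i] < arr_2[j]:
            i += 1  # arr_1 holds the smaller value, so move its pointer
        else:
            j += 1  # arr_2 holds the smaller value, so move its pointer
    return result
```

Every iteration advances at least one pointer, so the loop runs at most M + N times.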

Regardless, you must examine every element in both arrays or you won't know if you've found all the duplicates. You could argue fringe cases where one array is a lot bigger or a lot smaller, but that won't hold for an algorithm where you're considering all ranges of input.

#3

You can use a hash table to store the large array, and then scan the small array to compute the intersection of the two arrays.

import random

def comm_seq(arr_1, arr_2):
    # Make arr_1 the larger array so that it is the one stored in the hash table.
    if len(arr_1) < len(arr_2):
        arr_1, arr_2 = arr_2, arr_1

    # Count the occurrences of each value in the larger array.
    cnt = {}
    for item in arr_1:
        cnt.setdefault(item, 0)
        cnt[item] += 1

    # Scan the smaller array; each hit consumes one occurrence from the table.
    ret = []
    for item in arr_2:
        p = cnt.get(item, 0)
        if p:
            ret.append(item)
            cnt[item] -= 1
    return ret

if __name__ == "__main__":
    arr_1 = [random.randrange(0, 5) for _ in range(10)]
    arr_2 = [random.randrange(0, 5) for _ in range(10)]
    arr_1.sort()
    arr_2.sort()
    print(comm_seq(arr_1, arr_2))

If we treat Python dictionary operations as O(1), the scan of the small array takes O(min(n, m)), but building the table takes time proportional to the larger array, so the total complexity is O(m + n).

#4

An algorithm with O(N*log(M/N)) comparisons is possible if you use a combination of one-sided and normal binary search. In the worst case (when both arrays are of equal size) this is equal to O(N) = O(M + N) comparisons. Here M is the size of the larger array and N is the number of distinct elements in the smaller array.

Take the smaller of the two arrays and search for each of its elements in the other array. Start with a one-sided binary search: try positions M/N, 2*M/N, 4*M/N, ... until an element larger than the one you are looking for is found. Then use normal binary search to find the element between positions 0 and 2^k*M/N.

If a matching element is found, use the same combination of one-sided and normal binary search to find where the run of duplicate matching elements ends, and copy the appropriate number of matching elements to the output. You can use the same combination of binary searches to count the number of duplicate elements in the smaller array, and take the minimum of these duplicate counts to determine how many elements should be in the result.

To continue with the next element from the smaller array, start in the larger array at the position where the previous step ended.
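
The steps above can be sketched roughly as follows. This is one possible reading of the description, not a reference implementation. The names `gallop` and `intersect_galloping` are hypothetical, and the first probe distance uses `len(large) // len(small)` as a stand-in for M/N (the answer defines N as the number of distinct elements, which is not known up front).

```python
from bisect import bisect_left, bisect_right

def gallop(arr, target, start, first_step=1, right=False):
    """One-sided search followed by a normal binary search.

    Returns the first index i >= start with arr[i] >= target,
    or with arr[i] > target when right=True (len(arr) if there is none).
    """
    n = len(arr)
    lo = start                      # arr[start:lo] is known to lie before the answer
    offset = max(1, first_step)
    hi = start + offset
    # Probe start + step, start + 2*step, start + 4*step, ...
    while hi < n and (arr[hi] <= target if right else arr[hi] < target):
        lo = hi + 1
        offset *= 2
        hi = start + offset
    # Normal binary search inside the bracket found by the probes.
    search = bisect_right if right else bisect_left
    return search(arr, target, lo, min(hi, n))

def intersect_galloping(arr_1, arr_2):
    """Intersect two sorted lists, keeping min(count) copies of each value."""
    if len(arr_1) <= len(arr_2):
        small, large = arr_1, arr_2
    else:
        small, large = arr_2, arr_1
    if not small:
        return []

    first_step = max(1, len(large) // len(small))  # assumption: stands in for M/N
    result = []
    i = j = 0
    while i < len(small) and j < len(large):
        x = small[i]
        i_end = gallop(small, x, i, 1, right=True)   # end of the run of x in small
        j = gallop(large, x, j, first_step)          # first value >= x in large
        if j < len(large) and large[j] == x:
            j_end = gallop(large, x, j, 1, right=True)  # end of the run of x in large
            result.extend([x] * min(i_end - i, j_end - j))
            j = j_end
        i = i_end
    return result
```

For the question's example, `intersect_galloping([1,2,2,2,2,2,2,2,2,2,2,100], [1,3,100])` returns `[1, 100]` and skips over the run of 2s with a handful of probes instead of visiting each one.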
