如何排序K排序数组，用归并排序?

I know that this question has been asked, and there is a very nice elegant solution using a min heap.

我知道这个问题已经被问过了，而且有一个非常漂亮的、使用最小堆的解决方案。

MY question is how would one do this using the merge function of merge sort.

我的问题是如何使用归并排序的合并函数。

You already have an array of sorted arrays. So you should be able to merge all of them into one array in O(nlog K) time, correct?

你已经有了数组的有序数组。所以你应该可以把它们合并到一个数组中O(nlogk)时间，对吧?

I just can't figure out how to do this!

我就是想不出怎么做!

Say I have

说我有

[ [5,6], [3,4], [1,2], [0] ]

[[5,6]，[3,4]，[1,2]，[0]]

Step 1: [ [3,4,5,6], [0,1,2] ]

步骤1:[[3,4,5,6]，[0,1,2]]

Step2: [ [0,1,2,3,4,5,6] ]

步骤2:[[0,1,2,3,4,5,6]]

Is there a simple way to do this? Is O(nlog K) theoretically achievable with mergesort?

有没有一种简单的方法?O(nlog K)理论上可以与归并排序吗?

5 个解决方案

#1

As others have said, using the min heap to hold the next items is the optimal way. It's called an N-way merge. Its complexity is O(n log k).

正如其他人所说，使用最小堆来保存下一个项目是最理想的方法。它叫做n路归并。它的复杂度是O(n log k)

You can use a 2-way merge algorithm to sort k arrays. Perhaps the easiest way is to modify the standard merge sort so that it uses non-constant partition sizes. For example, imagine that you have 4 arrays with lengths 10, 8, 12, and 33. Each array is sorted. If you concatenated the arrays into one, you would have these partitions (the numbers are indexes into the array, not values):

您可以使用2路合并算法对k数组进行排序。也许最简单的方法是修改标准的合并排序，以便它使用非恒定的分区大小。例如，假设您有4个数组，长度为10、8、12和33。每个数组进行排序。如果将数组连接到一个数组中，就会有这些分区(数字是数组中的索引，而不是值):

[0-9][10-17][18-29][30-62]

The first pass of your merge sort would have starting indexes of 0 and 10. You would merge that into a new array, just as you would with the standard merge sort. The next pass would start at positions 18 and 30 in the second array. When you're done with the second pass, your output array contains:

您的合并排序的第一次传递将会有0和10的开始索引。您可以将其合并到一个新的数组中，就像使用标准的合并排序一样。下一个通道将从第18和第30个位置开始。当你完成第二遍时，你的输出数组包含:

[0-17][18-62]

Now your partitions start at 0 and 18. You merge those two into a single array and you're done.

现在分区从0和18开始。将它们合并到一个数组中，就完成了。

The only real difference is that rather than starting with a partition size of 2 and doubling, you have non-constant partition sizes. As you make each pass, the new partition size is the sum of the sizes of the two partitions you used in the previous pass. This really is just a slight modification of the standard merge sort.

唯一真正的区别是，不是以2和2的分区大小开始，而是有非恒定的分区大小。当您完成每个传递时，新的分区大小是您在前一个传递中使用的两个分区的大小的总和。这只是对标准归并排序的一个小小的修改。

It will take log(k) passes to do the sort, and at each pass you look at all n items. The algorithm is O(n log k), but with a much higher constant than the N-way merge.

它将使用log(k)传递来完成排序，并且在每个传递过程中查看所有n项。该算法是O(n log k)，但它的常数比n -way的合并要高得多。

For implementation, build an array of integers that contains the starting indexes of each of your sub arrays. So in the example above you would have:

对于实现，构建包含每个子数组的起始索引的整数数组。在上面的例子中

int[] partitions = [0, 10, 18, 30];
int numPartitions = 4;

Now you do your standard merge sort. But you select your partitions from the partitions array. So your merge would start with:

现在进行标准归并排序。但是从分区数组中选择分区。所以你的合并会从:

merge (inputArray, outputArray, part1Index, part2Index, outputStart)
{
    part1Start = partitions[part1Index];
    part2Start = partitions[part2Index];

    part1Length = part2Start - part1Start;
    part2Length = partitions[part2Index-1] - part2Start;

    // now merge part1 and part2 into the output array,
    // starting at outputStart
}

And your main loop would look something like:

主回路大概是这样的

while (numPartitions > 1)
{
    for (int p = 0; p < numPartitions; p += 2)
    {
        outputStart = partitions[p];
        merge(inputArray, outputArray, p, p+1, outputStart);
        // update partitions table
        partitions[p/2] = partitions[p] + partitions[p+1];
    }
    numPartitions /= 2;
}

That's the basic idea. You'll have to do some work to handle the dangling partition when the number is odd, but in general that's how it's done.

这是最基本的想法。当数字是奇数时，你需要做一些工作来处理悬挂的分区，但一般来说，这就是它的工作方式。

You can also do it by maintaining an array of arrays, and merging each two arrays into a new array, adding that to an output array of arrays. Lather, rinse, repeat.

您还可以通过维护数组数组来实现这一点，并将每两个数组合并到一个新的数组中，并将其添加到数组的输出数组中。泡沫,冲洗,重复。

#2

You should note that when we say complexity is O(n log k), we assume that n means TOTAL number of elements in ALL of k arrays, i.e. number of elements in a final merged array.

您应该注意到，当我们说复杂度是O(n log k)时，我们假设n表示所有k数组中元素的总数，即最终合并数组中的元素个数。

For example, if you want to merge k arrays that contain n elements each, total number of elements in final array will be nk. So complexity will be O(nk log k).

例如，如果您想要合并包含n个元素的k数组，那么最终数组中的元素总数将是nk。复杂度是O(nk log k)

#3

There different ways to merge arrays. To accoplish that task in N*Log(K) time you can use a structure called Heap (it is good structure to implement priority queue). I suppose that you already have it, if you don’t then pick up any available implementation: http://en.wikipedia.org/wiki/Heap_(data_structure) Then you can do that like this:

有不同的方法来合并数组。为了在N*Log(K)时间内完成这个任务，您可以使用一个名为堆的结构(它是实现优先队列的良好结构)。我想你已经有了它，如果你没有找到任何可用的实现:http://en.wikipedia.org/wiki/Heap_(data_structure)，那么你可以这样做:

1.  We have A[1..K] array of arrays to sort, Head[1..K] - current pointer for every array and Count[1..K] - number of items for every array.
2.  We have Heap of pairs (Value: int; NumberOfArray: int) - empty at start.
3.  We put to the heap first item of every array - initialization phase.
4.  Then we organize cycle:
5.  Get pair (Value, NumberOfArray) from the heap. 
6.  Value is next value to output.
7.  NumberOfArray – is number of array where we need to take next item (if any) and place to the heap.
8.  If heap is not empty, then repeat from step 5

So for every item we operate only with heap built from K items as maximum. It mean that we will have N*Log(K) complexity as you asked.

因此，对于每一个项目，我们只使用从K项构建的堆作为最大值。这意味着我们将会有N*Log(K)复杂度。

#4

I implemented it in python. The main idea is similar to mergesort. There are k arrays in lists. In function mainMerageK, just divide lists (k) into left (k/2) and right (k/2). Therefore, the total count of partition is log(k). Regarding function merge, it is easy to know the runtime is O(n). Finally, we get O(nlog k) By the way, it also can be implemented in min heap, and there is a link: Merging K- Sorted Lists using Priority Queue

我用python实现了它。其主要思想与归并排序相似。列表中有k个数组。在函数mainMerageK中，只将列表(k)分为左(k/2)和右(k/2)。因此，分区的总计数是log(k)。关于函数合并，很容易知道运行时是O(n)。最后，我们得到O(nlog k)，它也可以在最小堆中实现，还有一个链接:使用优先队列合并k排序的列表。

def mainMergeK(*lists):
    # implemented by k-way partition
    k = len(lists)
    if k > 1:
        mid = int(k / 2)
        B = mainMergeK(*lists[0: mid])
        C = mainMergeK(*lists[mid:])
        A = merge(B, C)
        print B, ' + ', C, ' = ', A
        return A
    return lists[0]

def merge(B, C):
    A = []
    p = len(B)
    q = len(C)
    i = 0
    j = 0
    while i < p and j < q:
        if B[i] <= C[j]:
            A.append(B[i])
            i += 1
        else:
            A.append(C[j])
            j += 1
    if i == p:
        for c in C[j:]:
            A.append(c)
    else:
        for b in B[i:]:
            A.append(b)
    return A

if __name__ == '__main__':
    x = mainMergeK([1, 3, 5], [2, 4, 6], [7, 8, 10], [9])
    print x

The output likes below:

输出喜欢下面:

[1, 3, 5]  +  [2, 4, 6]  =  [1, 2, 3, 4, 5, 6]
[7, 8, 10]  +  [9]  =  [7, 8, 9, 10]
[1, 2, 3, 4, 5, 6]  +  [7, 8, 9, 10]  =  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

#5

Just do it like a 2-way merge except with K items. Will result in O(NK). If you want O(N logK) you will need to use a min-heap to keep track of the K pointers(with source array as a metadata) in the algorithm below:

就像双向合并一样，除了K项。将导致O(NK)。如果您想要O(N logK)，您需要使用一个mini -heap来跟踪以下算法中的K指针(源数组作为元数据):

Keep an array of K elements - i.e K pointers showing position in each array. Mark all K elements are valid.

保留K个元素的数组- i。e K指针显示每个数组中的位置。标记所有K个元素是有效的。

loop: Compare values in K pointers that are valid. if the value is minimum, select least index pointer and increment it into the next value in the array. If incremented value has crossed it's array, mark it invalid. Add the least value into the result. Repeat till all K elements are invalid.

循环:比较有效的K指针中的值。如果值是最小值，则选择最小索引指针，并将其添加到数组的下一个值中。如果递增的值越过了它的数组，那么标记为无效。在结果中添加最小值。重复，直到所有K元素无效。

For example,:

例如,:

Positions        Arrays
p1:0  Array 1:  0  5  10  
p2:3  Array 2:  3  6   9
p3:2  Array 3:  2  4  6

Output (min of 0,3,2)=> 0. So output is {0}

输出(0,3,2)=> 0。所以输出是{ 0 }

      Array
p1:5    0  5  10
p2:3    3  6   9
p3:2    2  4  6

Output (min of 5,3,2)=> 2. So {0,2}

输出(5,3,2)=> 2。{ 0,2 }

       Array
p1:5    0  5  10
p2:3    3  6  9
p3:4    2  4  6

Output (min of 5,3,4)=>3. So {0,2,3} ..and so on..until you come to a state where output is {0,2,3,4,5,6}

输出(5分钟、3、4)= > 3。所以{ 0 2 3 } . .等等. .直到您到达输出{0、2、3、4、5、6}的状态

   Array
p1:5    0  5  10
p2:9    3  6  9
p3:6    2  4  6

Output (min of 5,9,6)=>6. So {0,2,3,4,5,6}+{6} when you mark p3 as "invalid" as you have exhausted the array. (or if you are using a min-heap you will simply remove the min-item, get it's source array metadata: in this case array 3, see that it's done so you will not add anything new to the min-heap)

输出(5分钟,9日,6)= > 6。当你把p3标记为“无效”时，{0,2,3,4,5,6}+{6}，因为你已经耗尽了数组。(或者，如果您使用的是min-heap，您将简单地删除minitem，获取它的源数组元数据:在本例中，您可以看到它已经完成了，所以您不会添加任何新到min-heap的内容)

#1