Fixed-size collection that keeps the top (N) values

Time: 2022-02-18 21:43:00

My code processes a huge number of values and I'm looking for an efficient structure to keep track of the top (N) values, where N is less than 10, so collecting ALL numbers then sorting the list and taking the first (N) is probably not the most efficient way.

To do that, I'm building a collection of fixed size N that keeps the top (N) values sorted in descending order. The Add(T value) method of the sorted collection adds the value if the collection is not full, or if value is higher than at least one of the existing values (in which case the last element is removed).

I was able to implement what I wanted using a doubly linked LinkedList<T>, since it has fast insertion and removal, but I was wondering whether a SortedDictionary<TKey, TValue> or a priority queue would be better?

Thank you.

8 answers

#1


3  

The performance may really vary.

For N < 10, any overly complex data structure will likely drag performance down significantly (though perhaps not catastrophically), so I'd use an array to store the items.

Then there are 3 main ways to arrange the items in the array:

  1. sorted is probably the best choice to keep things simple:
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(N) time to insert - but this only happens for items that are in the N best-so-far. And if your input is sufficiently random, the average time will be even lower because most insertions will only move a few elements at the bottom of the top.

  2. unsorted:
    • O(N) time for each input element, that's too much compared to "sorted"

  3. binary heap that implements a priority queue: more complex to implement but maybe even faster than "sorted"
    • constant time to determine whether to insert a new item (compare with lowest)
    • O(log N) time to insert - and this only happens for items that are in the N best-so-far
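
To make the "sorted" option concrete, here is a minimal sketch in Python (used only as a compact illustration; the question is about C#, and the same loop ports directly to a plain array). The name `add_sorted` is hypothetical, and the sketch assumes n >= 1 and values that can be compared with `<`.

```python
def add_sorted(top, n, value):
    """Insert value into `top`, a list kept in descending order with at most n items.

    Returns True if the value was kept. Assumes n >= 1 (an assumption of this sketch).
    """
    # Constant-time rejection: compare with the lowest kept value.
    if len(top) == n and value <= top[-1]:
        return False
    if len(top) < n:
        top.append(value)  # not full yet: grow by one slot
    # Otherwise the last slot (the current lowest) is overwritten by the shift below.
    i = len(top) - 1
    # Shift smaller items one slot down, starting from the bottom.
    while i > 0 and top[i - 1] < value:
        top[i] = top[i - 1]
        i -= 1
    top[i] = value
    return True


top = []
for v in [5, 1, 9, 3, 7, 9, 2]:
    add_sorted(top, 3, v)
print(top)  # [9, 9, 7]
```

Most rejected values cost a single comparison, and an accepted value usually moves only a few slots near the bottom, which is the behaviour described in point 1.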

#2


6  

I would simply use a heap with a limited depth. I do not know whether there already exists a library for that, but it should be easy to implement.
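
As a rough sketch of what a size-capped heap could look like, the following uses Python's standard `heapq` module (a min-heap over a plain list). Python is only an illustration here, not the asker's C#; `add_heap` is a hypothetical helper name, and the sketch assumes n >= 1.

```python
import heapq


def add_heap(heap, n, value):
    """Keep the n largest values in `heap`, a min-heap whose root heap[0] is the lowest kept value."""
    if len(heap) < n:
        heapq.heappush(heap, value)  # still room: just push
        return True
    if value > heap[0]:
        # Pop the current lowest and push the new value in one O(log n) step.
        heapq.heapreplace(heap, value)
        return True
    return False  # not better than the lowest kept value


heap = []
for v in [5, 1, 9, 3, 7, 9, 2]:
    add_heap(heap, 3, v)
print(sorted(heap, reverse=True))  # [9, 9, 7]
```

The heap itself is not fully sorted, so reading the values in descending order needs one final sort of at most n items.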

#3


4  

The main advantage of using a SortedDictionary or SortedList is that you can skip the sorting logic because they handle it for you (e.g. you just have to remove the (n + 1)th element every time you add a value). But on the other hand, adopting that sort of complex structure for 10 elements is like using a nuke to kill a fly...

Maybe the linked list is a good way, and a simple linear comparison for inserting values in order is not much slower than binary search (we are still talking about at most 10 comparisons against ~3; current CPUs won't even feel the difference).

EDIT:

Fixed arrays can be used to build priority queues with binary heaps; that is probably the right way to implement this.

#4


3  

For such a small number, just keep an array. Scan the array, keeping track of the smallest value and its position. If your new number is larger than the smallest one in the set, replace it. You should of course scan for the lowest value once after you insert a number, then just compare new numbers to that and only take action if you have something larger (replace and rescan).
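
The replace-and-rescan idea above might be sketched like this. Python is used purely for illustration, the class name `TopNUnsorted` and its members are hypothetical, and the sketch assumes n >= 1.

```python
class TopNUnsorted:
    """Unsorted buffer of the n largest values, with the position of the smallest cached."""

    def __init__(self, n):
        self.n = n          # assumed to be >= 1
        self.items = []
        self.min_pos = -1   # index of the smallest kept item, valid once the buffer is full

    def _rescan(self):
        # O(n) scan for the position of the smallest kept value.
        self.min_pos = min(range(len(self.items)), key=self.items.__getitem__)

    def add(self, value):
        if len(self.items) < self.n:
            self.items.append(value)
            if len(self.items) == self.n:
                self._rescan()  # buffer just became full: find the smallest once
            return True
        if value <= self.items[self.min_pos]:
            return False        # cheap rejection against the cached smallest
        self.items[self.min_pos] = value  # replace the smallest ...
        self._rescan()                    # ... and rescan
        return True


t = TopNUnsorted(3)
for v in [5, 1, 9, 3, 7, 9, 2]:
    t.add(v)
print(sorted(t.items, reverse=True))  # [9, 9, 7]
```

The rescan only runs when a value is actually accepted, so rejected values still cost one comparison.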

#5


2  

Unless you have a solid reason to do otherwise, I'd use a priority queue.

There is one trick that can simplify the logic quite a bit. Most people's first idea is to look at each incoming item, and insert it into the collection iff the collection contains fewer items than desired, or the new item is larger than the smallest item currently in the collection.

You can simplify things quite a bit if you leave room for one extra item in the collection. Always insert each incoming item into the collection, and then if the collection is too large, remove the smallest item.

While a priority queue is arguably overkill for only 10 items, it keeps the logic simple, and is efficient both in terms of space and time, so if you ever need N=10000 (or whatever) it'll still work nicely.
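
The "always insert, then trim" trick could look roughly like this with Python's standard `heapq` min-heap (an illustration only, not C#; `add_then_trim` is a hypothetical name). The list briefly holds n + 1 items, which is the "room for one extra item" mentioned above.

```python
import heapq


def add_then_trim(heap, n, value):
    """Always insert, then drop the smallest item if the heap grew past n."""
    heapq.heappush(heap, value)
    if len(heap) > n:
        heapq.heappop(heap)  # a min-heap pops its smallest item


heap = []
for v in [5, 1, 9, 3, 7, 9, 2]:
    add_then_trim(heap, 3, v)
print(sorted(heap, reverse=True))  # [9, 9, 7]
```

There is no separate "is it full?" or "is it big enough?" branch to get wrong; the cost is that every input pays for a push, even values that are popped straight back out.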

#6


1  

Edit:

If only the first N values are needed and the others are not of any interest, a plain old array will get the work done cheaply.

Keep it sorted and test against the biggest. And only if it needs to be stored, insert it correctly and shift the remaining elements. With small sizes this is a cheap operation, and my guess is it won't be done often.

#7


1  

If you have a fixed size of 10, why not simply use a sorted array of length 10 and binary search? But I am not sure whether, at this size, binary search is a big win over a dumb linear search along the array, given its overhead.

#8


0  

Use binary insertion sort on a raw array, pushing the smallest value off the end. This is routinely the fastest method used to maintain small sorted arrays and, for example, is generally used as a special case for various sorting algorithms (e.g. MergeSort).
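
A minimal sketch of binary insertion into a descending array, in Python for illustration only. The name `add_binary` is hypothetical, and the sketch assumes n >= 1; in C# the same idea would use a raw array and an explicit shift instead of `list.insert`.

```python
def add_binary(top, n, value):
    """Insert value into `top` (descending order, at most n items) using binary search."""
    # Constant-time rejection against the lowest kept value.
    if len(top) == n and value <= top[-1]:
        return False
    lo, hi = 0, len(top)
    while lo < hi:
        mid = (lo + hi) // 2
        if top[mid] >= value:
            lo = mid + 1  # value belongs after mid (and after equal items)
        else:
            hi = mid
    top.insert(lo, value)  # shifts the tail down by one slot
    if len(top) > n:
        top.pop()          # push the smallest value off the end
    return True


top = []
for v in [5, 1, 9, 3, 7, 9, 2]:
    add_binary(top, 3, v)
print(top)  # [9, 9, 7]
```

The binary search only reduces the number of comparisons; the shift is still O(N), so for N < 10 the difference from a linear scan is likely small.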
