Efficient insertion and searching of strings

In an application I will have between about 3,000 and 30,000 strings. After they are created (read from files, unordered), strings will rarely be added, but it WILL happen sometimes. Deletions will also be rare. Comparing a string against the stored ones will happen frequently.

Which structure would suit this best: a hashtable, a tree (Red-Black, Splay, ...), or just an ordered list (maybe a StringArray?)?

(Additional remark: a link to a good C# implementation would be appreciated as well.)

5 answers

#1

It sounds like you simply need a hashtable, so HashSet<T> would seem to be the ideal choice. (You don't seem to require keys, but Dictionary<TKey, TValue> would be the right option if you did, of course.)

Here's a summary of the time complexities of the different operations on a HashSet<T> of size n. They're partly based on the fact that the type uses an array as its backing data structure.

Insertion: typically O(1), but potentially O(n) if the array needs to be resized.
Deletion: O(1)
Exists (Contains): O(1) (given ideal hashtable buckets)
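As a rough illustration of the operations listed above, here is a minimal sketch using HashSet<string>. The sample strings are made up, and the snippet assumes a compiler that supports top-level statements (C# 9 or later); on older versions the same calls would go inside a Main method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sample data; in the question these would be read from files.
string[] loaded = { "pear", "apple", "fig", "apple" };

// Building the set from a sequence silently drops the duplicate "apple".
var words = new HashSet<string>(loaded);

bool added = words.Add("kiwi");        // true: "kiwi" was not present yet
bool addedAgain = words.Add("kiwi");   // false: duplicates are rejected
bool hasFig = words.Contains("fig");   // true
bool removed = words.Remove("pear");   // true: "pear" was present
bool hasPear = words.Contains("pear"); // false after the removal

Console.WriteLine($"{words.Count} strings stored"); // prints "3 strings stored"
```

Add and Remove both return a bool, so there is no need for a separate Contains call before either of them.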

Someone please correct me if any of these are wrong. They are just my best guesses from what I know of the implementation and of hashtables in general.

#2

HashSet is very good for fast insertion and search. Add, Remove and Contains are O(1).

Edit: the O(1) figure for Add assumes the array does not need to be resized. If it does, Add is O(n), as Noldorin has stated.

I used HashSet on a recent project upgrading a VB 6 application (which I didn't write) to .NET 3.5, where I was iterating over a collection that had child items, and each child item could appear in more than one parent item. The application processed a list of items I wanted to send to an API that charges a lot of money per call.

I basically used the HashSet to keep track of items I'd already sent, to prevent us incurring an unnecessary charge. As the process was invoked several times (it is basically a batch job with multiple commands), I serialized the HashSet between invocations. This worked very well: I had a requirement to reuse as much of the existing code as possible, as it had been thoroughly tested. The HashSet certainly performed very fast.
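The "already sent" bookkeeping described above might look roughly like the sketch below. The parent/child ids and the SendToApi stand-in are hypothetical, and persisting the set between invocations is left out. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical item ids; "item2" is shared between two parents.
var parents = new Dictionary<string, string[]>
{
    ["parentA"] = new[] { "item1", "item2" },
    ["parentB"] = new[] { "item2", "item3" },
};

var alreadySent = new HashSet<string>();
var sentLog = new List<string>();

// Stand-in for the expensive API call (hypothetical); it only records the id.
void SendToApi(string id) => sentLog.Add(id);

foreach (var children in parents.Values)
{
    foreach (var id in children)
    {
        // Add returns false when the id is already in the set,
        // so each item is sent at most once.
        if (alreadySent.Add(id))
        {
            SendToApi(id);
        }
    }
}

Console.WriteLine($"{sentLog.Count} calls made for {parents.Count} parents");
```

Using the return value of Add does the membership check and the insertion in a single hash lookup.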

#3

If you're looking for real-time performance or optimal memory efficiency, I'd recommend a radix tree or an explicit suffix or prefix tree. Otherwise I'd probably use a hash.

Trees have the advantage of fixed bounds on worst-case lookup, insertion and deletion times (based on the length of the pattern you're looking up). Hash-based solutions have the advantage of being a whole lot easier to code (you get these out of the box in C#) and cheaper to construct initially, and if properly configured they have similar average-case performance. However, they do tend to use more memory, and the time taken by lookups and insertions (and, depending on the implementation, possibly deletions) is non-deterministic.
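To make the prefix-tree idea a bit more concrete, here is a minimal sketch of a trie with Insert and Contains, where the work per operation is proportional to the length of the string rather than the number of stored strings. The names and the layout are my own assumptions: nodes are plain integer ids and the edges live in a dictionary keyed by (node, character), which keeps the sketch short but means the strict worst-case bound of a real radix tree (with array-based or sorted child lists) is not preserved. It assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

// Node 0 is the root; every other node id is allocated on demand.
var edges = new Dictionary<(int Node, char Symbol), int>();
var terminal = new HashSet<int>();   // nodes at which a stored string ends
int nextNode = 1;

void Insert(string s)
{
    int node = 0;
    foreach (char c in s)
    {
        // Follow the edge for this character, creating it if needed.
        if (!edges.TryGetValue((node, c), out int child))
        {
            child = nextNode++;
            edges[(node, c)] = child;
        }
        node = child;
    }
    terminal.Add(node);
}

bool Contains(string s)
{
    int node = 0;
    foreach (char c in s)
    {
        // A missing edge means no stored string starts with this prefix.
        if (!edges.TryGetValue((node, c), out node)) return false;
    }
    return terminal.Contains(node);
}

Insert("car");
Insert("cart");
Insert("dog");

Console.WriteLine(Contains("car"));  // True
Console.WriteLine(Contains("ca"));   // False: only a prefix of stored strings
```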

#4

The answers recommending HashSet<T> are spot on if your comparisons are just "is this string present in the set or not". You could even use different IEqualityComparer<string> implementations (probably choosing from the ones in StringComparer) to control case sensitivity and so on.
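For instance, passing one of the StringComparer instances to the HashSet<string> constructor changes what counts as "the same string". A small sketch with made-up values, assuming C# 9 or later for top-level statements:

```csharp
using System;
using System.Collections.Generic;

var exact = new HashSet<string>(StringComparer.Ordinal);
var ignoreCase = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

exact.Add("Hello");
ignoreCase.Add("Hello");

bool exactHit = exact.Contains("HELLO");         // false: case differs
bool relaxedHit = ignoreCase.Contains("HELLO");  // true: case is ignored
bool addedVariant = ignoreCase.Add("hello");     // false: treated as a duplicate

Console.WriteLine($"{exactHit} {relaxedHit} {addedVariant}");
```

The comparer is fixed when the set is constructed, so it applies consistently to Add, Remove and Contains.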

Is this the only type of comparison you need, or do you need things like "where would this string appear in the set if it were actually an ordered list?" If you need that sort of check, then you'll probably want to do a binary search. (List<T> provides a BinarySearch method; I don't know why SortedList and SortedDictionary don't, as both would be able to search pretty easily. Admittedly a SortedDictionary search wouldn't be quite the same as a normal binary search, but I believe it would still usually have similar characteristics.)
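A quick sketch of that kind of check with List<T>.BinarySearch: when the item is missing, the method returns a negative number whose bitwise complement is the index at which the item would have to be inserted. The sample strings and the Locate helper are made up for illustration; it assumes C# 9 or later.

```csharp
using System;
using System.Collections.Generic;

var sorted = new List<string> { "pear", "apple", "fig" };
sorted.Sort(StringComparer.Ordinal);   // apple, fig, pear

// Hypothetical helper: searches with the same comparer used for sorting.
int Locate(string s) => sorted.BinarySearch(s, StringComparer.Ordinal);

int found = Locate("fig");       // 1: index of the existing entry
int missing = Locate("grape");   // negative: "grape" is not stored
int insertAt = ~missing;         // 2: it would sit between "fig" and "pear"

// The lookup is O(log n), but inserting into a List<T> shifts elements, so it is O(n).
if (missing < 0) sorted.Insert(insertAt, "grape");

Console.WriteLine(string.Join(", ", sorted)); // prints "apple, fig, grape, pear"
```

With rare insertions and at most about 30,000 strings, the O(n) shift on insert is unlikely to matter much, but the list must always be searched with the same comparer it was sorted with.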

As I say, if you only want "in the set or not" checking, then HashSet<T> is your friend. I just thought I'd bring up the rest just in case :)

#5

If you need to know "where would this string appear in the set if it were actually an ordered list" (as in Jon Skeet's answer), you could consider a trie. This solution can only be used for certain types of "string-like" data, and if the "alphabet" is large compared to the number of strings it can quickly lose its advantages. Cache locality could also be a problem.

This could be over-engineered for a largely precomputed set of only N = 30,000 things, however. You might even do better by just allocating an array of k * N Optional and filling it by skipping k spaces between each actual thing (thus reducing the probability that your rare insertions will require reallocation, still leaving you with a variant of binary search, and keeping your items in sorted order). If you need a precise "where would this string appear in the set", though, this wouldn't work: you would need either O(n) time to check whether each space before the item is blank, or O(n) time on insert to update a "how many items are really before me" counter in each slot. It could provide you with very fast imprecise indexes, though, and those indexes would be stable between insertions/deletions.
