b树比AVL或红黑树快?

时间:2022-08-27 11:04:20

I know that performance never is black and white, often one implementation is faster in case X and slower in case Y, etc. but in general - are B-trees faster then AVL or RedBlack-Trees? They are considerably more complex to implement then AVL trees (and maybe even RedBlack-trees?), but are they faster (does their complexity pay off) ?

我知道,性能永远不会是黑白色的,通常情况下,在X和Y的情况下,执行速度会更快,但是一般来说,b树会比AVL或redblack树更快吗?它们要比AVL树(甚至可能是红黑树)更复杂,但是它们更快(它们的复杂性会得到回报吗)?

Edit: I should also like to add that if they are faster then the equivalent AVL/RedBlack tree (in terms of nodes/content) - why are they faster?

编辑:我还想补充一点,如果它们比等效AVL/RedBlack树(在节点/内容方面)更快,那么它们为什么会更快呢?

9 个解决方案

#1


106  

Sean's post (the currently accepted one) contains several incorrect claims. Sorry Sean, I don't mean to be rude; I hope I can convince you that my statement is based in fact.

Sean的帖子(目前被接受的帖子)包含了一些不正确的说法。对不起,肖恩,我不是故意的;我希望我能说服你我的陈述是基于事实的。

They're totally different in their use cases, so it's not possible to make a comparison.

它们在用例中完全不同,所以不可能进行比较。

They're both used for maintaining a set of totally ordered items with fast lookup, insertion and deletion. They have the same interface and the same intention.

它们都用于维护一组快速查找、插入和删除的完全有序项。他们有相同的界面和相同的意图。

RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. [...]

RB树通常是用于提供快速访问(理想情况下是O(logN))到数据的内存结构。[…]

always O(log n)

总是O(log n)

B-trees are typically disk-based structures, and so are inherently slower than in-memory data.

b -树通常是基于磁盘的结构,因此固有的慢于内存中的数据。

Nonsense. When you store search trees on disk, you typically use B-trees. That much is true. When you store data on disk, it's slower to access than data in memory. But a red-black tree stored on disk is also slower than a red-black tree stored in memory.

无稽之谈。当您在磁盘上存储搜索树时,通常使用b树。这是真实的。当您在磁盘上存储数据时,访问的速度比内存中的数据要慢。但是,存储在磁盘上的红黑树也比存储在内存中的红黑树慢。

You're comparing apples and oranges here. What is really interesting is a comparison of in-memory B-trees and in-memory red-black trees.

你在拿苹果和桔子做比较。真正有趣的是对内存中的b -树和内存中的红黑树的比较。

[As an aside: B-trees, as opposed to red-black trees, are theoretically efficient in the I/O-model. I have experimentally tested (and validated) the I/O-model for sorting; I'd expect it to work for B-trees as well.]

另外,与红黑树相对的b树在理论上是有效的I/ o模型。我已通过实验测试(并验证)I/ o模型进行排序;我希望它也能用于b -树。]

B-trees are rarely binary trees, the number of children a node can have is typically a large number.

b树很少是二叉树,一个节点的子节点数通常是一个大的数字。

To be clear, the size range of B-tree nodes is a parameter of the tree (in C++, you may want to use an integer value as a template parameter).

要清楚,B-tree节点的大小范围是树的一个参数(在c++中,您可能希望使用整数值作为模板参数)。

The management of the B-tree structure can be quite complicated when the data changes.

当数据发生变化时,B-tree结构的管理可能会非常复杂。

I remember them to be much simpler to understand (and implement) than red-black trees.

我记得它们比红黑树更容易理解(和实现)。

B-tree try to minimize the number of disk accesses so that data retrieval is reasonably deterministic.

b树试图最小化磁盘访问的数量,以便数据检索具有合理的确定性。

That much is true.

这是真实的。

It's not uncommon to see something like 4 B-tree access necessary to lookup a bit of data in a very database.

在一个非常数据库中查找一些数据需要4个b树访问,这并不少见。

Got data?

有数据吗?

In most cases I'd say that in-memory RB trees are faster.

在大多数情况下,我认为内存中的RB树比较快。

Got data?

有数据吗?

Because the lookup is binary it's very easy to find something. B-tree can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation.

因为查找是二进制的,所以很容易找到东西。b树每个节点可以有多个子节点,因此在每个节点上,您必须扫描节点以寻找合适的子节点。这是一个O(N)操作。

The size of each node is a fixed parameter, so even if you do a linear scan, it's O(1). If we big-oh over the size of each node, note that you typically keep the array sorted so it's O(log n).

每个节点的大小是一个固定的参数,所以即使你做一个线性扫描,也是O(1)。如果我们对每个节点的大小进行大处理,请注意,您通常保持数组的排序,所以它是O(log n)。

On a RB-tree it'd be O(logN) since you're doing one comparison and then branching.

在一个rb树上,它是O(logN)因为你在做一个比较,然后分支。

You're comparing apples and oranges. The O(log n) is because the height of the tree is at most O(log n), just as it is for a B-tree.

你在比较苹果和桔子。O(log n)是因为树的高度最多O(log n),就像对b树一样。

Also, unless you play nasty allocation tricks with the red-black trees, it seems reasonable to conjecture that B-trees have better caching behavior (it accesses an array, not pointers strewn about all over the place, and has less allocation overhead increasing memory locality even more), which might help it in the speed race.

同样,除非你与“红-黑”树玩的分配的把戏,它似乎是合理的猜想,b树有更好的缓存行为(指针访问数组,而不是散落得到处都是,和分配开销增加内存位置更少),这可能有助于它的速度竞赛。

I can point to experimental evidence that B-trees (with size parameters 32 and 64, specifically) are very competitive with red-black trees for small sizes, and outperforms it hands down for even moderately large values of n. See http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html

我可以通过实验证明,b -树(尺寸参数32和64,特别是)与红黑树的小尺寸非常有竞争优势,甚至比大小写的n都要大。

B-trees are faster. Why? I conjecture that it's due to memory locality, better caching behavior and less pointer chasing (which are, if not the same things, overlapping to some degree).

b树更快。为什么?我推测它是由于内存位置、更好的缓存行为和更少的指针追逐(如果不是相同的东西,在某种程度上是重叠的)。

#2


90  

Actually Wikipedia has a great article that shows every RB-Tree can easily be expressed as a B-Tree. Take the following tree as sample:

实际上,*有一篇很好的文章,它展示了所有的rb树都可以很容易地表示为b -树。以下列树为样本:

b树比AVL或红黑树快?

now just convert it to a B-Tree (to make this more obvious, nodes are still colored R/B, what you usually don't have in a B-Tree):

现在把它转换成B树(为了使这个更明显,节点仍然是彩色的R/B,你通常没有在B-树中):

Same Tree as B-Tree

b -树树一样

(cannot add the image here for some weird reason)

(由于某些奇怪的原因,不能添加图像)

Same is true for any other RB-Tree. It's taken from this article:

对于任何其他的rb树也是如此。它取自这篇文章:

http://en.wikipedia.org/wiki/Red-black_tree

http://en.wikipedia.org/wiki/Red-black_tree

To quote from this article:

引用这篇文章:

The red-black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values per cluster with a maximum capacity of 3 values.

在结构上,红黑树在结构上等价于第4级的b树,每个集群的最小填充系数为33%,最大容量为3。

I found no data that one of both is significantly better than the other one. I guess one of both had already died out if that was the case. They are different regarding how much data they must store in memory and how complicated it is to add/remove nodes from the tree.

我没有发现其中一个明显优于另一个的数据。我想,如果是这样的话,两者中的一个已经灭绝了。对于在内存中必须存储多少数据,以及在树中添加/删除节点是多么复杂,它们是不同的。

Update:

My personal tests suggest that B-Trees are better when searching for data, as they have better data locality and thus the CPU cache can do compares somewhat faster. The higher the order of a B-Tree (the order is the number of children a note can have), the faster the lookup will get. On the other hand, they have worse performance for adding and removing new entries the higher their order is. This is caused by the fact that adding a value within a node has linear complexity. As each node is a sorted array, you must move lots of elements around within that array when adding an element into the middle: all elements to the left of the new element must be moved one position to the left or all elements to the right of the new element must be moved one position to the right. If a value moves one node upwards during an insert (which happens frequently in a B-Tree), it leaves a hole which must be also be filled either by moving all elements from the left one position to the right or by moving all elements to the right one position to the left. These operations (in C usually performed by memmove) are in fact O(n). So the higher the order of the B-Tree, the faster the lookup but the slower the modification. On the other hand if you choose the order too low (e.g. 3), a B-Tree shows little advantages or disadvantages over other tree structures in practice (in such a case you can as well use something else). Thus I'd always create B-Trees with high orders (at least 4, 8 and up is fine).

我的个人测试表明,在搜索数据时,b树比较好,因为它们有更好的数据局部性,因此CPU缓存能做的比较快。b -树的顺序越高(order是一个音符可以拥有的孩子的数量),查找就会越快。另一方面,在添加和删除新条目时,它们的性能更差。这是因为在一个节点中添加一个值具有线性复杂度。作为每个节点是一个排序的数组,你必须移动大量的元素数组中添加一个元素时中间:所有元素左边的新元素必须向左移动一个位置或所有元素右边的新元素必须向右移动一个位置。如果insert期间向上移动一个节点的值(经常发生在b - tree),它让一个洞也必须是通过把所有元素从左向右一个位置或通过移动所有元素向左向右移动一个位置。这些操作(通常由memmove执行)实际上是O(n)。所以b树的顺序越高,查找的速度越快,但是修改的速度越慢。另一方面,如果您选择的顺序太低(例如,3),在实践中,b树在其他树结构上显示的优势或劣势(在这种情况下,您也可以使用其他的东西)。因此,我总是创建具有高命令的b树(至少4、8和up是可以的)。

File systems, which often base on B-Trees, use much higher orders (order 200 and even a lot more) - this is because they usually choose the order high enough so that a note (when containing maximum number of allowed elements) equals either the size of a sector on harddrive or of a cluster of the filesystem. This gives optimal performance (since a HD can only write a full sector at a time, even when just one byte is changed, the full sector is rewritten anyway) and optimal space utilization (as each data entry on drive equals at least the size of one cluster or is a multiple of the cluster sizes, no matter how big the data really is). Caused by the fact that the hardware sees data as sectors and the file system groups sectors to clusters, B-Trees can yield much better performance and space utilization for file systems than any other tree structure can; that's why they are so popular for file systems.

文件系统,通常基于b -树,用更高的订单(订单200甚至更多的),这是因为他们通常选择订单足够高,以便报告(包含最大允许数元素时)=部门在硬盘的大小或的集群文件系统。这给最佳性能(因为HD只能写一个完整的部门,即使只是一个字节是改变了,整个部门重写)和最优空间利用率(每个数据条目在开车至少等于一个集群的大小或者是多个集群的大小,无论多大的数据确实是)。由于硬件将数据视为扇区和文件系统组分组到集群,因此b树比任何其他树结构都能更好地实现文件系统的性能和空间利用率;这就是为什么它们如此受文件系统欢迎的原因。

When your app is constantly updating the tree, adding or removing values from it, a RB-Tree or an AVL-Tree may show better performance on average compared to a B-Tree with high order. Somewhat worse for the lookups and they might also need more memory, but therefor modifications are usually fast. Actually RB-Trees are even faster for modifications than AVL-Trees, therefor AVL-Trees are a little bit faster for lookups as they are usually less deep.

当您的应用程序不断更新树、添加或删除它的值时,rb树或avl树的平均性能可能比具有高阶的b树更好。更糟糕的是查找,它们可能还需要更多内存,但是修改通常很快。实际上,rb -树比avl树的修改速度更快,因此,avl树的查找速度要快一些,因为它们通常不那么深。

So as usual it depends a lot what your app is doing. My recommendations are:

和往常一样,这取决于你的应用程序在做什么。我的建议是:

  1. Lots of lookups, little modifications: B-Tree (with high order)
  2. 大量查找,小修改:b树(高阶)
  3. Lots of lookups, lots of modifiations: AVL-Tree
  4. 大量的查找,大量的修改:AVL-Tree。
  5. Little lookups, lots of modifications: RB-Tree
  6. 小查找,很多修改:rb树。

An alternative to all these trees are AA-Trees. As this PDF paper suggests, AA-Trees (which are in fact a sub-group of RB-Trees) are almost equal in performance to normal RB-Trees, but they are much easier to implement than RB-Trees, AVL-Trees, or B-Trees. Here is a full implementation, look how tiny it is (the main-function is not part of the implementation and half of the implementation lines are actually comments).

所有这些树的另一种选择是树。正如这篇PDF文件所指出的,aa树(实际上是一个子群的rb树)在性能上与普通的rb树几乎相等,但是它们比rb树、avl树或b树更容易实现。这里是一个完整的实现,看看它是多么的小(主函数不是实现的一部分,一半的实现行实际上是注释)。

As the PDF paper shows, a Treap is also an interesting alternative to classic tree implementation. A Treap is also a binary tree, but one that doesn't try to enforce balancing. To avoid worst case scenarios that you may get in unbalanced binary trees (causing lookups to become O(n) instead of O(log n)), a Treap adds some randomness to the tree. Randomness cannot guarantee that the tree is well balanced, but it also makes it highly unlikely that the tree is extremely unbalanced.

正如PDF文件所显示的那样,Treap也是传统树实现的一个有趣的选择。Treap也是一种二叉树,但它并没有试图实现平衡。为了避免最坏的情况,您可能会遇到不平衡的二叉树(导致查找变成O(n)而不是O(log n)), Treap给树添加了一些随机性。随机性不能保证树是平衡的,但是它也使得树极不平衡。

#3


27  

Nothing prevents a B-Tree implementation that works only in memory. In fact, if key comparisons are cheap, in-memory B-Tree can be faster because its packing of multiple keys in one node will cause less cache misses during searches. See this link for performance comparisons. A quote: "The speed test results are interesting and show the B+ tree to be significantly faster for trees containing more than 16,000 items." (B+Tree is just a variation on B-Tree).

没有什么可以阻止只在内存中工作的b -树实现。事实上,如果关键的比较是便宜的,内存中的b树可以更快,因为它在一个节点上的多个键的打包会导致在搜索过程中缓存丢失更少。请参阅此链接以进行性能比较。引用:“速度测试结果很有趣,并且显示B+树对包含超过16000个项目的树的速度显著加快。”(B+树只是B-树的一个变种)。

#4


10  

The question is old but I think it is still relevant. Jonas Kölker and Mecki gave very good answers but I don't think the answers cover the whole story. I would even argue that the whole discussion is missing the point :-).

这个问题已经过时了,但我认为它仍然是相关的。Jonas Kolker和Mecki给出了很好的答案,但我不认为答案涵盖了整个故事。我甚至认为整个讨论都没有重点:-)。

What was said about B-Trees is true when entries are relatively small (integers, small strings/words, floats, etc). When entries are large (over 100B) the differences become smaller/insignificant.

当条目相对较小(整数、小字符串/单词、浮点数等)时,关于b树的说法是正确的。当条目很大(超过100B)时,差异变得更小/无关紧要。

Let me sum up the main points about B-Trees:

让我总结一下b -树的主要观点:

  • They are faster than any Binary Search Tree (BSTs) due to memory locality (resulting in less cache and TLB misses).

    由于内存局部性(导致缓存和TLB丢失),它们比任何二进制搜索树(BSTs)都要快。

  • B-Trees are usually more space efficient if entries are relatively small or if entries are of variable size. Free space management is easier (you allocate larger chunks of memory) and the extra metadata overhead per entry is lower. B-Trees will waste some space as nodes are not always full, however, they still end up being more compact that Binary Search Trees.

    如果条目相对较小,或者条目的大小是可变的,那么b树通常更有效。*空间管理更容易(您分配更大的内存块),每个条目的额外元数据开销更低。b树将会浪费一些空间,因为节点并不总是满的,然而,它们最终会变得更加紧凑,而二叉搜索树。

  • The big O performance ( O(logN) ) is the same for both. Moreover, if you do binary search inside each B-Tree node, you will even end up with the same number of comparisons as in a BST (it is a nice math exercise to verify this). If the B-Tree node size is sensible (1-4x cache line size), linear searching inside each node is still faster because of the hardware prefetching. You can also use SIMD instructions for comparing basic data types (e.g. integers).

    大O性能(O(logN))对两者都是一样的。此外,如果在每个B-Tree节点中进行二进制搜索,甚至会得到与BST相同数量的比较(这是一个很好的数学练习来验证这一点)。如果b树节点的大小是合理的(1-4x高速缓存线大小),那么由于硬件的预取,每个节点内部的线性搜索仍然更快。您还可以使用SIMD指令来比较基本数据类型(如整数)。

  • B-Trees are better suited for compression: there is more data per node to compress. In certain cases this can be a huge benefit. Just think of an auto-incrementing key in a relational database table that is used to build an index. The lead nodes of a B-Tree contain consecutive integers that compress very, very well.

    b -树更适合于压缩:每个节点有更多的数据压缩。在某些情况下,这可能是一个巨大的好处。只需考虑在关系数据库表中使用自动递增键来构建索引。b -树的主节点包含连续整数,它们压缩得非常好。

  • B-Trees are clearly much, much faster when stored on secondary storage (where you need to do block IO).

    显然,当存储在辅助存储器上(需要执行block IO)时,b树的速度要快得多。

On paper, B-Trees have a lot of advantages and close to no disadvantages. So should one just use B-Trees for best performance?

在纸上,b树有很多优点,也没有缺点。那么,是否应该使用b树来获得最佳性能呢?

The answer is usually NO -- if the tree fits in memory. In cases where performance is crucial you want a thread-safe tree-like data-structure (simply put, several threads can do more work than a single one). It is more problematic to make a B-Tree support concurrent accesses than to make a BST. The most straight-forward way to make a tree support concurrent accesses is to lock nodes as you are traversing/modifying them. In a B-Tree you lock more entries per node, resulting in more serialization points and more contended locks.

答案通常是否定的——如果树适合记忆。在性能至关重要的情况下,您需要一个线程安全的树状数据结构(简单地说,几个线程可以比单个线程做更多的工作)。让b树支持并发访问比创建BST更有问题。使树支持并发访问的最直接方法是在遍历/修改节点时锁定节点。在B-Tree中,每个节点会锁定更多的条目,从而导致更多的序列化点和更多的争用锁。

All tree versions (AVL, Red/Black, B-Tree, an others) have countless variants that differ in how they support concurrency. The vanilla algorithms that are taught in a university course or read from some introductory books are almost never used in practice. So, it is hard to say which tree performs best as there is no official agreement on the exact algorithms are behind each tree. I would suggest to think of the trees mentioned more like data-structure classes that obey certain tree-like invariants rather than precise data-structures.

所有的树版本(AVL,红/黑,b -树,其他)都有无数的变体,它们在支持并发性方面存在差异。在大学课程中讲授或阅读一些入门书籍的香草算法在实践中几乎从未被使用过。因此,很难说哪棵树表现最好,因为在每棵树后面都没有关于精确算法的官方协议。我建议把这些树想象成更像数据结构类,它们遵从某些树状的不变量,而不是精确的数据结构。

Take for example the B-Tree. The vanilla B-Tree is almost never used in practice -- you cannot make it to scale well! The most common B-Tree variant used is the B+-Tree (widely used in file-systems, databases). The main differences between the B+-Tree and the B-Tree: 1) you dont store entries in the inner nodes of the tree (thus you don't need write locks high in the tree when modifying an entry stored in an inner node); 2) you have links between nodes at the same level (thus you do not have to lock the parent of a node when doing range searches).

以B-Tree为例。香草b树在实践中几乎从未被使用过——你不能让它很好地伸缩!最常见的B树变体是B+树(在文件系统、数据库中广泛使用)。B+-树和B-Tree的主要区别是:1)在树的内部节点中不存储条目(因此,在修改存储在内部节点中的条目时,不需要在树中写高锁);2)在相同级别的节点之间有链接(因此在进行范围搜索时不必锁定节点的父节点)。

I hope this helps.

我希望这可以帮助。

#5


7  

Guys from Google recently released their implementation of STL containers, which is based on B-trees. They claim their version is faster and consume less memory compared to standard STL containers, implemented via red-black trees. More details here

谷歌的成员最近发布了基于b树的STL容器的实现。他们声称他们的版本比标准的STL容器(通过红黑树实现)更快,消耗更少的内存。更多细节在这里

#6


2  

For some applications, B-trees are significantly faster than BSTs. The trees you may find here:

对于某些应用程序,b树比BSTs要快得多。你可以在这里找到的树木:

http://freshmeat.net/projects/bps

http://freshmeat.net/projects/bps

are quite fast. They also use less memory than regular BST implementations, since they do not require the BST infrastructure of 2 or 3 pointers per node, plus some extra fields to keep the balancing information.

非常快。它们使用的内存也比常规BST实现少,因为它们不需要每节点2或3个指针的BST基础结构,还需要一些额外的字段来保持平衡信息。

#7


0  

THey are sed in different circumstances - B-trees are used when the tree nodes need to be kept together in storage - typically because storage is a disk page and so re-balancing could be vey expensive. RB trees are used when you don't have this constraint. So B-trees will probably be faster if you want to implement (say) a relational database index, while RB trees will probably be fasterv for (say) an in memory search.

它们在不同的环境中被使用——当树节点需要在存储中保持在一起时使用b -树,这通常是因为存储是一个磁盘页,所以重新平衡可能非常昂贵。当您没有这种约束时使用RB树。因此,如果您想要实现(比方说)一个关系数据库索引,那么b树可能会更快,而RB树在内存搜索中可能是fasterv(比方说)。

#8


0  

They all have the same asymptotic behavior, so the performance depends more on the implementation than which type of tree you are using. Some combination of tree structures might actually be the fastest approach, where each node of a B-tree fits exactly into a cache-line and some sort of binary tree is used to search within each node. Managing the memory for the nodes yourself might also enable you to achieve even greater cache locality, but at a very high price.

它们都具有相同的渐近行为,因此性能更多地依赖于实现,而不是使用哪种类型的树。树结构的某些组合实际上可能是最快的方法,在这里,b树的每个节点都完全适合于缓存线,并且在每个节点中使用某种二叉树来搜索。管理节点的内存也可能使您能够实现更大的缓存位置,但是代价非常高。

Personally, I just use whatever is in the standard library for the language I am using, since it's a lot of work for a very small performance gain (if any).

就我个人而言,我只是在标准库中使用我正在使用的语言,因为这是一个非常小的性能增益(如果有的话)。

On a theoretical note... RB-trees are actually very similar to B-trees, since they simulate the behavior of 2-3-4 trees. AA-trees are a similar structure, which simulates 2-3 trees instead.

理论的一面……rb -树实际上与b -树非常相似,因为它们模拟了2-3-4树的行为。aa树是一种类似的结构,可以模拟2-3棵树。

#9


0  

moreover ...the height of a red black tree is O(log[2] N) whereas that of B-tree is O(log[q] N) where ceiling[N]<= q <= N . So if we consider comparisons in each key array of B-tree (that is fixed as mentioned above) then time complexity of B-tree <= time complexity of Red-black tree. (equal case for single record equal in size of a block size)

此外……红黑树的高度是O(log[2] N),而B-tree的高度是O(log[q] N),其中的上限[N]<= q <= N。因此,如果我们考虑在b -树的每个键数组中进行比较(如上所述,这是固定的),那么b树的时间复杂度就等于红黑树的时间复杂度。(相同情况下,单项记录大小相等,大小为块大小)

#1


106  

Sean's post (the currently accepted one) contains several incorrect claims. Sorry Sean, I don't mean to be rude; I hope I can convince you that my statement is based in fact.

Sean的帖子(目前被接受的帖子)包含了一些不正确的说法。对不起,肖恩,我不是故意的;我希望我能说服你我的陈述是基于事实的。

They're totally different in their use cases, so it's not possible to make a comparison.

它们在用例中完全不同,所以不可能进行比较。

They're both used for maintaining a set of totally ordered items with fast lookup, insertion and deletion. They have the same interface and the same intention.

它们都用于维护一组快速查找、插入和删除的完全有序项。他们有相同的界面和相同的意图。

RB trees are typically in-memory structures used to provide fast access (ideally O(logN)) to data. [...]

RB树通常是用于提供快速访问(理想情况下是O(logN))到数据的内存结构。[…]

always O(log n)

总是O(log n)

B-trees are typically disk-based structures, and so are inherently slower than in-memory data.

b -树通常是基于磁盘的结构,因此固有的慢于内存中的数据。

Nonsense. When you store search trees on disk, you typically use B-trees. That much is true. When you store data on disk, it's slower to access than data in memory. But a red-black tree stored on disk is also slower than a red-black tree stored in memory.

无稽之谈。当您在磁盘上存储搜索树时,通常使用b树。这是真实的。当您在磁盘上存储数据时,访问的速度比内存中的数据要慢。但是,存储在磁盘上的红黑树也比存储在内存中的红黑树慢。

You're comparing apples and oranges here. What is really interesting is a comparison of in-memory B-trees and in-memory red-black trees.

你在拿苹果和桔子做比较。真正有趣的是对内存中的b -树和内存中的红黑树的比较。

[As an aside: B-trees, as opposed to red-black trees, are theoretically efficient in the I/O-model. I have experimentally tested (and validated) the I/O-model for sorting; I'd expect it to work for B-trees as well.]

另外,与红黑树相对的b树在理论上是有效的I/ o模型。我已通过实验测试(并验证)I/ o模型进行排序;我希望它也能用于b -树。]

B-trees are rarely binary trees, the number of children a node can have is typically a large number.

b树很少是二叉树,一个节点的子节点数通常是一个大的数字。

To be clear, the size range of B-tree nodes is a parameter of the tree (in C++, you may want to use an integer value as a template parameter).

要清楚,B-tree节点的大小范围是树的一个参数(在c++中,您可能希望使用整数值作为模板参数)。

The management of the B-tree structure can be quite complicated when the data changes.

当数据发生变化时,B-tree结构的管理可能会非常复杂。

I remember them to be much simpler to understand (and implement) than red-black trees.

我记得它们比红黑树更容易理解(和实现)。

B-tree try to minimize the number of disk accesses so that data retrieval is reasonably deterministic.

b树试图最小化磁盘访问的数量,以便数据检索具有合理的确定性。

That much is true.

这是真实的。

It's not uncommon to see something like 4 B-tree access necessary to lookup a bit of data in a very database.

在一个非常数据库中查找一些数据需要4个b树访问,这并不少见。

Got data?

有数据吗?

In most cases I'd say that in-memory RB trees are faster.

在大多数情况下,我认为内存中的RB树比较快。

Got data?

有数据吗?

Because the lookup is binary it's very easy to find something. B-tree can have multiple children per node, so on each node you have to scan the node to look for the appropriate child. This is an O(N) operation.

因为查找是二进制的,所以很容易找到东西。b树每个节点可以有多个子节点,因此在每个节点上,您必须扫描节点以寻找合适的子节点。这是一个O(N)操作。

The size of each node is a fixed parameter, so even if you do a linear scan, it's O(1). If we big-oh over the size of each node, note that you typically keep the array sorted so it's O(log n).

每个节点的大小是一个固定的参数,所以即使你做一个线性扫描,也是O(1)。如果我们对每个节点的大小进行大处理,请注意,您通常保持数组的排序,所以它是O(log n)。

On a RB-tree it'd be O(logN) since you're doing one comparison and then branching.

在一个rb树上,它是O(logN)因为你在做一个比较,然后分支。

You're comparing apples and oranges. The O(log n) is because the height of the tree is at most O(log n), just as it is for a B-tree.

你在比较苹果和桔子。O(log n)是因为树的高度最多O(log n),就像对b树一样。

Also, unless you play nasty allocation tricks with the red-black trees, it seems reasonable to conjecture that B-trees have better caching behavior (it accesses an array, not pointers strewn about all over the place, and has less allocation overhead increasing memory locality even more), which might help it in the speed race.

同样,除非你与“红-黑”树玩的分配的把戏,它似乎是合理的猜想,b树有更好的缓存行为(指针访问数组,而不是散落得到处都是,和分配开销增加内存位置更少),这可能有助于它的速度竞赛。

I can point to experimental evidence that B-trees (with size parameters 32 and 64, specifically) are very competitive with red-black trees for small sizes, and outperforms it hands down for even moderately large values of n. See http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html

我可以通过实验证明,b -树(尺寸参数32和64,特别是)与红黑树的小尺寸非常有竞争优势,甚至比大小写的n都要大。

B-trees are faster. Why? I conjecture that it's due to memory locality, better caching behavior and less pointer chasing (which are, if not the same things, overlapping to some degree).

b树更快。为什么?我推测它是由于内存位置、更好的缓存行为和更少的指针追逐(如果不是相同的东西,在某种程度上是重叠的)。

#2


90  

Actually Wikipedia has a great article that shows every RB-Tree can easily be expressed as a B-Tree. Take the following tree as sample:

实际上,*有一篇很好的文章,它展示了所有的rb树都可以很容易地表示为b -树。以下列树为样本:

b树比AVL或红黑树快?

now just convert it to a B-Tree (to make this more obvious, nodes are still colored R/B, what you usually don't have in a B-Tree):

现在把它转换成B树(为了使这个更明显,节点仍然是彩色的R/B,你通常没有在B-树中):

Same Tree as B-Tree

b -树树一样

(cannot add the image here for some weird reason)

(由于某些奇怪的原因,不能添加图像)

Same is true for any other RB-Tree. It's taken from this article:

对于任何其他的rb树也是如此。它取自这篇文章:

http://en.wikipedia.org/wiki/Red-black_tree

http://en.wikipedia.org/wiki/Red-black_tree

To quote from this article:

引用这篇文章:

The red-black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values per cluster with a maximum capacity of 3 values.

在结构上,红黑树在结构上等价于第4级的b树,每个集群的最小填充系数为33%,最大容量为3。

I found no data that one of both is significantly better than the other one. I guess one of both had already died out if that was the case. They are different regarding how much data they must store in memory and how complicated it is to add/remove nodes from the tree.

我没有发现其中一个明显优于另一个的数据。我想,如果是这样的话,两者中的一个已经灭绝了。对于在内存中必须存储多少数据,以及在树中添加/删除节点是多么复杂,它们是不同的。

Update:

My personal tests suggest that B-Trees are better when searching for data, as they have better data locality and thus the CPU cache can do compares somewhat faster. The higher the order of a B-Tree (the order is the number of children a note can have), the faster the lookup will get. On the other hand, they have worse performance for adding and removing new entries the higher their order is. This is caused by the fact that adding a value within a node has linear complexity. As each node is a sorted array, you must move lots of elements around within that array when adding an element into the middle: all elements to the left of the new element must be moved one position to the left or all elements to the right of the new element must be moved one position to the right. If a value moves one node upwards during an insert (which happens frequently in a B-Tree), it leaves a hole which must be also be filled either by moving all elements from the left one position to the right or by moving all elements to the right one position to the left. These operations (in C usually performed by memmove) are in fact O(n). So the higher the order of the B-Tree, the faster the lookup but the slower the modification. On the other hand if you choose the order too low (e.g. 3), a B-Tree shows little advantages or disadvantages over other tree structures in practice (in such a case you can as well use something else). Thus I'd always create B-Trees with high orders (at least 4, 8 and up is fine).

我的个人测试表明,在搜索数据时,b树比较好,因为它们有更好的数据局部性,因此CPU缓存能做的比较快。b -树的顺序越高(order是一个音符可以拥有的孩子的数量),查找就会越快。另一方面,在添加和删除新条目时,它们的性能更差。这是因为在一个节点中添加一个值具有线性复杂度。作为每个节点是一个排序的数组,你必须移动大量的元素数组中添加一个元素时中间:所有元素左边的新元素必须向左移动一个位置或所有元素右边的新元素必须向右移动一个位置。如果insert期间向上移动一个节点的值(经常发生在b - tree),它让一个洞也必须是通过把所有元素从左向右一个位置或通过移动所有元素向左向右移动一个位置。这些操作(通常由memmove执行)实际上是O(n)。所以b树的顺序越高,查找的速度越快,但是修改的速度越慢。另一方面,如果您选择的顺序太低(例如,3),在实践中,b树在其他树结构上显示的优势或劣势(在这种情况下,您也可以使用其他的东西)。因此,我总是创建具有高命令的b树(至少4、8和up是可以的)。

File systems, which often base on B-Trees, use much higher orders (order 200 and even a lot more) - this is because they usually choose the order high enough so that a note (when containing maximum number of allowed elements) equals either the size of a sector on harddrive or of a cluster of the filesystem. This gives optimal performance (since a HD can only write a full sector at a time, even when just one byte is changed, the full sector is rewritten anyway) and optimal space utilization (as each data entry on drive equals at least the size of one cluster or is a multiple of the cluster sizes, no matter how big the data really is). Caused by the fact that the hardware sees data as sectors and the file system groups sectors to clusters, B-Trees can yield much better performance and space utilization for file systems than any other tree structure can; that's why they are so popular for file systems.

文件系统,通常基于b -树,用更高的订单(订单200甚至更多的),这是因为他们通常选择订单足够高,以便报告(包含最大允许数元素时)=部门在硬盘的大小或的集群文件系统。这给最佳性能(因为HD只能写一个完整的部门,即使只是一个字节是改变了,整个部门重写)和最优空间利用率(每个数据条目在开车至少等于一个集群的大小或者是多个集群的大小,无论多大的数据确实是)。由于硬件将数据视为扇区和文件系统组分组到集群,因此b树比任何其他树结构都能更好地实现文件系统的性能和空间利用率;这就是为什么它们如此受文件系统欢迎的原因。

When your app is constantly updating the tree, adding or removing values from it, a RB-Tree or an AVL-Tree may show better performance on average compared to a B-Tree with high order. Somewhat worse for the lookups and they might also need more memory, but therefor modifications are usually fast. Actually RB-Trees are even faster for modifications than AVL-Trees, therefor AVL-Trees are a little bit faster for lookups as they are usually less deep.

当您的应用程序不断更新树、添加或删除它的值时,rb树或avl树的平均性能可能比具有高阶的b树更好。更糟糕的是查找,它们可能还需要更多内存,但是修改通常很快。实际上,rb -树比avl树的修改速度更快,因此,avl树的查找速度要快一些,因为它们通常不那么深。

So as usual it depends a lot what your app is doing. My recommendations are:

和往常一样,这取决于你的应用程序在做什么。我的建议是:

  1. Lots of lookups, little modifications: B-Tree (with high order)
  2. 大量查找,小修改:b树(高阶)
  3. Lots of lookups, lots of modifiations: AVL-Tree
  4. 大量的查找,大量的修改:AVL-Tree。
  5. Little lookups, lots of modifications: RB-Tree
  6. 小查找,很多修改:rb树。

An alternative to all these trees are AA-Trees. As this PDF paper suggests, AA-Trees (which are in fact a sub-group of RB-Trees) are almost equal in performance to normal RB-Trees, but they are much easier to implement than RB-Trees, AVL-Trees, or B-Trees. Here is a full implementation, look how tiny it is (the main-function is not part of the implementation and half of the implementation lines are actually comments).

所有这些树的另一种选择是树。正如这篇PDF文件所指出的,aa树(实际上是一个子群的rb树)在性能上与普通的rb树几乎相等,但是它们比rb树、avl树或b树更容易实现。这里是一个完整的实现,看看它是多么的小(主函数不是实现的一部分,一半的实现行实际上是注释)。

As the PDF paper shows, a Treap is also an interesting alternative to classic tree implementation. A Treap is also a binary tree, but one that doesn't try to enforce balancing. To avoid worst case scenarios that you may get in unbalanced binary trees (causing lookups to become O(n) instead of O(log n)), a Treap adds some randomness to the tree. Randomness cannot guarantee that the tree is well balanced, but it also makes it highly unlikely that the tree is extremely unbalanced.

正如PDF文件所显示的那样,Treap也是传统树实现的一个有趣的选择。Treap也是一种二叉树,但它并没有试图实现平衡。为了避免最坏的情况,您可能会遇到不平衡的二叉树(导致查找变成O(n)而不是O(log n)), Treap给树添加了一些随机性。随机性不能保证树是平衡的,但是它也使得树极不平衡。

#3


27  

Nothing prevents a B-Tree implementation that works only in memory. In fact, if key comparisons are cheap, in-memory B-Tree can be faster because its packing of multiple keys in one node will cause less cache misses during searches. See this link for performance comparisons. A quote: "The speed test results are interesting and show the B+ tree to be significantly faster for trees containing more than 16,000 items." (B+Tree is just a variation on B-Tree).

没有什么可以阻止只在内存中工作的b -树实现。事实上,如果关键的比较是便宜的,内存中的b树可以更快,因为它在一个节点上的多个键的打包会导致在搜索过程中缓存丢失更少。请参阅此链接以进行性能比较。引用:“速度测试结果很有趣,并且显示B+树对包含超过16000个项目的树的速度显著加快。”(B+树只是B-树的一个变种)。

#4


10  

The question is old but I think it is still relevant. Jonas Kölker and Mecki gave very good answers but I don't think the answers cover the whole story. I would even argue that the whole discussion is missing the point :-).

这个问题已经过时了,但我认为它仍然是相关的。Jonas Kolker和Mecki给出了很好的答案,但我不认为答案涵盖了整个故事。我甚至认为整个讨论都没有重点:-)。

What was said about B-Trees is true when entries are relatively small (integers, small strings/words, floats, etc). When entries are large (over 100B) the differences become smaller/insignificant.

当条目相对较小(整数、小字符串/单词、浮点数等)时,关于b树的说法是正确的。当条目很大(超过100B)时,差异变得更小/无关紧要。

Let me sum up the main points about B-Trees:

让我总结一下b -树的主要观点:

  • They are faster than any Binary Search Tree (BSTs) due to memory locality (resulting in less cache and TLB misses).

    由于内存局部性(导致缓存和TLB丢失),它们比任何二进制搜索树(BSTs)都要快。

  • B-Trees are usually more space efficient if entries are relatively small or if entries are of variable size. Free space management is easier (you allocate larger chunks of memory) and the extra metadata overhead per entry is lower. B-Trees will waste some space as nodes are not always full, however, they still end up being more compact that Binary Search Trees.

    如果条目相对较小,或者条目的大小是可变的,那么b树通常更有效。*空间管理更容易(您分配更大的内存块),每个条目的额外元数据开销更低。b树将会浪费一些空间,因为节点并不总是满的,然而,它们最终会变得更加紧凑,而二叉搜索树。

  • The big O performance ( O(logN) ) is the same for both. Moreover, if you do binary search inside each B-Tree node, you will even end up with the same number of comparisons as in a BST (it is a nice math exercise to verify this). If the B-Tree node size is sensible (1-4x cache line size), linear searching inside each node is still faster because of the hardware prefetching. You can also use SIMD instructions for comparing basic data types (e.g. integers).

    大O性能(O(logN))对两者都是一样的。此外,如果在每个B-Tree节点中进行二进制搜索,甚至会得到与BST相同数量的比较(这是一个很好的数学练习来验证这一点)。如果b树节点的大小是合理的(1-4x高速缓存线大小),那么由于硬件的预取,每个节点内部的线性搜索仍然更快。您还可以使用SIMD指令来比较基本数据类型(如整数)。

  • B-Trees are better suited for compression: there is more data per node to compress. In certain cases this can be a huge benefit. Just think of an auto-incrementing key in a relational database table that is used to build an index. The lead nodes of a B-Tree contain consecutive integers that compress very, very well.

    b -树更适合于压缩:每个节点有更多的数据压缩。在某些情况下,这可能是一个巨大的好处。只需考虑在关系数据库表中使用自动递增键来构建索引。b -树的主节点包含连续整数,它们压缩得非常好。

  • B-Trees are clearly much, much faster when stored on secondary storage (where you need to do block IO).

    显然,当存储在辅助存储器上(需要执行block IO)时,b树的速度要快得多。

On paper, B-Trees have a lot of advantages and close to no disadvantages. So should one just use B-Trees for best performance?

在纸上,b树有很多优点,也没有缺点。那么,是否应该使用b树来获得最佳性能呢?

The answer is usually NO -- if the tree fits in memory. In cases where performance is crucial you want a thread-safe tree-like data-structure (simply put, several threads can do more work than a single one). It is more problematic to make a B-Tree support concurrent accesses than to make a BST. The most straight-forward way to make a tree support concurrent accesses is to lock nodes as you are traversing/modifying them. In a B-Tree you lock more entries per node, resulting in more serialization points and more contended locks.

答案通常是否定的——如果树适合记忆。在性能至关重要的情况下,您需要一个线程安全的树状数据结构(简单地说,几个线程可以比单个线程做更多的工作)。让b树支持并发访问比创建BST更有问题。使树支持并发访问的最直接方法是在遍历/修改节点时锁定节点。在B-Tree中,每个节点会锁定更多的条目,从而导致更多的序列化点和更多的争用锁。

All tree versions (AVL, Red/Black, B-Tree, an others) have countless variants that differ in how they support concurrency. The vanilla algorithms that are taught in a university course or read from some introductory books are almost never used in practice. So, it is hard to say which tree performs best as there is no official agreement on the exact algorithms are behind each tree. I would suggest to think of the trees mentioned more like data-structure classes that obey certain tree-like invariants rather than precise data-structures.

所有的树版本(AVL,红/黑,b -树,其他)都有无数的变体,它们在支持并发性方面存在差异。在大学课程中讲授或阅读一些入门书籍的香草算法在实践中几乎从未被使用过。因此,很难说哪棵树表现最好,因为在每棵树后面都没有关于精确算法的官方协议。我建议把这些树想象成更像数据结构类,它们遵从某些树状的不变量,而不是精确的数据结构。

Take for example the B-Tree. The vanilla B-Tree is almost never used in practice -- you cannot make it to scale well! The most common B-Tree variant used is the B+-Tree (widely used in file-systems, databases). The main differences between the B+-Tree and the B-Tree: 1) you dont store entries in the inner nodes of the tree (thus you don't need write locks high in the tree when modifying an entry stored in an inner node); 2) you have links between nodes at the same level (thus you do not have to lock the parent of a node when doing range searches).

以B-Tree为例。香草b树在实践中几乎从未被使用过——你不能让它很好地伸缩!最常见的B树变体是B+树(在文件系统、数据库中广泛使用)。B+-树和B-Tree的主要区别是:1)在树的内部节点中不存储条目(因此,在修改存储在内部节点中的条目时,不需要在树中写高锁);2)在相同级别的节点之间有链接(因此在进行范围搜索时不必锁定节点的父节点)。

I hope this helps.

我希望这可以帮助。

#5


7  

Guys from Google recently released their implementation of STL containers, which is based on B-trees. They claim their version is faster and consume less memory compared to standard STL containers, implemented via red-black trees. More details here

谷歌的成员最近发布了基于b树的STL容器的实现。他们声称他们的版本比标准的STL容器(通过红黑树实现)更快,消耗更少的内存。更多细节在这里

#6


2  

For some applications, B-trees are significantly faster than BSTs. The trees you may find here:

对于某些应用程序,b树比BSTs要快得多。你可以在这里找到的树木:

http://freshmeat.net/projects/bps

http://freshmeat.net/projects/bps

are quite fast. They also use less memory than regular BST implementations, since they do not require the BST infrastructure of 2 or 3 pointers per node, plus some extra fields to keep the balancing information.

非常快。它们使用的内存也比常规BST实现少,因为它们不需要每节点2或3个指针的BST基础结构,还需要一些额外的字段来保持平衡信息。

#7


0  

THey are sed in different circumstances - B-trees are used when the tree nodes need to be kept together in storage - typically because storage is a disk page and so re-balancing could be vey expensive. RB trees are used when you don't have this constraint. So B-trees will probably be faster if you want to implement (say) a relational database index, while RB trees will probably be fasterv for (say) an in memory search.

它们在不同的环境中被使用——当树节点需要在存储中保持在一起时使用b -树,这通常是因为存储是一个磁盘页,所以重新平衡可能非常昂贵。当您没有这种约束时使用RB树。因此,如果您想要实现(比方说)一个关系数据库索引,那么b树可能会更快,而RB树在内存搜索中可能是fasterv(比方说)。

#8


0  

They all have the same asymptotic behavior, so the performance depends more on the implementation than which type of tree you are using. Some combination of tree structures might actually be the fastest approach, where each node of a B-tree fits exactly into a cache-line and some sort of binary tree is used to search within each node. Managing the memory for the nodes yourself might also enable you to achieve even greater cache locality, but at a very high price.

它们都具有相同的渐近行为,因此性能更多地依赖于实现,而不是使用哪种类型的树。树结构的某些组合实际上可能是最快的方法,在这里,b树的每个节点都完全适合于缓存线,并且在每个节点中使用某种二叉树来搜索。管理节点的内存也可能使您能够实现更大的缓存位置,但是代价非常高。

Personally, I just use whatever is in the standard library for the language I am using, since it's a lot of work for a very small performance gain (if any).

就我个人而言,我只是在标准库中使用我正在使用的语言,因为这是一个非常小的性能增益(如果有的话)。

On a theoretical note... RB-trees are actually very similar to B-trees, since they simulate the behavior of 2-3-4 trees. AA-trees are a similar structure, which simulates 2-3 trees instead.

理论的一面……rb -树实际上与b -树非常相似,因为它们模拟了2-3-4树的行为。aa树是一种类似的结构,可以模拟2-3棵树。

#9


0  

moreover ...the height of a red black tree is O(log[2] N) whereas that of B-tree is O(log[q] N) where ceiling[N]<= q <= N . So if we consider comparisons in each key array of B-tree (that is fixed as mentioned above) then time complexity of B-tree <= time complexity of Red-black tree. (equal case for single record equal in size of a block size)

此外……红黑树的高度是O(log[2] N),而B-tree的高度是O(log[q] N),其中的上限[N]<= q <= N。因此,如果我们考虑在b -树的每个键数组中进行比较(如上所述,这是固定的),那么b树的时间复杂度就等于红黑树的时间复杂度。(相同情况下,单项记录大小相等,大小为块大小)