Fast relational method of storing tree data (for instance, threaded comments on articles)

I have a CMS which stores comments against articles. These comments can be either threaded or non-threaded, although technically they are stored the same way, with the reply column simply left blank when a comment is not threaded. My application runs on SQLite, MySQL and PostgreSQL, so I need fairly standard SQL.

I currently have a comment table:

comment_id
article_id
user_id
comment
timestamp
thread (this is the reply column)

My question is to figure out how to best represent the threaded comments in the database. Perhaps in a separate table that supports the tree set without the content and a simple table to hold the text? Perhaps in the way it already is? Perhaps another way?

If the comments are un-threaded I can easily just order by the timestamp.

If they are threaded, I sort like this:

ORDER BY SUBSTRING(c.thread, 1, (LENGTH(c.thread) - 1))

As you can see from the ORDER BY, the comment queries will never use an index, since function-based indexes only really live in Oracle. Help me have lightning-fast comment pages.

6 Answers

#1

I really like how Drupal solves this problem. It assigns a thread id to each comment. This id starts at 1 for the first comment. If a reply is added to this comment, the id 1.1 is assigned to it. A reply to comment 1.1 is given the thread id 1.1.1. A sibling of comment 1.1 is given the thread id 1.2. You get the idea. Calculating these thread ids can be done easily with one query when a comment is added.

When the thread is rendered, all of the comments that belong to the thread are fetched in a single query, sorted by the thread id. This gives you the threads in the ascending order. Furthermore, using the thread id, you can find the nesting level of each comment, and indent it accordingly.

1
1.1
1.1.1
1.2
1.2.1

There are a few issues to sort out:

If one component of the thread id grows to 2 digits, sorting by thread id will not produce the expected order. An easy solution is ensuring that all components of a thread id are padded by zeros to have the same width.
Sorting by descending thread id does not produce the expected descending order.

Drupal solves the first issue in a more complicated way using a numbering system called vancode. As for the second issue, it is solved by appending a slash ("/", whose ASCII code is higher than that of the "." separator) to thread ids when sorting in descending order. You can find more details about this implementation by checking the source code of the comments module (see the big comment before the function comment_get_thread).
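
As a rough illustration of the zero-padding approach (not Drupal's vancode), here is a minimal sketch using Python's sqlite3 module. The table layout, the `add_comment` and `render` helpers, and the fixed `WIDTH` of 4 digits per component are all assumptions made for this example; a component would overflow after 9999 siblings.

```python
import sqlite3

WIDTH = 4  # assumed fixed width of each path component

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        article_id INTEGER NOT NULL,
        body       TEXT NOT NULL,
        thread     TEXT NOT NULL
    )
""")
# A plain index on (article_id, thread) can serve the ORDER BY below.
conn.execute("CREATE INDEX comment_thread_idx ON comment (article_id, thread)")


def add_comment(article_id, body, parent_id=None):
    """Insert a comment; its thread is the parent's path plus a padded counter."""
    if parent_id is None:
        prefix = ""
    else:
        (parent_thread,) = conn.execute(
            "SELECT thread FROM comment WHERE comment_id = ?", (parent_id,)
        ).fetchone()
        prefix = parent_thread + "."
    # Direct children are exactly one component longer than the prefix.
    (last,) = conn.execute(
        "SELECT MAX(thread) FROM comment "
        "WHERE article_id = ? AND thread LIKE ? AND LENGTH(thread) = ?",
        (article_id, prefix + "%", len(prefix) + WIDTH),
    ).fetchone()
    next_no = 1 if last is None else int(last[len(prefix):]) + 1
    thread = prefix + str(next_no).zfill(WIDTH)
    cur = conn.execute(
        "INSERT INTO comment (article_id, body, thread) VALUES (?, ?, ?)",
        (article_id, body, thread),
    )
    return cur.lastrowid


def render(article_id):
    """Return (nesting level, body) pairs in threaded display order."""
    rows = conn.execute(
        "SELECT thread, body FROM comment WHERE article_id = ? ORDER BY thread",
        (article_id,),
    )
    return [(thread.count("."), body) for thread, body in rows]


first = add_comment(1, "first")
reply = add_comment(1, "reply to first", first)
add_comment(1, "nested reply", reply)
add_comment(1, "second reply to first", first)
add_comment(1, "second top-level comment")
for depth, body in render(1):
    print("    " * depth + body)
```

Because the sort key is a stored column rather than an expression, the ascending read needs no function in the ORDER BY.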

#2

I know the answer is a bit late, but for tree data use a closure table: http://www.slideshare.net/billkarwin/models-for-hierarchical-data

It describes 4 methods:

Adjacency list (the simple parent foreign key)
Path enumeration (the Drupal strategy mentioned in the accepted answer)
Nested sets
Closure table (storing ancestor/descendant facts in a separate relation [table], with a possible distance column)

The last option has the advantage of easier CRUD operations compared to the rest. The cost is space, which is O(n^2) in the number of tree nodes in the worst case, but probably not so bad in practice.
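
To make the closure table idea concrete, here is a minimal sketch using Python's sqlite3 module. The `comment` and `comment_path` tables and the `add_comment` and `subtree` helpers are names invented for this example, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        body       TEXT NOT NULL
    );
    -- One row per (ancestor, descendant) pair, including each node with itself.
    CREATE TABLE comment_path (
        ancestor   INTEGER NOT NULL REFERENCES comment (comment_id),
        descendant INTEGER NOT NULL REFERENCES comment (comment_id),
        distance   INTEGER NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );
""")


def add_comment(body, parent_id=None):
    """Insert a comment and link it to itself and to every ancestor."""
    cur = conn.execute("INSERT INTO comment (body) VALUES (?)", (body,))
    new_id = cur.lastrowid
    conn.execute("INSERT INTO comment_path VALUES (?, ?, 0)", (new_id, new_id))
    if parent_id is not None:
        # Copy the parent's ancestor rows, one step further away.
        conn.execute(
            "INSERT INTO comment_path (ancestor, descendant, distance) "
            "SELECT ancestor, ?, distance + 1 "
            "FROM comment_path WHERE descendant = ?",
            (new_id, parent_id),
        )
    return new_id


def subtree(root_id):
    """Return (body, distance) for the root and all of its descendants."""
    rows = conn.execute(
        "SELECT c.body, p.distance FROM comment_path p "
        "JOIN comment c ON c.comment_id = p.descendant "
        "WHERE p.ancestor = ? ORDER BY p.distance, c.comment_id",
        (root_id,),
    )
    return rows.fetchall()


root = add_comment("root")
child = add_comment("child", root)
add_comment("grandchild", child)
add_comment("other child", root)
print(subtree(root))
```

Note that this query returns the subtree level by level; producing a threaded display order would still need an extra sort key or some work on the client.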

#3

Unfortunately, the pure SQL methods to do it are quite slow.

The NESTED SETS proposed by @Marc W are quite elegant but they may require updating the whole tree if your tree branches hit the ranges, which can be quite slow.

See this article in my blog on how to do it fast in MySQL:

Hierarchical queries in MySQL - emulating Oracle's CONNECT BY

You'll need to create a function:

CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT
NOT DETERMINISTIC
READS SQL DATA
BEGIN
        DECLARE _id INT;
        DECLARE _parent INT;
        DECLARE _next INT;
        DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL;

        SET _parent = @id;
        SET _id = -1;

        IF @id IS NULL THEN
                RETURN NULL;
        END IF;

        LOOP
                SELECT  MIN(id)
                INTO    @id
                FROM    t_hierarchy
                WHERE   parent = _parent
                        AND id > _id;
                IF @id IS NOT NULL OR _parent = @start_with THEN
                        SET @level = @level + 1;
                        RETURN @id;
                END IF;
                SET @level := @level - 1;
                SELECT  id, parent
                INTO    _id, _parent
                FROM    t_hierarchy
                WHERE   id = _parent;
        END LOOP;
END

and use it in a query like this:

SELECT  hi.*
FROM    (
        SELECT  hierarchy_connect_by_parent_eq_prior_id(id) AS id, @level AS level
        FROM    (
                SELECT  @start_with := 0,
                        @id := @start_with,
                        @level := 0
                ) vars, t_hierarchy
        WHERE   @id IS NOT NULL
        ) ho
JOIN    t_hierarchy hi
ON      hi.id = ho.id

This is of course MySQL specific but it's real fast.

If you want this to be portable between PostgreSQL and MySQL, you can use PostgreSQL's contrib for CONNECT BY and wrap the query into a stored procedure with the same name on both systems.

#4

I just did this myself, actually! I used the nested set model of representing hierarchical data in a relational database.

Managing Hierarchical Data in MySQL was pure gold for me. Nested sets are the second model described in that article.

#5

You've got a choice between the adjacency and the nested set models. The article Managing Hierarchical Data in MySQL makes for a nice introduction.

For a theoretical discussion, see Celko's Trees and Hierarchies.

It's rather easy to implement a threaded list if your database supports recursive common table expressions. All you need is a recursive reference in your target database table, such as:

create table Tablename (
  RecordID integer not null auto_increment primary key,
  ParentID integer default null references Tablename (RecordID),
  ...
)

You can then use a recursive Common Table Expression to display a threaded view. An example is available here.
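
Since the linked example is not reproduced here, the following is a minimal sketch of such a query, run through Python's sqlite3 module. The table and column names and the zero-padded `sort_key` are assumptions for illustration, and `printf` is SQLite-specific (other databases would use their own padding function, such as LPAD).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        parent_id  INTEGER REFERENCES comment (comment_id),
        body       TEXT NOT NULL
    );
    INSERT INTO comment (comment_id, parent_id, body) VALUES
        (1, NULL, 'first'),
        (2, 1,    'reply to first'),
        (3, NULL, 'second'),
        (4, 2,    'nested reply'),
        (5, 1,    'another reply to first');
""")

# Walk the tree from the top-level comments down, building a padded
# path of ids so that a plain ORDER BY yields threaded display order.
THREADED = """
WITH RECURSIVE tree (comment_id, body, depth, sort_key) AS (
    SELECT comment_id, body, 0, printf('%08d', comment_id)
    FROM comment
    WHERE parent_id IS NULL
    UNION ALL
    SELECT c.comment_id, c.body, t.depth + 1,
           t.sort_key || '.' || printf('%08d', c.comment_id)
    FROM comment c
    JOIN tree t ON c.parent_id = t.comment_id
)
SELECT depth, body FROM tree ORDER BY sort_key
"""

threaded = conn.execute(THREADED).fetchall()
for depth, body in threaded:
    print("    " * depth + body)
```

The table itself stays a plain adjacency list, so inserts remain a single row; the cost moves to read time, where the path is recomputed on every query.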

#6

Actually, it has to be a balance between read and write.

If you are OK with updating a bunch of rows on every insert, then nested set (or an equivalent) will give you easy, fast reads.

Other than that, a simple FK on the parent will give you ultra-simple insert, but might well be a nightmare for retrieval.

I think I'd go with the nested sets, but be careful about the expected data volume and usage patterns (updating several, maybe a lot of, rows on two indexed columns (for left and right info) for every insert might be a problem at some point).
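
To show where the write cost comes from, here is a minimal nested set sketch using Python's sqlite3 module. The `lft` and `rgt` column names and the helper functions are assumptions for this example. Each reply shifts every boundary to the right of its parent, which is the multi-row update mentioned above, while the read is a single ordered scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        body       TEXT NOT NULL,
        lft        INTEGER NOT NULL,
        rgt        INTEGER NOT NULL
    )
""")


def add_comment(body, parent_id=None):
    """Insert a comment as the last child of parent_id (or as a new root)."""
    with conn:  # one transaction per insert
        if parent_id is None:
            (max_rgt,) = conn.execute(
                "SELECT COALESCE(MAX(rgt), 0) FROM comment"
            ).fetchone()
            lft = max_rgt + 1
        else:
            (parent_rgt,) = conn.execute(
                "SELECT rgt FROM comment WHERE comment_id = ?", (parent_id,)
            ).fetchone()
            # Open a gap of width 2 at the parent's right edge.
            conn.execute(
                "UPDATE comment SET rgt = rgt + 2 WHERE rgt >= ?", (parent_rgt,)
            )
            conn.execute(
                "UPDATE comment SET lft = lft + 2 WHERE lft > ?", (parent_rgt,)
            )
            lft = parent_rgt
        cur = conn.execute(
            "INSERT INTO comment (body, lft, rgt) VALUES (?, ?, ?)",
            (body, lft, lft + 1),
        )
        return cur.lastrowid


def render():
    """Return (depth, body) in threaded order; depth counts enclosing nodes."""
    rows = conn.execute("""
        SELECT (SELECT COUNT(*) FROM comment a
                WHERE a.lft < c.lft AND a.rgt > c.rgt) AS depth,
               c.body
        FROM comment c
        ORDER BY c.lft
    """)
    return rows.fetchall()


first = add_comment("first")
reply = add_comment("reply to first", first)
add_comment("nested reply", reply)
add_comment("second reply to first", first)
add_comment("second top-level comment")
for depth, body in render():
    print("    " * depth + body)
```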
