在关系数据库中存储层次数据的选项有哪些?

时间:2022-01-05 17:07:55

Good Overviews

良好的概述

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

一般来说,您需要在快速读时间(例如嵌套集)和快速写时间(邻接列表)之间做出选择。通常情况下,你会得到以下选项的组合,最适合你的需求。以下是一些深入的阅读:

  • One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set and Nested Interval I've found.
  • 还有一个嵌套间隔和邻接列表比较:邻接列表、物化路径、嵌套集和嵌套间隔的最佳比较。
  • Models for hierarchical data: slides with good explanations of tradeoffs and example usage
  • 分层数据的模型:对权衡和示例用法有很好的解释的幻灯片
  • Representing hierarchies in MySQL: very good overview of Nested Set in particular
  • 在MySQL中表示层次结构:特别是嵌套集的很好的概述
  • Hierarchical data in RDBMSs: most comprehensive and well-organized set of links I've seen, but not much in the way of explanation
  • RDBMSs中的分层数据:我见过的最全面和组织良好的链接集,但在解释方式上不是很多

Options

选项

Ones I am aware of and general features:

我知道的和一般的特点:

  1. Adjacency List:
    • Columns: ID, ParentID
    • 列:ID,ParentID
    • Easy to implement.
    • 容易实现。
    • Cheap node moves, inserts, and deletes.
    • 廉价的节点移动、插入和删除。
    • Expensive to find the level, ancestry & descendants, path
    • 寻找等级、祖先和后代、路径的花费
    • Avoid N+1 via Common Table Expressions in databases that support them
    • 避免使用支持N+1的数据库中的公共表表达式
  2. 邻接表:列:ID, ParentID易于实现。廉价的节点移动、插入和删除。查找级别、祖先和后代的成本很昂贵,通过在支持它们的数据库中使用公共表表达式来避免N+1。
  3. Nested Set (a.k.a Modified Preorder Tree Traversal)
    • Columns: Left, Right
    • 列:左、右
    • Cheap ancestry, descendants
    • 廉价的祖先、子孙
    • Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
    • 由于易变编码,非常昂贵的O(n/2)移动、插入和删除
  4. 嵌套组(a.k.。修改后的前序树遍历)列:左,右廉价的祖先,代价非常昂贵的O(n/2)移动,插入,删除由于易变编码
  5. Bridge Table (a.k.a. Closure Table /w triggers)
    • Uses separate join table with: ancestor, descendant, depth (optional)
    • 使用单独的连接表:祖先、后代、深度(可选)
    • Cheap ancestry and descendants
    • 廉价的祖先和后裔
    • Writes costs O(log n) (size of subtree) for insert, updates, deletes
    • 为插入、更新和删除写入成本O(log n)(子树的大小)
    • Normalized encoding: good for RDBMS statistics & query planner in joins
    • 规范化编码:在连接中对RDBMS统计和查询规划器很好。
    • Requires multiple rows per node
    • 每个节点需要多行。
  6. 桥接表(又称闭表/w触发器)使用单独的连接表:祖先、后代、深度(可选)便宜的祖先和后代为插入、更新、删除规范化编码而写代价O(log n)(子树的大小)
  7. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
    • Column: lineage (e.g. /parent/child/grandchild/etc...)
    • 专栏:血统(例如/父母/孩子/孙子/等)
    • Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
    • 通过前缀查询(例如,左(沿袭,#)= '/枚举/路径')
    • Writes costs O(log n) (size of subtree) for insert, updates, deletes
    • 为插入、更新和删除写入成本O(log n)(子树的大小)
    • Non-relational: relies on Array datatype or serialized string format
    • 非关系:依赖数组数据类型或序列化的字符串格式。
  8. 沿袭列(即物化路径、路径枚举)列:沿袭列(例如/parent/child/grandchild/等)通过前缀查询(例如,左(沿袭,#)= '/枚举/路径')编写插入、更新、删除非关系的代价O(日志n)(子树的大小):依赖于数组数据类型或序列化字符串格式
  9. Nested Intervals
    • Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
    • 与嵌套设置一样,但使用实数/浮点数/小数,这样编码就不会不稳定(便宜的移动/插入/删除)
    • Has real/float/decimal representation/precision issues
    • 有真正的/浮动/ /十进制表示精度问题
    • Complex matrix encoding variant adds ancestor encoding (materialized path) for "free"
    • 复杂矩阵编码变体为“*”添加祖先编码(物化路径)
  10. 嵌套间隔,比如嵌套设置,但是使用实数/float/decimal,这样编码就不会不稳定(廉价的移动/插入/删除),具有实数/float/decimal表示/decimal表示/精度问题
  11. Flat Table
    • A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
    • 一种修改过的邻接列表,它向每个记录添加一个级别和等级(例如排序)列。
    • Cheap to iterate/paginate over
    • 廉价的迭代/随意翻阅
    • Expensive move and delete
    • 昂贵的移动和删除
    • Good Use: threaded discussion - forums / blog comments
    • 好的使用:线程化讨论-论坛/博客评论
  12. 扁平化表:修改后的邻接表,为每条记录添加一个级别和等级(例如排序)列。在昂贵的移动和删除很好的使用:线程讨论-论坛/博客评论。
  13. Multiple lineage columns
    • Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
    • 列:每个沿袭级别一个,指到根的所有父级,从项目的级别向下的级别设置为NULL
    • Cheap ancestors, descendants, level
    • 廉价的祖先、子孙、水平
    • Cheap insert, delete, move of the leaves
    • 廉价的插入,删除,移动的叶子
    • Expensive insert, delete, move of the internal nodes
    • 昂贵的插入、删除、移动内部节点
    • Hard limit to how deep the hierarchy can be
    • 很难限制等级制度的深度
  14. 多个家族列列:一个用于每个血统水平,指的是所有的父母到根,水平低于项目的级别设置为NULL便宜的祖先,后代,廉价的插入,删除,移动树叶昂贵的插入、删除、移动硬限制的内部节点层次结构可以多深

Database Specific Notes

数据库特定的音符

MySQL

MySQL

Oracle

甲骨文

  • Use CONNECT BY to traverse Adjacency Lists
  • 使用CONNECT BY来遍历邻接列表

PostgreSQL

PostgreSQL

  • ltree datatype for Materialized Path
  • 物化路径的ltree数据类型

SQL Server

SQL Server

  • General summary
  • 一般的总结
  • 2008 offers HierarchyId data type appears to help with Lineage Column approach and expand the depth that can be represented.
  • 2008提供了层次化数据类型,可以帮助沿袭列方法并扩展可表示的深度。

7 个解决方案

#1


29  

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

我最喜欢的答案就是这篇文章的第一句话。使用邻接列表来维护层次结构,并使用嵌套集来查询层次结构。

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

这个问题直到现在已经从一个Adjacecy coversion方法列表嵌套集一直非常地缓慢,因为大多数人使用极端RBAR方法称为“推栈”的转换和被认为是昂贵的方式达到涅槃的简单维护由邻接表和可怕的嵌套集的性能。因此,大多数人最终不得不满足于其中一个或另一个,尤其是当有超过10万个糟糕的节点时。使用push stack方法可以花费一整天的时间来完成对MLM'ers所认为的一个小的百万节点层次结构的转换。

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

我想我应该给Celko一些竞争,通过提出一种方法,将邻接表转换成嵌套的速度,这看起来是不可能的。下面是我的i5笔记本上的push stack方法的性能。

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

这是新方法的持续时间(括号内是push stack方法)。

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

是的,这是正确的。100万个节点在不到1分钟的时间内转换成10万个节点,在4秒内完成。

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

您可以阅读有关新方法的内容,并获得以下URL中的代码副本。http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

我还使用类似的方法开发了一个“预聚合”层次结构。传销商和制作物料清单的人将对本文特别感兴趣。http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

如果你停下来看看这两篇文章,就跳到“加入讨论”环节,让我知道你的想法。

#2


21  

This is a very partial answer to your question, but I hope still useful.

这是对你问题的一个非常片面的回答,但我希望仍然有用。

Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:

Microsoft SQL Server 2008实现了两个特性,它们对于管理层次数据非常有用:

  • the HierarchyId data type.
  • HierarchyId数据类型。
  • common table expressions, using the with keyword.
  • 常用的表表达式,使用with关键字。

Have a look at "Model Your Data Hierarchies With SQL Server 2008" by Kent Tegels on MSDN for starts. See also my own question: Recursive same-table query in SQL Server 2008

查看一下Kent Tegels在MSDN上的“使用SQL Server 2008建模数据层次结构”。请参见我自己的问题:SQL Server 2008中的递归同表查询

#3


18  

This design was not mentioned yet:

这个设计还没有被提及:

Multiple lineage columns

Though it has limitations, if you can bear them, it's very simple and very efficient. Features:

虽然它有局限性,但如果你能忍受它们,它是非常简单和非常有效的。特点:

  • Columns: one for each lineage level, refers to all the parents up to the root, levels below the current items' level are set to NULL
  • 列:每个沿袭级别一个,指到根的所有父级,当前项目级别以下的级别设置为NULL
  • Limit to how deep the hierarchy can be
  • 限制层级的深度
  • Cheap ancestors, descendants, level
  • 廉价的祖先、子孙、水平
  • Cheap insert, delete, move of the leaves
  • 廉价的插入,删除,移动的叶子
  • Expensive insert, delete, move of the internal nodes
  • 昂贵的插入、删除、移动内部节点

Here follows an example - taxonomic tree of birds so the hierarchy is Class/Order/Family/Genus/Species - species is the lowest level, 1 row = 1 taxon (which corresponds to species in the case of the leaf nodes):

下面是一个例子-鸟类的分类树,所以层次是类/序/科/属/种-物种是最低的层次,1排= 1个分类单元(在叶节点的情况下对应物种):

CREATE TABLE `taxons` (
  `TaxonId` smallint(6) NOT NULL default '0',
  `ClassId` smallint(6) default NULL,
  `OrderId` smallint(6) default NULL,
  `FamilyId` smallint(6) default NULL,
  `GenusId` smallint(6) default NULL,
  `Name` varchar(150) NOT NULL default ''
);

and the example of the data:

数据的例子是

+---------+---------+---------+----------+---------+-------------------------------+
| TaxonId | ClassId | OrderId | FamilyId | GenusId | Name                          |
+---------+---------+---------+----------+---------+-------------------------------+
|     254 |       0 |       0 |        0 |       0 | Aves                          |
|     255 |     254 |       0 |        0 |       0 | Gaviiformes                   |
|     256 |     254 |     255 |        0 |       0 | Gaviidae                      |
|     257 |     254 |     255 |      256 |       0 | Gavia                         |
|     258 |     254 |     255 |      256 |     257 | Gavia stellata                |
|     259 |     254 |     255 |      256 |     257 | Gavia arctica                 |
|     260 |     254 |     255 |      256 |     257 | Gavia immer                   |
|     261 |     254 |     255 |      256 |     257 | Gavia adamsii                 |
|     262 |     254 |       0 |        0 |       0 | Podicipediformes              |
|     263 |     254 |     262 |        0 |       0 | Podicipedidae                 |
|     264 |     254 |     262 |      263 |       0 | Tachybaptus                   |

This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.

这很好,因为只要内部类别不改变它们在树中的级别,就可以用非常简单的方式完成所有所需的操作。

#4


13  

Adjacency Model + Nested Sets Model

I went for it because I could insert new items to the tree easily (you just need a branch's id to insert a new item to it) and also query it quite fast.

我这样做是因为我可以很容易地将新项插入到树中(您只需要一个分支的id来插入一个新项),而且还可以快速地查询它。

+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
  • Every time you need all children of any parent you just query the parent column.
  • 每当您需要任何父类的所有子元素时,只需查询父列。
  • If you needed all descendants of any parent you query for items which have their lft between lft and rgt of parent.
  • 如果您需要任何父类的所有后代,您需要查询在父类的lft和rgt之间具有lft的项。
  • If you needed all parents of any node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt and sort the by parent.
  • 如果需要任何节点的所有父节点(直到树的根),可以查询lft低于节点lft的项和rgt大于节点的rgt的项,并按父节点进行排序。

I needed to make accessing and querying the tree faster than inserts, that's why I chose this

我需要使访问和查询树的速度比插入快,这就是我选择这个的原因

The only problem is to fix the left and right columns when inserting new items. well I created a stored procedure for it and called it every time I inserted a new item which was rare in my case but it is really fast. I got the idea from the Joe Celko's book, and the stored procedure and how I came up with it is explained here in DBA SE https://dba.stackexchange.com/q/89051/41481

唯一的问题是在插入新项时修复左列和右列。我为它创建了一个存储过程每次我插入一个新项目时都会调用它这在我的例子中很少见,但速度很快。我是从Joe Celko的书中得到这个想法的,存储过程以及我是如何想到它的,在DBA SE https://dba.stackexchange.com/q/89051/41481中有解释

#5


10  

If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.

如果您的数据库支持数组,您还可以将沿袭列或物化路径作为父id数组实现。

Specifically with Postgres you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.

特别是使用Postgres,您可以使用set操作符查询层次结构,并使用GIN索引获得出色的性能。这使得在单个查询中查找父类、子类和深度变得非常简单。更新也很容易管理。

I have a full write up of using arrays for materialized paths if you're curious.

如果你好奇的话,我有一大堆关于使用数组来实现路径的资料。

#6


7  

This is really a square peg, round hole question.

这是一个方形的圆孔问题。

If relational databases and SQL are the only hammer you have or are willing to use, then the answers that have been posted thus far are adequate. However, why not use a tool designed to handle hierarchical data? Graph database are ideal for complex hierarchical data.

如果关系数据库和SQL是您拥有或愿意使用的惟一工具,那么迄今为止已经发布的答案就足够了。但是,为什么不使用一个用来处理分层数据的工具呢?图形数据库是复杂层次数据的理想选择。

The inefficiencies of the relational model along with the complexities of any code/query solution to map a graph/hierarchical model onto a relational model is just not worth the effort when compared to the ease with which a graph database solution can solve the same problem.

关系模型的低效性,以及将图/层次模型映射到关系模型的任何代码/查询解决方案的复杂性,与图数据库解决方案解决相同问题的易用性相比,都不值得付出努力。

Consider a Bill of Materials as a common hierarchical data structure.

考虑一份材料清单作为一个通用的分层数据结构。

class Component extends Vertex {
    long assetId;
    long partNumber;
    long material;
    long amount;
};

class PartOf extends Edge {
};

class AdjacentTo extends Edge {
};

Shortest path between two sub-assemblies: Simple graph traversal algorithm. Acceptable paths can be qualified based on criteria.

两个子程序集之间的最短路径:简单的图遍历算法。可接受路径可以根据标准进行限定。

Similarity: What is the degree of similarity between two assemblies? Perform a traversal on both sub-trees computing the intersection and union of the two sub-trees. The percent similar is the intersection divided by the union.

相似度:两个程序集之间的相似度是多少?在两个子树上执行遍历,计算两个子树的交集和联合。相似的百分比是交集除以单位。

Transitive Closure: Walk the sub-tree and sum up the field(s) of interest, e.g. "How much aluminum is in a sub-assembly?"

传递闭包:遍历子树并总结感兴趣的字段,例如。“一个子装配有多少铝?”

Yes, you can solve the problem with SQL and a relational database. However, there are much better approaches if you are willing to use the right tool for the job.

是的,您可以使用SQL和关系数据库解决这个问题。然而,如果你愿意在工作中使用合适的工具,有更好的方法。

#7


4  

I am using PostgreSQL with closure tables for my hierarchies. I have one universal stored procedure for the whole database:

我使用PostgreSQL和我的层次结构的闭包表。对于整个数据库,我有一个通用存储过程:

CREATE FUNCTION nomen_tree() RETURNS trigger
    LANGUAGE plpgsql
    AS $_$
DECLARE
  old_parent INTEGER;
  new_parent INTEGER;
  id_nom INTEGER;
  txt_name TEXT;
BEGIN
-- TG_ARGV[0] = name of table with entities with PARENT-CHILD relationships (TBL_ORIG)
-- TG_ARGV[1] = name of helper table with ANCESTOR, CHILD, DEPTH information (TBL_TREE)
-- TG_ARGV[2] = name of the field in TBL_ORIG which is used for the PARENT-CHILD relationship (FLD_PARENT)
    IF TG_OP = 'INSERT' THEN
    EXECUTE 'INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT $1.id,$1.id,0 UNION ALL
      SELECT $1.id,ancestor_id,depth+1 FROM ' || TG_ARGV[1] || ' WHERE child_id=$1.' || TG_ARGV[2] USING NEW;
    ELSE                                                           
    -- EXECUTE does not support conditional statements inside
    EXECUTE 'SELECT $1.' || TG_ARGV[2] || ',$2.' || TG_ARGV[2] INTO old_parent,new_parent USING OLD,NEW;
    IF COALESCE(old_parent,0) <> COALESCE(new_parent,0) THEN
      EXECUTE '
      -- prevent cycles in the tree
      UPDATE ' || TG_ARGV[0] || ' SET ' || TG_ARGV[2] || ' = $1.' || TG_ARGV[2]
        || ' WHERE id=$2.' || TG_ARGV[2] || ' AND EXISTS(SELECT 1 FROM '
        || TG_ARGV[1] || ' WHERE child_id=$2.' || TG_ARGV[2] || ' AND ancestor_id=$2.id);
      -- first remove edges between all old parents of node and its descendants
      DELETE FROM ' || TG_ARGV[1] || ' WHERE child_id IN
        (SELECT child_id FROM ' || TG_ARGV[1] || ' WHERE ancestor_id = $1.id)
        AND ancestor_id IN
        (SELECT ancestor_id FROM ' || TG_ARGV[1] || ' WHERE child_id = $1.id AND ancestor_id <> $1.id);
      -- then add edges for all new parents ...
      INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT child_id,ancestor_id,d_c+d_a FROM
        (SELECT child_id,depth AS d_c FROM ' || TG_ARGV[1] || ' WHERE ancestor_id=$2.id) AS child
        CROSS JOIN
        (SELECT ancestor_id,depth+1 AS d_a FROM ' || TG_ARGV[1] || ' WHERE child_id=$2.' 
        || TG_ARGV[2] || ') AS parent;' USING OLD, NEW;
    END IF;
  END IF;
  RETURN NULL;
END;
$_$;

Then for each table where I have a hierarchy, I create a trigger

对于每个有层次结构的表,我创建一个触发器。

CREATE TRIGGER nomenclature_tree_tr AFTER INSERT OR UPDATE ON nomenclature FOR EACH ROW EXECUTE PROCEDURE nomen_tree('my_db.nomenclature', 'my_db.nom_helper', 'parent_id');

For populating a closure table from existing hierarchy I use this stored procedure:

对于从现有层次结构中填充闭包表,我使用以下存储过程:

CREATE FUNCTION rebuild_tree(tbl_base text, tbl_closure text, fld_parent text) RETURNS void
    LANGUAGE plpgsql
    AS $$
BEGIN
    EXECUTE 'TRUNCATE ' || tbl_closure || ';
    INSERT INTO ' || tbl_closure || ' (child_id,ancestor_id,depth) 
        WITH RECURSIVE tree AS
      (
        SELECT id AS child_id,id AS ancestor_id,0 AS depth FROM ' || tbl_base || '
        UNION ALL 
        SELECT t.id,ancestor_id,depth+1 FROM ' || tbl_base || ' AS t
        JOIN tree ON child_id = ' || fld_parent || '
      )
      SELECT * FROM tree;';
END;
$$;

Closure tables are defined with 3 columns - ANCESTOR_ID, DESCENDANT_ID, DEPTH. It is possible (and I even advice) to store records with same value for ANCESTOR and DESCENDANT, and a value of zero for DEPTH. This will simplify the queries for retrieval of the hierarchy. And they are very simple indeed:

闭包表由3列定义——ANCESTOR_ID、descent dant_id、深度。可以(甚至建议)为祖先和后代存储值相同的记录,为深度存储值为0。这将简化检索层次结构的查询。它们非常简单:

-- get all descendants
SELECT tbl_orig.*,depth FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth <> 0;
-- get only direct descendants
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth = 1;
-- get all ancestors
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON ancestor_id = tbl_orig.id WHERE descendant_id = XXX AND depth <> 0;
-- find the deepest level of children
SELECT MAX(depth) FROM tbl_closure WHERE ancestor_id = XXX;

#1


29  

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

我最喜欢的答案就是这篇文章的第一句话。使用邻接列表来维护层次结构,并使用嵌套集来查询层次结构。

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

这个问题直到现在已经从一个Adjacecy coversion方法列表嵌套集一直非常地缓慢,因为大多数人使用极端RBAR方法称为“推栈”的转换和被认为是昂贵的方式达到涅槃的简单维护由邻接表和可怕的嵌套集的性能。因此,大多数人最终不得不满足于其中一个或另一个,尤其是当有超过10万个糟糕的节点时。使用push stack方法可以花费一整天的时间来完成对MLM'ers所认为的一个小的百万节点层次结构的转换。

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

我想我应该给Celko一些竞争,通过提出一种方法,将邻接表转换成嵌套的速度,这看起来是不可能的。下面是我的i5笔记本上的push stack方法的性能。

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

这是新方法的持续时间(括号内是push stack方法)。

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

是的,这是正确的。100万个节点在不到1分钟的时间内转换成10万个节点,在4秒内完成。

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

您可以阅读有关新方法的内容,并获得以下URL中的代码副本。http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

我还使用类似的方法开发了一个“预聚合”层次结构。传销商和制作物料清单的人将对本文特别感兴趣。http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.

如果你停下来看看这两篇文章,就跳到“加入讨论”环节,让我知道你的想法。

#2


21  

This is a very partial answer to your question, but I hope still useful.

这是对你问题的一个非常片面的回答,但我希望仍然有用。

Microsoft SQL Server 2008 implements two features that are extremely useful for managing hierarchical data:

Microsoft SQL Server 2008实现了两个特性,它们对于管理层次数据非常有用:

  • the HierarchyId data type.
  • HierarchyId数据类型。
  • common table expressions, using the with keyword.
  • 常用的表表达式,使用with关键字。

Have a look at "Model Your Data Hierarchies With SQL Server 2008" by Kent Tegels on MSDN for starts. See also my own question: Recursive same-table query in SQL Server 2008

查看一下Kent Tegels在MSDN上的“使用SQL Server 2008建模数据层次结构”。请参见我自己的问题:SQL Server 2008中的递归同表查询

#3


18  

This design was not mentioned yet:

这个设计还没有被提及:

Multiple lineage columns

Though it has limitations, if you can bear them, it's very simple and very efficient. Features:

虽然它有局限性,但如果你能忍受它们,它是非常简单和非常有效的。特点:

  • Columns: one for each lineage level, refers to all the parents up to the root, levels below the current items' level are set to NULL
  • 列:每个沿袭级别一个,指到根的所有父级,当前项目级别以下的级别设置为NULL
  • Limit to how deep the hierarchy can be
  • 限制层级的深度
  • Cheap ancestors, descendants, level
  • 廉价的祖先、子孙、水平
  • Cheap insert, delete, move of the leaves
  • 廉价的插入,删除,移动的叶子
  • Expensive insert, delete, move of the internal nodes
  • 昂贵的插入、删除、移动内部节点

Here follows an example - taxonomic tree of birds so the hierarchy is Class/Order/Family/Genus/Species - species is the lowest level, 1 row = 1 taxon (which corresponds to species in the case of the leaf nodes):

下面是一个例子-鸟类的分类树,所以层次是类/序/科/属/种-物种是最低的层次,1排= 1个分类单元(在叶节点的情况下对应物种):

CREATE TABLE `taxons` (
  `TaxonId` smallint(6) NOT NULL default '0',
  `ClassId` smallint(6) default NULL,
  `OrderId` smallint(6) default NULL,
  `FamilyId` smallint(6) default NULL,
  `GenusId` smallint(6) default NULL,
  `Name` varchar(150) NOT NULL default ''
);

and the example of the data:

数据的例子是

+---------+---------+---------+----------+---------+-------------------------------+
| TaxonId | ClassId | OrderId | FamilyId | GenusId | Name                          |
+---------+---------+---------+----------+---------+-------------------------------+
|     254 |       0 |       0 |        0 |       0 | Aves                          |
|     255 |     254 |       0 |        0 |       0 | Gaviiformes                   |
|     256 |     254 |     255 |        0 |       0 | Gaviidae                      |
|     257 |     254 |     255 |      256 |       0 | Gavia                         |
|     258 |     254 |     255 |      256 |     257 | Gavia stellata                |
|     259 |     254 |     255 |      256 |     257 | Gavia arctica                 |
|     260 |     254 |     255 |      256 |     257 | Gavia immer                   |
|     261 |     254 |     255 |      256 |     257 | Gavia adamsii                 |
|     262 |     254 |       0 |        0 |       0 | Podicipediformes              |
|     263 |     254 |     262 |        0 |       0 | Podicipedidae                 |
|     264 |     254 |     262 |      263 |       0 | Tachybaptus                   |

This is great because this way you accomplish all the needed operations in a very easy way, as long as the internal categories don't change their level in the tree.

这很好,因为只要内部类别不改变它们在树中的级别,就可以用非常简单的方式完成所有所需的操作。

#4


13  

Adjacency Model + Nested Sets Model

I went for it because I could insert new items to the tree easily (you just need a branch's id to insert a new item to it) and also query it quite fast.

我这样做是因为我可以很容易地将新项插入到树中(您只需要一个分支的id来插入一个新项),而且还可以快速地查询它。

+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
  • Every time you need all children of any parent you just query the parent column.
  • 每当您需要任何父类的所有子元素时,只需查询父列。
  • If you needed all descendants of any parent you query for items which have their lft between lft and rgt of parent.
  • 如果您需要任何父类的所有后代,您需要查询在父类的lft和rgt之间具有lft的项。
  • If you needed all parents of any node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt and sort the by parent.
  • 如果需要任何节点的所有父节点(直到树的根),可以查询lft低于节点lft的项和rgt大于节点的rgt的项,并按父节点进行排序。

I needed to make accessing and querying the tree faster than inserts, that's why I chose this

我需要使访问和查询树的速度比插入快,这就是我选择这个的原因

The only problem is to fix the left and right columns when inserting new items. well I created a stored procedure for it and called it every time I inserted a new item which was rare in my case but it is really fast. I got the idea from the Joe Celko's book, and the stored procedure and how I came up with it is explained here in DBA SE https://dba.stackexchange.com/q/89051/41481

唯一的问题是在插入新项时修复左列和右列。我为它创建了一个存储过程每次我插入一个新项目时都会调用它这在我的例子中很少见,但速度很快。我是从Joe Celko的书中得到这个想法的,存储过程以及我是如何想到它的,在DBA SE https://dba.stackexchange.com/q/89051/41481中有解释

#5


10  

If your database supports arrays, you can also implement a lineage column or materialized path as an array of parent ids.

如果您的数据库支持数组,您还可以将沿袭列或物化路径作为父id数组实现。

Specifically with Postgres you can then use the set operators to query the hierarchy, and get excellent performance with GIN indices. This makes finding parents, children, and depth pretty trivial in a single query. Updates are pretty manageable as well.

特别是使用Postgres,您可以使用set操作符查询层次结构,并使用GIN索引获得出色的性能。这使得在单个查询中查找父类、子类和深度变得非常简单。更新也很容易管理。

I have a full write up of using arrays for materialized paths if you're curious.

如果你好奇的话,我有一大堆关于使用数组来实现路径的资料。

#6


7  

This is really a square peg, round hole question.

这是一个方形的圆孔问题。

If relational databases and SQL are the only hammer you have or are willing to use, then the answers that have been posted thus far are adequate. However, why not use a tool designed to handle hierarchical data? Graph database are ideal for complex hierarchical data.

如果关系数据库和SQL是您拥有或愿意使用的惟一工具,那么迄今为止已经发布的答案就足够了。但是,为什么不使用一个用来处理分层数据的工具呢?图形数据库是复杂层次数据的理想选择。

The inefficiencies of the relational model along with the complexities of any code/query solution to map a graph/hierarchical model onto a relational model is just not worth the effort when compared to the ease with which a graph database solution can solve the same problem.

关系模型的低效性,以及将图/层次模型映射到关系模型的任何代码/查询解决方案的复杂性,与图数据库解决方案解决相同问题的易用性相比,都不值得付出努力。

Consider a Bill of Materials as a common hierarchical data structure.

考虑一份材料清单作为一个通用的分层数据结构。

class Component extends Vertex {
    long assetId;
    long partNumber;
    long material;
    long amount;
};

class PartOf extends Edge {
};

class AdjacentTo extends Edge {
};

Shortest path between two sub-assemblies: Simple graph traversal algorithm. Acceptable paths can be qualified based on criteria.

两个子程序集之间的最短路径:简单的图遍历算法。可接受路径可以根据标准进行限定。

Similarity: What is the degree of similarity between two assemblies? Perform a traversal on both sub-trees computing the intersection and union of the two sub-trees. The percent similar is the intersection divided by the union.

相似度:两个程序集之间的相似度是多少?在两个子树上执行遍历,计算两个子树的交集和联合。相似的百分比是交集除以单位。

Transitive Closure: Walk the sub-tree and sum up the field(s) of interest, e.g. "How much aluminum is in a sub-assembly?"

传递闭包:遍历子树并总结感兴趣的字段,例如。“一个子装配有多少铝?”

Yes, you can solve the problem with SQL and a relational database. However, there are much better approaches if you are willing to use the right tool for the job.

是的,您可以使用SQL和关系数据库解决这个问题。然而,如果你愿意在工作中使用合适的工具,有更好的方法。

#7


4  

I am using PostgreSQL with closure tables for my hierarchies. I have one universal stored procedure for the whole database:

我使用PostgreSQL和我的层次结构的闭包表。对于整个数据库,我有一个通用存储过程:

CREATE FUNCTION nomen_tree() RETURNS trigger
    LANGUAGE plpgsql
    AS $_$
DECLARE
  old_parent INTEGER;
  new_parent INTEGER;
  id_nom INTEGER;
  txt_name TEXT;
BEGIN
-- TG_ARGV[0] = name of table with entities with PARENT-CHILD relationships (TBL_ORIG)
-- TG_ARGV[1] = name of helper table with ANCESTOR, CHILD, DEPTH information (TBL_TREE)
-- TG_ARGV[2] = name of the field in TBL_ORIG which is used for the PARENT-CHILD relationship (FLD_PARENT)
    IF TG_OP = 'INSERT' THEN
    EXECUTE 'INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT $1.id,$1.id,0 UNION ALL
      SELECT $1.id,ancestor_id,depth+1 FROM ' || TG_ARGV[1] || ' WHERE child_id=$1.' || TG_ARGV[2] USING NEW;
    ELSE                                                           
    -- EXECUTE does not support conditional statements inside
    EXECUTE 'SELECT $1.' || TG_ARGV[2] || ',$2.' || TG_ARGV[2] INTO old_parent,new_parent USING OLD,NEW;
    IF COALESCE(old_parent,0) <> COALESCE(new_parent,0) THEN
      EXECUTE '
      -- prevent cycles in the tree
      UPDATE ' || TG_ARGV[0] || ' SET ' || TG_ARGV[2] || ' = $1.' || TG_ARGV[2]
        || ' WHERE id=$2.' || TG_ARGV[2] || ' AND EXISTS(SELECT 1 FROM '
        || TG_ARGV[1] || ' WHERE child_id=$2.' || TG_ARGV[2] || ' AND ancestor_id=$2.id);
      -- first remove edges between all old parents of node and its descendants
      DELETE FROM ' || TG_ARGV[1] || ' WHERE child_id IN
        (SELECT child_id FROM ' || TG_ARGV[1] || ' WHERE ancestor_id = $1.id)
        AND ancestor_id IN
        (SELECT ancestor_id FROM ' || TG_ARGV[1] || ' WHERE child_id = $1.id AND ancestor_id <> $1.id);
      -- then add edges for all new parents ...
      INSERT INTO ' || TG_ARGV[1] || ' (child_id,ancestor_id,depth) 
        SELECT child_id,ancestor_id,d_c+d_a FROM
        (SELECT child_id,depth AS d_c FROM ' || TG_ARGV[1] || ' WHERE ancestor_id=$2.id) AS child
        CROSS JOIN
        (SELECT ancestor_id,depth+1 AS d_a FROM ' || TG_ARGV[1] || ' WHERE child_id=$2.' 
        || TG_ARGV[2] || ') AS parent;' USING OLD, NEW;
    END IF;
  END IF;
  RETURN NULL;
END;
$_$;

Then for each table where I have a hierarchy, I create a trigger

对于每个有层次结构的表,我创建一个触发器。

CREATE TRIGGER nomenclature_tree_tr AFTER INSERT OR UPDATE ON nomenclature FOR EACH ROW EXECUTE PROCEDURE nomen_tree('my_db.nomenclature', 'my_db.nom_helper', 'parent_id');

For populating a closure table from existing hierarchy I use this stored procedure:

对于从现有层次结构中填充闭包表,我使用以下存储过程:

CREATE FUNCTION rebuild_tree(tbl_base text, tbl_closure text, fld_parent text) RETURNS void
    LANGUAGE plpgsql
    AS $$
BEGIN
    EXECUTE 'TRUNCATE ' || tbl_closure || ';
    INSERT INTO ' || tbl_closure || ' (child_id,ancestor_id,depth) 
        WITH RECURSIVE tree AS
      (
        SELECT id AS child_id,id AS ancestor_id,0 AS depth FROM ' || tbl_base || '
        UNION ALL 
        SELECT t.id,ancestor_id,depth+1 FROM ' || tbl_base || ' AS t
        JOIN tree ON child_id = ' || fld_parent || '
      )
      SELECT * FROM tree;';
END;
$$;

Closure tables are defined with 3 columns - ANCESTOR_ID, DESCENDANT_ID, DEPTH. It is possible (and I even advice) to store records with same value for ANCESTOR and DESCENDANT, and a value of zero for DEPTH. This will simplify the queries for retrieval of the hierarchy. And they are very simple indeed:

闭包表由3列定义——ANCESTOR_ID、descent dant_id、深度。可以(甚至建议)为祖先和后代存储值相同的记录,为深度存储值为0。这将简化检索层次结构的查询。它们非常简单:

-- get all descendants
SELECT tbl_orig.*,depth FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth <> 0;
-- get only direct descendants
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON descendant_id = tbl_orig.id WHERE ancestor_id = XXX AND depth = 1;
-- get all ancestors
SELECT tbl_orig.* FROM tbl_closure LEFT JOIN tbl_orig ON ancestor_id = tbl_orig.id WHERE descendant_id = XXX AND depth <> 0;
-- find the deepest level of children
SELECT MAX(depth) FROM tbl_closure WHERE ancestor_id = XXX;