超过一列的b树索引是什么样子的?

时间:2022-02-13 04:18:33

So I was reading up on indexes and their implementation, and I stumbled upon this website that has a brief explanation of b-tree indexes:

所以当我阅读索引和它们的实现时,我偶然发现了这个对b-tree索引有一个简短解释的网站:

http://20bits.com/articles/interview-questions-database-indexes/

http://20bits.com/articles/interview-questions-database-indexes/

The b-tree index makes perfect sense for indexes that are only on a single column, but let's say I create an index with multiple columns, how then does the b-tree work? What is the value of each node in the b-tree?

对于只在一列上的索引来说,b-tree索引非常有意义,但是假设我创建了一个包含多个列的索引,那么b-tree是如何工作的呢?b树中每个节点的值是多少?

For example, if I have this table:

例如,如果我有这个表格:

table customer:
id    number
name   varchar
phone_number   varchar
city   varchar

and I create an index on: (id, name, city)

我在:(id, name, city)上创建一个索引

and then run the following query:

然后运行以下查询:

SELECT id, name 
  FROM customer
 WHERE city = 'My City';

how does this query utilize the multiple column index, or does it not utilize it unless the index is created as (city, id, name) or (city, name, id) instead?

这个查询如何使用多列索引,或者除非将索引创建为(city, id, name)或(city, name, id),否则它不会使用它?

6 个解决方案

#1


9  

Imagine that the key is represented by a Python tuple (col1, col2, col3) ... the indexing operation involves comparing tuple_a with tuple_b ... if you have don't know which value of col1 and col2 that you are interested in, but only col3, then it would have to read the whole index ("full index scan"), which is not as efficient.

假设键由一个Python tuple表示(col1、col2、col3)…索引操作涉及到比较tuple_a和tuple_b…如果您不知道您感兴趣的col1和col2的哪个值,但是只知道col3,那么它就必须读取整个索引(“全索引扫描”),这并不是很有效。

If you have an index on (col1, col2, col3), then you can expect that any RDBMS will use the index (in a direct manner) when the WHERE clause contains reference to (1) all 3 columns (2) both col1 and col2 (3) only col1.

如果您在(col1、col2、col3)上有一个索引,那么当WHERE子句包含对(1)所有3列(2)的引用(col1)和col2(3)都只包含col1时,您可以预期任何RDBMS将(以直接方式)使用该索引。

Otherwise (e.g. only col3 in the WHERE clause), either the RDBMS will not use that index at all (e.g. SQLite), or will do a full index scan (e.g. Oracle) [if no other index is better].

否则(例如,在WHERE子句中只使用col3), RDBMS根本不会使用该索引(例如SQLite),或者进行完整的索引扫描(例如Oracle)[如果没有更好的索引的话]。

In your specific example, presuming that id is a unique identifier of a customer, it is pointless to have it appear in an index (other than the index that your DBMS should set up for a primary key or column noted as UNIQUE).

在您的特定示例中,假设id是客户的唯一标识符,那么将它显示在索引中是没有意义的(除了DBMS应该为一个主键或列设置为unique的索引之外)。

#2


13  

With most implementations, the key is simply a longer key that includes all of the key values, with a separator. No magic there ;-)

在大多数实现中,键只是一个更长的键,包含所有键值和一个分隔符。没有魔法;-)

In your example the key values could looks something like

在您的示例中,键值可以是这样的

"123499|John Doe|Conway, NH"
"32144|Bill Gates| Seattle, WA"

One of the characteristics of these indexes with composite keys is that the intermediate tree nodes can be used in some cases to "cover" the query.

这些具有组合键的索引的一个特征是,在某些情况下,中间树节点可以用于“覆盖”查询。

For example if the query is to find the Name and City given the ID, since the ID is first in the index, the index can search by this efficiently. Once in the intermediate node, it can "parse" the Name and City, from the key, and doesn't need to go to the leaf node to read the same.

例如,如果查询是在给定ID的情况下查找名称和城市,因为ID在索引中是第一个,那么索引可以有效地进行搜索。一旦进入中间节点,它就可以从键“解析”名称和城市,而不需要转到叶节点来读取相同的名称和城市。

If however the query wanted also to display the phone number, then the logic would follow down the leaf when the full record is found.

但是,如果查询还想显示电话号码,那么当找到完整记录时,逻辑将跟随叶节点。

#3


2  

In Oracle a composite key index can be used even though the leading columns are not filtered. This is done through three mechanisms:

在Oracle中,可以使用复合键索引,即使不过滤前列。这是通过三种机制实现的:

  1. A fast full index scan, in which multiblock reads are used to traverse the entire index segment.
  2. 快速全索引扫描,其中多块读取用于遍历整个索引段。
  3. An index full scan, in which the index is read in the logical order of the blocks (I believe I read that in recent versions Oracle can use multiblock reads for this, but really you should count on single block reads)
  4. 索引完整扫描,其中索引按块的逻辑顺序读取(我相信我在最近的版本中已经看到Oracle可以为此使用多块读取,但实际上您应该依赖于单块读取)
  5. An inddex skip scan, where a very low cardinality for the non-predicated leading columns allows Oracle to perform multiple index range scans, one for each unique value of the leading column(s). These are pretty rare in my experience.
  6. inddex跳转扫描,其中非谓词领先列的基数非常低,允许Oracle执行多个索引范围扫描,每个索引列的唯一值都有一个索引范围扫描。这些在我的经验中是非常罕见的。

Look for articles by Richard Foote or Jonathan Lewis for more information on Oracle index internals.

请参阅理查德•富特或乔纳森•刘易斯的文章,了解更多有关Oracle索引内部的信息。

#4


0  

Some implementations simply concatenate the values in the order of the columns, with delimiters.

有些实现简单地将值按列的顺序连接起来,并带有分隔符。

Another solution is to simply have a b-tree within a b-tree. When you hit a leaf on the first column, you get both a list of matching records and a mini b-tree of the next column, and so on. Thus, the order of the columns specified in the index makes a huge difference on whether that index will be useful for particular queries.

另一个解决方案是在b树中有一个b树。当您碰到第一列的叶子时,您将获得匹配记录的列表和下一列的迷你b-tree,等等。因此,索引中指定的列的顺序对于该索引是否对特定查询有用有很大的不同。

Here's a related question I wrote last week:

这是我上周写的一个相关的问题:

Does SQL Server jump leaves when using a composite clustered index?

使用复合集群索引时,SQL Server是否会跳转?

#5


0  

Other than the "composite key" mechanism already described, one possibility is a kdtree which works like a binary tree, but as you traverse each level you cycle through k dimensions. That is, the first level of the tree separates the first dimension into two parts, the second level splits the second dimension, the k+1th level splits the first dimension again, etc.. This allows for efficient partitioning of data in any number of dimensions. This approach is common in "spatial" databases (e.g., Oracle Spatial, PostGIS, etc.), but probably not as useful in "regular" multi-indexed tables.

除了已经描述的“复合键”机制之外,还有一种可能是kdtree,它的工作方式类似于二叉树,但是当您遍历每一层时,您都要遍历k个维度。也就是说,树的第一级将第一个维度分成两部分,第二级将第二个维度分成两部分,k+1级将第一个维度再分成两部分,等等。这允许在任意数量的维度中有效地划分数据。这种方法在“空间”数据库(如Oracle spatial、PostGIS等)中很常见,但在“常规”多索引表中可能没有那么有用。

http://en.wikipedia.org/wiki/Kd-tree

http://en.wikipedia.org/wiki/Kd-tree

#6


0  

It can use the (id,name,city) index to satisfy a "City = ? " predicate, but very very inefficently.

它可以使用(id,name,city)索引来满足“city = ?”"谓词,但非常无效。

In order to use the index to satisfy this query it would need to walk most of tree structure looking for entries with the desired city. This is still probably an order of magnatude faster than scanning the table!

为了使用索引来满足这个查询,它需要遍历树结构,以查找与期望的城市相关的条目。这可能仍然比扫描桌子要快。

An index of (city,name,id) would be the best index for your query. It would find all the desired city entries easily and would not need to access the underlying table to get the id and name values.

(城市、名称、id)索引将是查询的最佳索引。它可以很容易地找到所有需要的城市条目,并且不需要访问底层表来获取id和名称值。

#1


9  

Imagine that the key is represented by a Python tuple (col1, col2, col3) ... the indexing operation involves comparing tuple_a with tuple_b ... if you have don't know which value of col1 and col2 that you are interested in, but only col3, then it would have to read the whole index ("full index scan"), which is not as efficient.

假设键由一个Python tuple表示(col1、col2、col3)…索引操作涉及到比较tuple_a和tuple_b…如果您不知道您感兴趣的col1和col2的哪个值,但是只知道col3,那么它就必须读取整个索引(“全索引扫描”),这并不是很有效。

If you have an index on (col1, col2, col3), then you can expect that any RDBMS will use the index (in a direct manner) when the WHERE clause contains reference to (1) all 3 columns (2) both col1 and col2 (3) only col1.

如果您在(col1、col2、col3)上有一个索引,那么当WHERE子句包含对(1)所有3列(2)的引用(col1)和col2(3)都只包含col1时,您可以预期任何RDBMS将(以直接方式)使用该索引。

Otherwise (e.g. only col3 in the WHERE clause), either the RDBMS will not use that index at all (e.g. SQLite), or will do a full index scan (e.g. Oracle) [if no other index is better].

否则(例如,在WHERE子句中只使用col3), RDBMS根本不会使用该索引(例如SQLite),或者进行完整的索引扫描(例如Oracle)[如果没有更好的索引的话]。

In your specific example, presuming that id is a unique identifier of a customer, it is pointless to have it appear in an index (other than the index that your DBMS should set up for a primary key or column noted as UNIQUE).

在您的特定示例中,假设id是客户的唯一标识符,那么将它显示在索引中是没有意义的(除了DBMS应该为一个主键或列设置为unique的索引之外)。

#2


13  

With most implementations, the key is simply a longer key that includes all of the key values, with a separator. No magic there ;-)

在大多数实现中,键只是一个更长的键,包含所有键值和一个分隔符。没有魔法;-)

In your example the key values could looks something like

在您的示例中,键值可以是这样的

"123499|John Doe|Conway, NH"
"32144|Bill Gates| Seattle, WA"

One of the characteristics of these indexes with composite keys is that the intermediate tree nodes can be used in some cases to "cover" the query.

这些具有组合键的索引的一个特征是,在某些情况下,中间树节点可以用于“覆盖”查询。

For example if the query is to find the Name and City given the ID, since the ID is first in the index, the index can search by this efficiently. Once in the intermediate node, it can "parse" the Name and City, from the key, and doesn't need to go to the leaf node to read the same.

例如,如果查询是在给定ID的情况下查找名称和城市,因为ID在索引中是第一个,那么索引可以有效地进行搜索。一旦进入中间节点,它就可以从键“解析”名称和城市,而不需要转到叶节点来读取相同的名称和城市。

If however the query wanted also to display the phone number, then the logic would follow down the leaf when the full record is found.

但是,如果查询还想显示电话号码,那么当找到完整记录时,逻辑将跟随叶节点。

#3


2  

In Oracle a composite key index can be used even though the leading columns are not filtered. This is done through three mechanisms:

在Oracle中,可以使用复合键索引,即使不过滤前列。这是通过三种机制实现的:

  1. A fast full index scan, in which multiblock reads are used to traverse the entire index segment.
  2. 快速全索引扫描,其中多块读取用于遍历整个索引段。
  3. An index full scan, in which the index is read in the logical order of the blocks (I believe I read that in recent versions Oracle can use multiblock reads for this, but really you should count on single block reads)
  4. 索引完整扫描,其中索引按块的逻辑顺序读取(我相信我在最近的版本中已经看到Oracle可以为此使用多块读取,但实际上您应该依赖于单块读取)
  5. An inddex skip scan, where a very low cardinality for the non-predicated leading columns allows Oracle to perform multiple index range scans, one for each unique value of the leading column(s). These are pretty rare in my experience.
  6. inddex跳转扫描,其中非谓词领先列的基数非常低,允许Oracle执行多个索引范围扫描,每个索引列的唯一值都有一个索引范围扫描。这些在我的经验中是非常罕见的。

Look for articles by Richard Foote or Jonathan Lewis for more information on Oracle index internals.

请参阅理查德•富特或乔纳森•刘易斯的文章,了解更多有关Oracle索引内部的信息。

#4


0  

Some implementations simply concatenate the values in the order of the columns, with delimiters.

有些实现简单地将值按列的顺序连接起来,并带有分隔符。

Another solution is to simply have a b-tree within a b-tree. When you hit a leaf on the first column, you get both a list of matching records and a mini b-tree of the next column, and so on. Thus, the order of the columns specified in the index makes a huge difference on whether that index will be useful for particular queries.

另一个解决方案是在b树中有一个b树。当您碰到第一列的叶子时,您将获得匹配记录的列表和下一列的迷你b-tree,等等。因此,索引中指定的列的顺序对于该索引是否对特定查询有用有很大的不同。

Here's a related question I wrote last week:

这是我上周写的一个相关的问题:

Does SQL Server jump leaves when using a composite clustered index?

使用复合集群索引时,SQL Server是否会跳转?

#5


0  

Other than the "composite key" mechanism already described, one possibility is a kdtree which works like a binary tree, but as you traverse each level you cycle through k dimensions. That is, the first level of the tree separates the first dimension into two parts, the second level splits the second dimension, the k+1th level splits the first dimension again, etc.. This allows for efficient partitioning of data in any number of dimensions. This approach is common in "spatial" databases (e.g., Oracle Spatial, PostGIS, etc.), but probably not as useful in "regular" multi-indexed tables.

除了已经描述的“复合键”机制之外,还有一种可能是kdtree,它的工作方式类似于二叉树,但是当您遍历每一层时,您都要遍历k个维度。也就是说,树的第一级将第一个维度分成两部分,第二级将第二个维度分成两部分,k+1级将第一个维度再分成两部分,等等。这允许在任意数量的维度中有效地划分数据。这种方法在“空间”数据库(如Oracle spatial、PostGIS等)中很常见,但在“常规”多索引表中可能没有那么有用。

http://en.wikipedia.org/wiki/Kd-tree

http://en.wikipedia.org/wiki/Kd-tree

#6


0  

It can use the (id,name,city) index to satisfy a "City = ? " predicate, but very very inefficently.

它可以使用(id,name,city)索引来满足“city = ?”"谓词,但非常无效。

In order to use the index to satisfy this query it would need to walk most of tree structure looking for entries with the desired city. This is still probably an order of magnatude faster than scanning the table!

为了使用索引来满足这个查询,它需要遍历树结构,以查找与期望的城市相关的条目。这可能仍然比扫描桌子要快。

An index of (city,name,id) would be the best index for your query. It would find all the desired city entries easily and would not need to access the underlying table to get the id and name values.

(城市、名称、id)索引将是查询的最佳索引。它可以很容易地找到所有需要的城市条目,并且不需要访问底层表来获取id和名称值。