Does a unique index perform better for column searches? (PGSQL & MySQL)

Date: 2022-09-17 21:38:36

I am curious as to whether

CREATE INDEX idx ON tbl (columns);

vs.

CREATE UNIQUE INDEX idx ON tbl (columns);

has a significant algorithmic performance benefit in PostgreSQL or MySQL implementations when scanning the indexed column(s), or whether the UNIQUE keyword simply introduces a unique constraint alongside the index.

I imagine it is probably fair to say that there is a marginal benefit, insofar as indexes are likely to be implemented internally as some sort of hash-like structure [1], and collision handling by definition results in something other than O(1) performance. Given this premise, it is likely that if a large percentage of values are identical, then the structure degenerates into something linear.

So, for purposes of my question, assume that the distribution of values is relatively discrete and uniform.

Thanks in advance!

[1] This is a matter of pure speculation for me, as I am not familiar with RDBMS internals.

3 Answers

#1


15  

If your data are unique, you should create a UNIQUE index on them.

This implies no additional overhead and affects the optimizer's decisions in certain cases so that it can choose a better algorithm.

In SQL Server and in PostgreSQL, for instance, if you sort on a UNIQUE key, the optimizer ignores any ORDER BY expressions listed after it (since they are irrelevant). For example, this query:

SELECT  *
FROM    mytable
ORDER BY
        col_unique, other_col
LIMIT 10

will use an index on col_unique and won't sort on other_col because it's useless.
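To see why the trailing sort key is redundant, here is a minimal Python sketch. The rows are made up for illustration, and it only models the ordering logic, not how any particular optimizer reaches its plan.

```python
# Hypothetical rows: (col_unique, other_col). col_unique has no repeated values.
rows = [(3, "b"), (1, "z"), (2, "a"), (5, "a"), (4, "q")]

# ORDER BY col_unique, other_col
by_both = sorted(rows, key=lambda r: (r[0], r[1]))

# ORDER BY col_unique
by_unique_only = sorted(rows, key=lambda r: r[0])

# With a unique leading key there are never ties for other_col to break,
# so both orderings are identical.
same_order = by_both == by_unique_only
print(same_order)
```

Because no two rows share a col_unique value, other_col never gets a chance to influence the result.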

This query:

SELECT  *
FROM    mytable
WHERE   mycol IN
        (
        SELECT  othercol
        FROM    othertable
        )

will also be converted into an INNER JOIN (as opposed to a SEMI JOIN) if there is a UNIQUE index on othertable.othercol.
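The reason this rewrite is safe only with a unique column can be sketched in Python. The lists and the two helper functions below are hypothetical stand-ins for the tables and join operators, not anything a database exposes.

```python
mytable = [1, 2, 2, 3]      # hypothetical mycol values
unique_other = [2, 3, 4]    # othercol with no duplicates
dup_other = [2, 2, 3, 4]    # othercol containing a duplicate 2

def semi_join(outer, inner):
    """Models IN (subquery): each outer row is emitted at most once."""
    inner_set = set(inner)
    return [x for x in outer if x in inner_set]

def inner_join(outer, inner):
    """Models a plain join: one output row per matching pair."""
    return [x for x in outer for y in inner if x == y]

# Unique inner column: both strategies return the same rows.
print(semi_join(mytable, unique_other) == inner_join(mytable, unique_other))

# Duplicates in the inner column: the inner join multiplies rows.
print(semi_join(mytable, dup_other), inner_join(mytable, dup_other))
```

When othercol can repeat, a plain join would return extra rows, so the optimizer has to keep semi-join semantics (or de-duplicate first). A uniqueness guarantee removes that concern.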

An index always contains some kind of pointer to the row (ctid in PostgreSQL, row pointer in MyISAM, primary key/uniquifier in InnoDB) and the leaves are ordered on these pointers, so in fact every index leaf is unique in some way (though it may not be obvious).

See this article in my blog for performance details:

#2


3  

There is a small penalty during update/insert operations for having the unique constraint. The database has to search the index before the insert/update operation to make sure the uniqueness constraint isn't violated.
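The check itself is easy to observe. The sketch below uses SQLite through Python's sqlite3 module purely as a convenient stand-in (an assumption for illustration: PostgreSQL and MySQL also reject the duplicate, but their internals and error types differ), and it reuses the tbl and idx names from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx ON tbl (col)")
conn.execute("INSERT INTO tbl (col) VALUES (1)")

try:
    # The engine must look up the key in idx before accepting this row.
    conn.execute("INSERT INTO tbl (col) VALUES (1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM tbl").fetchone()[0]
print(rejected, row_count)
```

This shows only that the check happens, not what it costs. Measuring the penalty would require timing inserts on the actual engine and data in question.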

#3


2  

Well, indexes are usually B-trees, not hashes (hash-based indexes exist, but the most common index type, at least in PostgreSQL, is based on a B-tree).

As for speed, a unique index should be faster: when an index scan finds a row with the given value, it doesn't have to check whether any other rows have this value, and can finish scanning immediately.
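A rough way to picture this is a binary search over a sorted list of keys, which is a simplified model of a B-tree leaf level rather than how PostgreSQL or MySQL actually implement scans. The lookup function and its unique flag are hypothetical names for this sketch.

```python
from bisect import bisect_left

def lookup(sorted_keys, key, unique):
    """Return (matches, entries_examined) for an equality lookup."""
    pos = bisect_left(sorted_keys, key)  # descend to the first candidate
    matches, examined = 0, 0
    while pos < len(sorted_keys):
        examined += 1
        if sorted_keys[pos] != key:
            break  # walked past the last matching entry
        matches += 1
        if unique:
            break  # uniqueness guarantees no second match
        pos += 1
    return matches, examined

unique_keys = [10, 20, 30, 40]
print(lookup(unique_keys, 20, unique=True))   # (1, 1)
print(lookup(unique_keys, 20, unique=False))  # (1, 2)
```

In this model the unique lookup examines one entry fewer than the non-unique one. The logarithmic search that locates the key is the same in both cases, so the saving is a small constant, not a change in algorithmic complexity.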
