I am curious as to whether
CREATE INDEX idx ON tbl (columns);
vs.
CREATE UNIQUE INDEX idx ON tbl (columns);
has a significant algorithmic performance benefit in PostgreSQL or MySQL implementations when scanning the indexed column(s), or whether the UNIQUE keyword simply introduces a unique constraint alongside the index.
I imagine it is probably fair to say that there is a marginal benefit insofar as indexes are likely to be internally implemented as some sort of hash[1]-like structure, and collision handling by definition results in something other than O(1) performance. Given this premise, it is likely that if a large percentage of values are identical, then the structure degenerates into something linear.
So, for purposes of my question, assume that the distribution of values is relatively discrete and uniform.
Thanks in advance!
[1] Which is a matter of pure speculation for me, as I am not familiar with RDBMS internals.
3 answers
#1
15
If your data are unique, you should create a UNIQUE index on them.
This implies no additional overhead and affects the optimizer's decisions in certain cases so that it can choose a better algorithm.
In SQL Server and in PostgreSQL, for instance, if you sort on a UNIQUE key, the optimizer ignores the ORDER BY clauses used after that (since they are irrelevant), i.e. this query:
SELECT *
FROM mytable
ORDER BY
col_unique, other_col
LIMIT 10
will use an index on col_unique and won't sort on other_col, because with no ties on col_unique the second sort key can never change the order.
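To see why the second sort key is redundant, here is a minimal Python sketch. It is not how any optimizer is implemented; the rows are hypothetical (col_unique, other_col) pairs made up for illustration, and the only assumption is that col_unique holds no duplicates.

```python
# Hypothetical rows: (col_unique, other_col). col_unique has no duplicates.
rows = [(3, "b"), (1, "z"), (2, "a"), (5, "c"), (4, "a")]

# ORDER BY col_unique, other_col
by_both = sorted(rows, key=lambda r: (r[0], r[1]))

# ORDER BY col_unique
by_unique_only = sorted(rows, key=lambda r: r[0])

# With a unique leading key there are no ties for other_col to break,
# so both orderings are identical.
same_order = by_both == by_unique_only

# Rough equivalent of LIMIT 3 on the single-key ordering.
first_three = by_unique_only[:3]
```

If col_unique could contain duplicates, the two orderings could differ, which is why the optimizer needs the UNIQUE guarantee before it can drop the extra key.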
This query:
SELECT *
FROM mytable
WHERE mycol IN
(
SELECT othercol
FROM othertable
)
will also be converted into an INNER JOIN (as opposed to a SEMI JOIN) if there is a UNIQUE index on othertable.othercol.
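The rewrite is only safe because a unique othercol can match each outer row at most once. A rough Python sketch of the two join semantics follows; semi_join and inner_join are hypothetical helper names and the value lists are invented for illustration.

```python
def semi_join(outer, inner):
    """Return each outer value at most once if it has any match in inner."""
    inner_set = set(inner)
    return [x for x in outer if x in inner_set]


def inner_join(outer, inner):
    """Return one output value per matching (outer, inner) pair."""
    return [x for x in outer for y in inner if x == y]


mycol = [1, 2, 3, 4]
unique_othercol = [2, 3, 5]    # no duplicates, as a UNIQUE index guarantees
dup_othercol = [2, 2, 3, 5]    # duplicates allowed, as with a plain index

# With unique values both joins return the same rows.
unique_semi = semi_join(mycol, unique_othercol)
unique_inner = inner_join(mycol, unique_othercol)

# With duplicates the inner join repeats the outer row for each match.
dup_semi = semi_join(mycol, dup_othercol)
dup_inner = inner_join(mycol, dup_othercol)
```

Without the uniqueness guarantee, the planner has to keep the semi join (or deduplicate) to avoid returning the same outer row more than once.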
An index always contains some kind of a pointer to the row (ctid in PostgreSQL, row pointer in MyISAM, primary key/uniquifier in InnoDB) and the leaves are ordered on these pointers, so in fact every index leaf is unique in some way (though it may not be obvious).
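As a toy illustration of that point, the sketch below models index entries as (key, row_pointer) tuples kept in sorted order. The index_insert helper and the sample values are hypothetical; real engines store entries in B-tree pages, not in a Python list.

```python
import bisect


def index_insert(entries, key, row_pointer):
    """Insert a (key, row_pointer) entry, keeping the list sorted."""
    bisect.insort(entries, (key, row_pointer))


entries = []
index_insert(entries, "smith", 7)
index_insert(entries, "smith", 2)   # same key, different row
index_insert(entries, "jones", 5)

# The key "smith" appears twice, but the full entries are still distinct
# because each one carries a different row pointer.
all_entries_distinct = len(set(entries)) == len(entries)
```

Duplicate keys end up adjacent and ordered by their row pointer, so the entry as a whole is unique even when the key is not.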
See this article in my blog for performance details:
- Making an index UNIQUE
#2
3
There is a small penalty during update/insert operations for having the unique constraint. The database has to search the index before the insert/update operation to make sure the uniqueness constraint isn't violated.
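A minimal sketch of where that extra work sits, using a sorted Python list as a stand-in for the index. The names insert_plain, insert_unique and DuplicateKeyError are hypothetical, and the sketch ignores everything a real engine also has to handle (concurrency, row visibility, page splits).

```python
import bisect


class DuplicateKeyError(Exception):
    """Raised when a key already exists in the unique index."""


def insert_plain(keys, key):
    """Non-unique index: find the position and insert, duplicates allowed."""
    bisect.insort(keys, key)


def insert_unique(keys, key):
    """Unique index: find the position, check for an existing key, then insert."""
    pos = bisect.bisect_left(keys, key)
    if pos < len(keys) and keys[pos] == key:
        raise DuplicateKeyError(key)
    keys.insert(pos, key)


plain_keys = []
for k in (1, 3, 2, 2):
    insert_plain(plain_keys, k)

unique_keys = []
for k in (1, 3, 2):
    insert_unique(unique_keys, k)

try:
    insert_unique(unique_keys, 2)
    duplicate_rejected = False
except DuplicateKeyError:
    duplicate_rejected = True
```

In this sketch the position lookup is needed for both kinds of insert, so the added cost of the unique variant is only the comparison at that position, which fits the "small penalty" described above.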
#3
2
Well, usually indexes are B-trees, not hashes (there are hash-based indexes, but the most common index type, at least in PostgreSQL, is based on a B-tree).
As for speed - unique should be faster - when an index scan finds a row with the given value, it doesn't have to check whether there are any other rows with this value, and can finish scanning immediately.
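The difference can be sketched in Python with a sorted list of (key, row_pointer) tuples standing in for the index leaves. lookup_unique and lookup_non_unique are hypothetical names, and this is an illustration of the idea rather than how PostgreSQL or MySQL actually scan.

```python
import bisect


def lookup_unique(entries, key):
    """Unique index: at most one match, so stop after the first entry."""
    pos = bisect.bisect_left(entries, (key,))
    if pos < len(entries) and entries[pos][0] == key:
        return [entries[pos][1]]
    return []


def lookup_non_unique(entries, key):
    """Non-unique index: keep scanning forward until the key changes."""
    pos = bisect.bisect_left(entries, (key,))
    found = []
    while pos < len(entries) and entries[pos][0] == key:
        found.append(entries[pos][1])
        pos += 1
    return found


unique_entries = [(10, 1), (20, 2), (30, 3)]
non_unique_entries = [(10, 1), (20, 2), (20, 5), (30, 3)]

unique_hit = lookup_unique(unique_entries, 20)
unique_miss = lookup_unique(unique_entries, 25)
non_unique_hits = lookup_non_unique(non_unique_entries, 20)
```

Both lookups pay the same logarithmic cost to find the first candidate. The non-unique scan then reads one entry past the last match, so for evenly distributed values the saving from uniqueness is real but small.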