I've heard lots of people say that the IN keyword is slow in most relational databases. How true is this? Here's an example query, off the top of my head:
SELECT * FROM someTable WHERE someColumn IN (value1, value2, value3)
I've heard that this is much slower than doing this:
SELECT * FROM someTable WHERE
someColumn = value1 OR
someColumn = value2 OR
someColumn = value3
Is this true? Or is the speed difference negligible? If it matters, I'm using PostgreSQL, but I'd also like to know how MySQL fares (and whether it's any different). Thanks in advance.
7 answers
#1
13
In PostgreSQL, exactly what you'll get here depends on the underlying table, so you should run EXPLAIN ANALYZE on some sample queries against a useful subset of your data to find out what the optimizer is going to do (make sure the tables you're running against have been ANALYZEd too). IN can be processed in a couple of different ways, which is why you need to look at some samples to figure out which alternative is being used for your data. There is no simple generic answer to your question.
As for the specific question you added in your revision: against a trivial data set with no indexes involved, here's an example of the two query plans you'll get:
postgres=# explain analyze select * from x where s in ('123','456');
Seq Scan on x (cost=0.00..84994.69 rows=263271 width=181) (actual time=0.015..1819.702 rows=247823 loops=1)
Filter: (s = ANY ('{123,456}'::bpchar[]))
Total runtime: 1931.370 ms
postgres=# explain analyze select * from x where s='123' or s='456';
Seq Scan on x (cost=0.00..90163.62 rows=263271 width=181) (actual time=0.014..1835.944 rows=247823 loops=1)
Filter: ((s = '123'::bpchar) OR (s = '456'::bpchar))
Total runtime: 1949.478 ms
Those two runtimes are essentially identical, because the real processing time is dominated by the sequential scan across the table; running the queries multiple times shows that the difference between the two is below the run-to-run margin of error. As you can see, PostgreSQL transforms the IN case into its ANY filter, which should execute at least as fast as a series of ORs. Again, this trivial case is not necessarily representative of what you'll see on a serious query where indexes and the like are involved. Regardless, manually replacing INs with a series of OR conditions should never be faster, because the optimizer knows the best thing to do here if it has good data to work with.
In general, PostgreSQL knows more tricks for optimizing complicated queries than the MySQL optimizer does, but it also relies heavily on your having given the optimizer enough data to work with. The first few links in the "Performance Optimization" section of the PostgreSQL wiki cover the most important things needed to get good results from the optimizer.
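To check the indexed case mentioned above on your own system, something like the following could be run in psql. This is only a sketch: the table name in_vs_or_demo, its columns, the index name, and the generated data are all made up for illustration, and the plans you get will depend on your PostgreSQL version, statistics, and data distribution.

```sql
-- Hypothetical throwaway table, used only to compare the two plans.
CREATE TABLE in_vs_or_demo (id integer, s text);

-- 200,000 rows with 1,000 distinct values of s (assumed data shape).
INSERT INTO in_vs_or_demo
SELECT g, (g % 1000)::text
FROM generate_series(1, 200000) AS g;

CREATE INDEX in_vs_or_demo_s_idx ON in_vs_or_demo (s);

-- Refresh planner statistics so the optimizer has good data to work with.
ANALYZE in_vs_or_demo;

-- Compare the plan and timing of the two forms.
EXPLAIN ANALYZE SELECT * FROM in_vs_or_demo WHERE s IN ('123', '456');
EXPLAIN ANALYZE SELECT * FROM in_vs_or_demo WHERE s = '123' OR s = '456';

DROP TABLE in_vs_or_demo;
```

Run each EXPLAIN ANALYZE several times before comparing, since single timings are noisy.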
#2
8
In MySQL, these are complete synonyms for the optimizer:
SELECT *
FROM someTable
WHERE someColumn IN (value1, value2, value3)
and
SELECT *
FROM someTable
WHERE someColumn = value1 OR
someColumn = value2 OR
someColumn = value3
provided that the values are literal constants or preset variables.
According to the documentation:
The definition of a range condition for a single-part index is as follows:
- For both BTREE and HASH indexes, comparison of a key part with a constant value is a range condition when using the =, <=>, IN(), IS NULL, or IS NOT NULL operators.
- …
- For all types of indexes, multiple range conditions combined with OR or AND form a range condition.
“Constant value” in the preceding descriptions means one of the following:
- A constant from the query string
- A column of a const or system table from the same join
- The result of an uncorrelated subquery
- Any expression composed entirely from subexpressions of the preceding types
However, this query:
SELECT *
FROM table
WHERE id = 1
OR id = (SELECT id FROM other_table WHERE unique_condition)
will use the index on id, while this one:
SELECT *
FROM table
WHERE id IN (1, (SELECT id FROM other_table WHERE unique_condition))
will use a full table scan.
I.e., there is a difference when one of the values is a single-row subquery.
I recently filed this as bug 45145 in MySQL (it turned out to be specific to 5.2: absent in 5.1 and corrected in 6.0).
#3
5
Using IN isn't necessarily slow; it's how you build the IN parameters that can slow things down considerably. Too often people use SELECT ... WHERE x IN (SELECT ..., which can be very poorly optimized (i.e. not at all). Search for "correlated subquery" to see how bad it can get.
Often you don't have to use IN at all and can use a JOIN instead, and take advantage of derived tables.
SELECT * FROM table1 WHERE x IN (SELECT y FROM table2 WHERE z=3)
can be rephrased like this:
SELECT * FROM table1 JOIN (SELECT y FROM table2 WHERE z=3) AS table2 ON table1.x=table2.y
If the IN syntax is slow, the JOIN syntax will often be much faster. You can use EXPLAIN to see how each query is optimized differently. This is a simplistic example and your database may show the same query plan for both, but more complicated queries usually show something different.
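One caveat with the rewrite above: IN returns each table1 row at most once, while a plain JOIN returns one row per match, so the two are only equivalent when y is unique among the rows where z=3. A sketch of two variants that keep the IN semantics, using the same placeholder tables from the example (the alias t2 is made up here):

```sql
-- Variant 1: de-duplicate the derived table so each table1 row matches at most once.
-- Selecting table1.* also keeps the extra y column out of the result.
SELECT table1.*
FROM table1
JOIN (SELECT DISTINCT y FROM table2 WHERE z = 3) AS t2
  ON table1.x = t2.y;

-- Variant 2: EXISTS expresses the same "is there a match" test directly.
SELECT *
FROM table1
WHERE EXISTS (
    SELECT 1
    FROM table2
    WHERE table2.z = 3
      AND table2.y = table1.x
);
```

Whether either form is faster than IN depends on the database and version, so compare them with EXPLAIN on your own data.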
#4
1
IN with a subselect is often slow. IN with a value list shouldn't be any slower than someColumn = value1 OR someColumn = value2 OR someColumn = value3, etc. That is plenty fast, as long as the number of values is sane.
IN with a subquery is slow when the optimizer can't figure out a good way to perform the query, and has to use the obvious method of building the full result of the subquery. For example:
SELECT username
FROM users
WHERE userid IN (
SELECT userid FROM users WHERE user_first_name = 'Bob'
)
is going to be much slower than
SELECT username FROM users WHERE user_first_name = 'Bob'
unless the optimizer can figure out what you meant.
#5
1
I think you got the answer(s) you wanted above. Just wanted to add one thing.
You need to optimize IN and use it the right way. In development, I always set up a debug section at the bottom of the page. Whenever the page runs a query, it automatically runs EXPLAIN EXTENDED on every SELECT and then SHOW WARNINGS, to see the (likely) way that MySQL's query optimizer will rewrite the query internally. There's a lot to learn from that about how to make sure IN is working for you.
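For reference, the pair of statements described above looks roughly like this in the mysql client. The table and column names are the placeholders from the question, and the literal values 1, 2, 3 are made up. As far as I know, the EXTENDED keyword was deprecated in MySQL 5.7 and removed in 8.0, where a plain EXPLAIN followed by SHOW WARNINGS gives the same information.

```sql
-- Older MySQL versions: ask for the extended plan, then read the rewritten query.
EXPLAIN EXTENDED
SELECT * FROM someTable WHERE someColumn IN (1, 2, 3);

-- The Message column of the note shows the query as the optimizer rewrote it.
SHOW WARNINGS;

-- MySQL 8.0 and later (assumption: EXTENDED is no longer accepted there):
-- EXPLAIN SELECT * FROM someTable WHERE someColumn IN (1, 2, 3);
-- SHOW WARNINGS;
```

SHOW WARNINGS has to run in the same session, immediately after the EXPLAIN, or the note is lost.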
#6
0
The speed of the IN keyword really depends on the complexity of your subquery. In the example you provide, you just want to see whether someColumn's value is in a set list of values, and a pretty short one at that. So I would imagine the performance cost is minimal in that case.
#7
0
It says in the docs that IN is very fast in MySQL, but I can't find the source at the moment.