I have the following query optimization problem in Spanner, and I'm hoping there's a trick I'm missing that will help me bend the query planner to my will.
Here's the simplified schema:
create table T0 (
  key0 int64 not null,
  value int64,
  other int64 not null,
) primary key (key0);
create table T1 (
  key1 int64 not null,
  other int64 not null
) primary key (key1);
And a query with a subquery in an IN clause:
select value from T0 t0
where t0.other in (
select t1.other from T1 t1 where t1.key1 in (42, 43, 44) -- note: this subquery is a good deal more complex than this
)
This produces a 10-element set, via a hash join of T0 against the output of the subquery:
Operator Rows Executions
----------------------- ----- ----------
Serialize Result 10 1
Hash Join 10 1
Distributed union 10000 1
Local distributed union 10000 1
Table Scan: T0 10000 1
Distributed cross apply: 5 1
...lots moar T1 subquery stuff...
Note that, while the subquery is complex, it actually produces a very small set. Unfortunately, the plan also scans the entirety of T0 (the 10000-row table scan above) to feed the hash join, which is very slow.
However, if I take the output of the subquery on T1 and manually shove it into the IN clause:
select value from T0
where other in (5, 6, 7, 8, 9) -- presume this `IN` clause to be the output of the above subquery
It is dramatically faster, presumably because it just hits T0's index once per entry, not using a hash join on the full contents:
Operator Rows Executions
----------------------- ---- ----------
Distributed union 10 1
Local distributed union 10 1
Serialize Result 10 1
Filter 10 1
Index Scan: 10 1
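The fast plan above relies on a secondary index over T0.other that the simplified schema does not show. A minimal sketch of what that might look like, assuming a hypothetical index name T0ByOther (the real index name and any STORING columns are not given in the question):

```sql
-- Hypothetical index; the name T0ByOther is an assumption, not from the original schema.
-- STORING (value) is optional and lets the query be answered from the index alone.
CREATE INDEX T0ByOther ON T0(other) STORING (value);

-- The literal-list query, with a table hint asking Spanner to use that index.
SELECT value
FROM T0@{FORCE_INDEX=T0ByOther}
WHERE other IN (5, 6, 7, 8, 9);
```

Whether the hint is needed at all depends on what the planner already picks; checking the query plan is the only way to tell.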
I could simply run two queries, and that's my best plan so far. But I'm hoping I can find some way to cajole Spanner into deciding that this is what it ought to do with the output of the subquery in the first example. I've tried everything I can think of, but this may simply not be expressible in SQL at all.
Also: I haven't quite proven this yet, but in some cases I fear that the 10-element subquery output could blow up to a few thousand elements (T1 will grow more or less without bound, easily to millions of rows). I've manually tested with a few hundred elements in the splatted-out IN clause and it seems to perform acceptably, but I'm a little concerned it could get out of hand.
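For the two-query fallback, the second query does not have to embed a literal list: the values can be bound as an array parameter. A minimal sketch, assuming a hypothetical query parameter @others of type ARRAY<INT64> that the client fills with the first query's results:

```sql
-- Query 1: run the (complex) subquery on its own and collect the results client-side.
SELECT DISTINCT t1.other
FROM T1 t1
WHERE t1.key1 IN (42, 43, 44);

-- Query 2: bind the collected values to @others (hypothetical ARRAY<INT64> parameter).
SELECT t0.value
FROM T0 t0
WHERE t0.other IN UNNEST(@others);
```

This keeps the query text constant as the list grows. Whether the planner treats the UNNEST form the same way as a literal IN list is worth confirming in the query plan.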
Note that I also tried a join on the subquery, like so:
select t0.other from T0 t0
join (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  select t1.other from T1 t1 where t1.key1 = 42
) sub on sub.other = t0.other
But it did something truly horrifying in the query planner that I won't even try to explain here.
1 Answer
#1
Does your actual subquery in the IN clause use any variables from T0? If not, what happens if you try your join query with the tables reordered (and a DISTINCT added for correctness, unless you know that the values will be distinct)?
SELECT t0.other FROM (
  -- Yes, this could be a simple join rather than a subquery, but in practice it's complex
  -- enough that it can't be expressed that way.
  SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 = 42
) sub
JOIN T0 t0
ON sub.other = t0.other
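If reordering alone does not change the plan, Spanner's join hints may be worth trying on the same query. The following is a sketch, not a verified fix: it asks the planner to keep the written join order (small subquery first) and to use an apply join, so that each row from the subquery drives a lookup into T0. It assumes T0 has a usable index on other; without one, each lookup could turn into a scan.

```sql
SELECT t0.other
FROM (
  -- Small driving side: the distinct values produced by the subquery.
  SELECT DISTINCT t1.other FROM T1 t1 WHERE t1.key1 = 42
) sub
-- FORCE_JOIN_ORDER keeps sub on the left; JOIN_METHOD=APPLY_JOIN requests
-- a per-row lookup into T0 instead of a hash join over all of T0.
JOIN@{FORCE_JOIN_ORDER=TRUE, JOIN_METHOD=APPLY_JOIN} T0 t0
  ON sub.other = t0.other;
```

Hints override the optimizer's choice, so the resulting plan should be compared against the unhinted one before relying on it.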