为什么在SQL查询中NOT IN比IN慢得多

时间:2021-11-19 03:55:04

I have found a surprising (at least to me) thing with IN and NOT IN. When I try to explain first query on PostgreSQL database:

我发现IN和NOT IN令人惊讶(至少对我而言)。当我尝试解释PostgreSQL数据库的第一个查询时:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this:

它告诉我这个:

Delete on audittaskimpl l  (cost=2794.48..6373.52 rows=50859 width=12)
  ->  Hash Semi Join  (cost=2794.48..6373.52 rows=50859 width=12)
        Hash Cond: (l.processinstanceid = spl.processinstanceid)
        ->  Seq Scan on audittaskimpl l  (cost=0.00..2005.59 rows=50859 width=14)
        ->  Hash  (cost=1909.24..1909.24 rows=50899 width=14)
              ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=14)
                    Filter: (status <> ALL ('{2,3}'::integer[]))

However, when I change in for not in, which is just a negation:

但是,当我换入而不是,这只是一个否定:

EXPLAIN DELETE
FROM AuditTaskImpl l
WHERE  l.processInstanceId NOT in (select spl.processInstanceId
                               FROM ProcessInstanceLog spl
                               WHERE spl.status not in ( 2, 3))

it tells me this:

它告诉我这个:

Delete on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
  ->  Seq Scan on audittaskimpl l  (cost=0.00..63321079.15 rows=25430 width=6)
        Filter: (NOT (SubPlan 1))
        SubPlan 1
          ->  Materialize  (cost=0.00..2362.73 rows=50899 width=8)
                ->  Seq Scan on processinstancelog spl  (cost=0.00..1909.24 rows=50899 width=8)
                      Filter: (status <> ALL ('{2,3}'::integer[]))

As you can see, with IN it uses hash join, which is of course much quicker, but with NOT IN it uses just plain sequential scan row by row. But since NOT IN is just a negation it could use hash join again and just do the opposite: with IN when there is processInstanceId in nested select, add it to the result, when there is not, don't add it, with NOT IN when there is processInstanceId in nested select, don't add it to the result, when there is not, add it to the result.

正如您所看到的,使用IN它使用散列连接,这当然要快得多,但是使用NOT IN它只是逐行使用简单顺序扫描。但是因为NOT IN只是一个否定它可以再次使用散列连接而只是反过来:当嵌套选择中有processInstanceId时使用IN,将其添加到结果中,当没有时,不添加它,使用NOT IN当嵌套选择中有processInstanceId时,不要将其添加到结果中,如果没有,则将其添加到结果中。

So can you perhaps explain why this happens? To clarify AuditTaskImpl has processInstanceId attribute which is also present in ProcessInstanceLog table, although there is no foreign key relationship between them.

你能解释一下为什么会这样吗?澄清AuditTaskImpl的processInstanceId属性也存在于ProcessInstanceLog表中,尽管它们之间没有外键关系。

Thanks.

谢谢。

1 个解决方案

#1


3  

The semantics of NOT IN require that nothing be returned if any value in the subquery is NULL. Hence, Postgres needs to look at all the values.

如果子查询中的任何值为NULL,则NOT IN的语义要求不返回任何内容。因此,Postgres需要查看所有值。

I strongly recommend never using NOT IN with a subquery. Always use NOT EXISTS:

我强烈建议不要在子查询中使用NOT IN。始终使用NOT EXISTS:

DELETE FROM AuditTaskImpl l
    WHERE NOT EXISTS (SELECT 1 
                      FROM ProcessInstanceLog spl
                      WHERE l.processInstanceId = spl.spl.processInstanceId AND
                            spl.status not in (2, 3)
                     );

#1


3  

The semantics of NOT IN require that nothing be returned if any value in the subquery is NULL. Hence, Postgres needs to look at all the values.

如果子查询中的任何值为NULL,则NOT IN的语义要求不返回任何内容。因此,Postgres需要查看所有值。

I strongly recommend never using NOT IN with a subquery. Always use NOT EXISTS:

我强烈建议不要在子查询中使用NOT IN。始终使用NOT EXISTS:

DELETE FROM AuditTaskImpl l
    WHERE NOT EXISTS (SELECT 1 
                      FROM ProcessInstanceLog spl
                      WHERE l.processInstanceId = spl.spl.processInstanceId AND
                            spl.status not in (2, 3)
                     );