mysql not in 和 left join 效率问题记录

时间：2022-08-29 21:57:27

声明：感谢laserhe, denniswwh，ACMAIN_CHM，vinsonshen 的热心帮助

首先说明该条sql的功能是查询集合a不在集合b的数据。
not in的写法
select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID =12
      and CreateTime>'2009-8-14 15:30:00'
      and CreateTime<='2009-8-17 16:00:00'
) add_tb
where add_tb.RUID
not in (select distinct RUID
          from UserMsg
          where SubjectID =12
    and CreateTime<'2009-8-14 15:30:00'
)
返回444行记录用时 0.07sec
explain 结果
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
| id | select_type        | table      | type           | possible_keys             | key        | key_len | ref | rows |

Extra                        |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
| 1 | PRIMARY            | <derived2> | ALL            | NULL                      | NULL       |    NULL | NULL | 452 |

Using where                  |
| 3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,SubjectID,CreateTime | RUID       |      96 | func |    2 |

Using index; Using where     |
| 2 | DERIVED            | UserMsg    | range          | SubjectID,CreateTime      | CreateTime |       9 | NULL | 1857 |

Using where; Using temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
分析:该条查询速度快原因为id=2的sql查询出来的结果比较少，所以id=1sql所以运行速度比较快，id=2的使用了临时表，不知道这个时候是否使用索引？
其中一种left join
select a.ruid,b.ruid
from(select distinct RUID
     from UserMsg
     where SubjectID =12
     and CreateTime >= '2009-8-14 15:30:00'
     and CreateTime<='2009-8-17 16:00:00'
) a left join (
    select distinct RUID
    from UserMsg
    where SubjectID =12 and CreateTime< '2009-8-14 15:30:00'
) b on a.ruid = b.ruid
where b.ruid is null
返回444行记录用时 0.39sec
explain 结果
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
| id | select_type | table      | type | possible_keys        | key        | key_len | ref | rows | Extra

     |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
| 1 | PRIMARY     | <derived2> | ALL   | NULL                 | NULL       |    NULL | NULL | 452 |

     |
| 1 | PRIMARY     | <derived3> | ALL   | NULL                 | NULL       |    NULL | NULL | 1112 | Using where; Not exists

     |
| 3 | DERIVED     | UserMsg    | ref   | SubjectID,CreateTime | SubjectID |       5 |      | 6667 | Using where; Using

temporary |
| 2 | DERIVED     | UserMsg    | range | SubjectID,CreateTime | CreateTime |       9 | NULL | 1838 | Using where; Using

temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
分析:使用了两个临时表，并且两个临时表做了笛卡尔积，导致不能使用索引并且数据量很大
另外一种left join
select distinct a.RUID
from UserMsg a
left join UserMsg b
    on a.ruid = b.ruid
    and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createtime <='2009-8-17 16:00:00'
and b.ruid is null;
返回444行记录用时 0.07sec
explain 结果
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
| id | select_type | table | type | possible_keys             | key        | key_len | ref          | rows | Extra

                  |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
| 1 | SIMPLE      | a     | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using where;

Using temporary      |
| 1 | SIMPLE      | b     | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using where;

Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
分析：两次查询都是用上了索引，并且查询时同时进行的，所以查询效率应该很高
使用not exists的sql
select distinct a.ruid
from UserMsg a
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createTime <='2009-8-17 16:00:00'
and not exists (
    select distinct RUID
    from UserMsg
    where subjectID =12 and createTime < '2009-8-14 15:30:00'
    and ruid=a.ruid
)
返回444行记录用时 0.08sec
explain 结果
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
| id | select_type        | table   | type | possible_keys             | key        | key_len | ref          | rows | Extra

                      |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
| 1 | PRIMARY            | a       | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using

where; Using temporary |
| 2 | DEPENDENT SUBQUERY | UserMsg | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using

where                  |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
分析：同上基本上是一样的，只是分解了2个查询顺序执行，查询效率低于第3个

为了验证数据查询效率，将上述查询中的subjectID =12的限制条件去掉，结果统计查询时间如下
0.20s
21.31s
0.25s
0.43s

laserhe帮忙分析问题总结
select a.ruid,b.ruid
from(    select distinct RUID
    from UserMsg
    where CreateTime >= '2009-8-14 15:30:00'
    and CreateTime<='2009-8-17 16:00:00'
) a     left join UserMsg b
    on a.ruid = b.ruid
    and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
执行时间0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
| id | select_type | table      | type | possible_keys   | key        | key_len | ref    | rows | Extra

|
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
| 1 | PRIMARY     | <derived2> | ALL   | NULL            | NULL       |    NULL | NULL   | 1248 |

|
| 1 | PRIMARY     | b          | ref   | RUID,CreateTime | RUID       |      96 | a.RUID |    2 | Using where; Not exists

|
| 2 | DERIVED     | UserMsg    | range | CreateTime      | CreateTime |       9 | NULL   | 3553 | Using where; Using

temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
执行效率类似与not in的效率

数据库优化的基本原则：让笛卡尔积发生在尽可能小的集合之间，mysql在join的时候可以直接通过索引来扫描，而嵌入到子查询里头，查询规

划器就不晓得用合适的索引了。
一个SQL在数据库里是这么优化的：首先SQL会分析成一堆分析树，一个树状数据结构，然后在这个数据结构里，查询规划器会查找有没有合适

的索引，然后根据具体情况做一个排列组合，然后计算这个排列组合中的每一种的开销（类似explain的输出的计算机可读版本），然后比较里

面开销最小的，选取并执行之。那么：
explain select a.ruid,b.ruid from(select distinct RUID      from UserMsg       where CreateTime >= '2009-8-14 15:30:00'

and CreateTime<='2009-8-17 16:00:00' ) a left join UserMsg b on a.ruid = b.ruid and b.createTime < '2009-8-14 15:30:00'

where b.ruid is null;
和
explain select add_tb.RUID
    -> from (select distinct RUID
    ->       from UserMsg
    ->       where CreateTime>'2009-8-14 15:30:00'
    ->       and CreateTime<='2009-8-17 16:00:00'
    -> ) add_tb
    ->   where add_tb.RUID
    ->   not in (select distinct RUID
    ->           from UserMsg
    ->           where CreateTime<'2009-8-14 15:30:00'
    -> );
explain
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
| id | select_type        | table      | type           | possible_keys   | key        | key_len | ref | rows | Extra

                |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
| 1 | PRIMARY            | <derived2> | ALL            | NULL            | NULL       |    NULL | NULL | 1248 | Using where

                |
| 3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,CreateTime | RUID       |      96 | func |    2 | Using index;

Using where     |
| 2 | DERIVED            | UserMsg    | range          | CreateTime      | CreateTime |       9 | NULL | 3509 | Using where;

Using temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
开销是完全一样的，开销可以从 rows 那个字段得出（基本上是rows那个字段各个行的数值的乘积，也就是笛卡尔积）
但是呢：下面这个：
explain select a.ruid,b.ruid from(select distinct RUID      from UserMsg       where CreateTime >= '2009-8-14 15:30:00'

and CreateTime<='2009-8-17 16:00:00' ) a left join ( select distinct RUID from UserMsg where createTime < '2009-8-14

15:30:00' ) b on a.ruid = b.ruid where b.ruid is null;
执行时间21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
| id | select_type | table      | type | possible_keys | key        | key_len | ref | rows | Extra

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
| 1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL       |    NULL | NULL | 1248 |

|
| 1 | PRIMARY     | <derived3> | ALL   | NULL          | NULL       |    NULL | NULL | 30308 | Using where; Not exists

|
| 3 | DERIVED     | UserMsg    | ALL   | CreateTime    | NULL       |    NULL | NULL | 69366 | Using where; Using temporary

|
| 2 | DERIVED     | UserMsg    | range | CreateTime    | CreateTime |       9 | NULL | 3510 | Using where; Using temporary

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
我就有些不明白
为何是四行
并且中间两行巨大无比
按理说
查询规划器应该能把这个查询优化得跟前面的两个一样的
（至少在我熟悉的pgsql数据库里我有信心是一样的）
但mysql里头不是
所以我感觉查询规划器里头可能还是糙了点
我前面说过优化的基本原则就是，让笛卡尔积发生在尽可能小的集合之间
那么上面最后一种写法至少没有违反这个原则
虽然b 表因为符合条件的非常多，基本上不会用索引
但是并不应该妨碍查询优化器看到外面的join on条件，从而和前面两个SQL一样，选取主键进行join
不过我前面说过查询规划器的作用
理论上来讲
遍历一遍所有可能，计算一下开销
是合理的
我感觉这里最后一种写法没有遍历完整所有可能
可能的原因是子查询的实现还是比较简单？
子查询对数据库的确是个挑战
因为基本都是递归的东西
所以在这个环节有点毛病并不奇怪
其实你仔细想想，最后一种写法无非是我们第一种写法的一个变种，关键在表b的where 条件放在哪里
放在里面，就不会用索引去join
放在外面就会
这个本身就是排列组合的一个可能

标签：subquery null 问题 mysql join table 数据结构 sql

相关文章

