mysql not in 和 left join 效率问题记录

时间:2022-08-29 21:57:27

声明:感谢laserhe, denniswwhACMAIN_CHMvinsonshen 的热心帮助

首先说明该条sql的功能是查询集合a不在集合b的数据。
not in的写法
select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID =12
      and CreateTime>'2009-8-14 15:30:00'
      and CreateTime<='2009-8-17 16:00:00'
) add_tb
  where add_tb.RUID
  not in (select distinct RUID
          from UserMsg
          where SubjectID =12
      and CreateTime<'2009-8-14 15:30:00'
)
返回444行记录用时 0.07sec
explain 结果
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
| id | select_type        | table      | type           | possible_keys             | key        | key_len | ref  | rows |

Extra                        |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL                      | NULL       |    NULL | NULL |  452 |

Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,SubjectID,CreateTime | RUID       |      96 | func |    2 |

Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | SubjectID,CreateTime      | CreateTime |       9 | NULL | 1857 |

Using where; Using temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+--

----------------------------+
分析:该条查询速度快原因为id=2的sql查询出来的结果比较少,所以id=1sql所以运行速度比较快,id=2的使用了临时表,不知道这个时候是否使用索引?
其中一种left join
select a.ruid,b.ruid
from(select distinct RUID
     from UserMsg
     where SubjectID =12
     and CreateTime >= '2009-8-14 15:30:00'
     and CreateTime<='2009-8-17 16:00:00'
) a  left join  (
    select distinct RUID
    from UserMsg
    where SubjectID =12 and CreateTime< '2009-8-14 15:30:00' 
) b on a.ruid = b.ruid
where b.ruid is null
返回444行记录用时 0.39sec
explain 结果
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
| id | select_type | table      | type  | possible_keys        | key        | key_len | ref  | rows | Extra                  

     |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL                 | NULL       |    NULL | NULL |  452 |                        

     |
|  1 | PRIMARY     | <derived3> | ALL   | NULL                 | NULL       |    NULL | NULL | 1112 | Using where; Not exists

     |
|  3 | DERIVED     | UserMsg    | ref   | SubjectID,CreateTime | SubjectID  |       5 |      | 6667 | Using where; Using

temporary |
|  2 | DERIVED     | UserMsg    | range | SubjectID,CreateTime | CreateTime |       9 | NULL | 1838 | Using where; Using

temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+-----------------------

-------+
分析:使用了两个临时表,并且两个临时表做了笛卡尔积,导致不能使用索引并且数据量很大
另外一种left join
select distinct a.RUID
from UserMsg a
left join UserMsg b
    on a.ruid = b.ruid
    and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createtime <='2009-8-17 16:00:00'
and b.ruid is null;
返回444行记录用时 0.07sec
explain 结果
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
| id | select_type | table | type  | possible_keys             | key        | key_len | ref          | rows | Extra          

                  |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
|  1 | SIMPLE      | a     | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using where;

Using temporary      |
|  1 | SIMPLE      | b     | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using where;

Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+---------------

--------------------+
分析:两次查询都是用上了索引,并且查询时同时进行的,所以查询效率应该很高
使用not exists的sql
select distinct a.ruid
from UserMsg a
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createTime <='2009-8-17 16:00:00'
and not exists (
    select distinct RUID
    from UserMsg
    where subjectID =12 and createTime  < '2009-8-14 15:30:00'
    and ruid=a.ruid
)
返回444行记录用时 0.08sec
explain 结果
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
| id | select_type        | table   | type  | possible_keys             | key        | key_len | ref          | rows | Extra 

                      |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
|  1 | PRIMARY            | a       | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using

where; Using temporary |
|  2 | DEPENDENT SUBQUERY | UserMsg | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using

where                  |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------

------------------------+
分析:同上基本上是一样的,只是分解了2个查询顺序执行,查询效率低于第3个

为了验证数据查询效率,将上述查询中的subjectID =12的限制条件去掉,结果统计查询时间如下
0.20s
21.31s
0.25s
0.43s

laserhe帮忙分析问题总结
select a.ruid,b.ruid 
from(    select distinct RUID     
    from UserMsg      
    where CreateTime >= '2009-8-14 15:30:00'      
    and CreateTime<='2009-8-17 16:00:00'
) a      left join  UserMsg  b
    on a.ruid = b.ruid 
    and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
执行时间0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
| id | select_type | table      | type  | possible_keys   | key        | key_len | ref    | rows | Extra                     

  |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
|  1 | PRIMARY     | <derived2> | ALL   | NULL            | NULL       |    NULL | NULL   | 1248 |                           

  |
|  1 | PRIMARY     | b          | ref   | RUID,CreateTime | RUID       |      96 | a.RUID |    2 | Using where; Not exists   

  |
|  2 | DERIVED     | UserMsg    | range | CreateTime      | CreateTime |       9 | NULL   | 3553 | Using where; Using

temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+--------------------------

----+
执行效率类似与not in的效率

数据库优化的基本原则:让笛卡尔积发生在尽可能小的集合之间,mysql在join的时候可以直接通过索引来扫描,而嵌入到子查询里头,查询规

划器就不晓得用合适的索引了。
一个SQL在数据库里是这么优化的:首先SQL会分析成一堆分析树,一个树状数据结构,然后在这个数据结构里,查询规划器会查找有没有合适

的索引,然后根据具体情况做一个排列组合,然后计算这个排列组合中的每一种的开销(类似explain的输出的计算机可读版本),然后比较里

面开销最小的,选取并执行之。那么:
explain select a.ruid,b.ruid  from(select distinct RUID      from UserMsg       where CreateTime >= '2009-8-14 15:30:00'     

 and CreateTime<='2009-8-17 16:00:00' ) a  left join  UserMsg  b on a.ruid = b.ruid  and b.createTime < '2009-8-14 15:30:00'

where b.ruid is null;

explain select add_tb.RUID
    -> from (select distinct RUID
    ->       from UserMsg
    ->       where  CreateTime>'2009-8-14 15:30:00'
    ->       and CreateTime<='2009-8-17 16:00:00'
    -> ) add_tb
    ->   where add_tb.RUID
    ->   not in (select distinct RUID
    ->           from UserMsg
    ->           where CreateTime<'2009-8-14 15:30:00'
    -> );
explain
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
| id | select_type        | table      | type           | possible_keys   | key        | key_len | ref  | rows | Extra       

                |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL            | NULL       |    NULL | NULL | 1248 | Using where 

                |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,CreateTime | RUID       |      96 | func |    2 | Using index;

Using where     |
|  2 | DERIVED            | UserMsg    | range          | CreateTime      | CreateTime |       9 | NULL | 3509 | Using where;

Using temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------

------------------+
开销是完全一样的,开销可以从 rows 那个字段得出(基本上是rows那个字段各个行的数值的乘积,也就是笛卡尔积)
但是呢:下面这个:
explain select a.ruid,b.ruid  from(select distinct RUID      from UserMsg       where CreateTime >= '2009-8-14 15:30:00'     

 and CreateTime<='2009-8-17 16:00:00' ) a  left join  (  select distinct RUID  from UserMsg where createTime < '2009-8-14

15:30:00' ) b on a.ruid = b.ruid  where b.ruid is null;
执行时间21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
| id | select_type | table      | type  | possible_keys | key        | key_len | ref  | rows  | Extra                       

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL       |    NULL | NULL |  1248 |                             

|
|  1 | PRIMARY     | <derived3> | ALL   | NULL          | NULL       |    NULL | NULL | 30308 | Using where; Not exists     

|
|  3 | DERIVED     | UserMsg    | ALL   | CreateTime    | NULL       |    NULL | NULL | 69366 | Using where; Using temporary

|
|  2 | DERIVED     | UserMsg    | range | CreateTime    | CreateTime |       9 | NULL |  3510 | Using where; Using temporary

|
+----+-------------+------------+-------+---------------+------------+---------+------+-------+-----------------------------

-+
我就有些不明白
为何是四行
并且中间两行巨大无比
按理说
查询规划器应该能把这个查询优化得跟前面的两个一样的
(至少在我熟悉的pgsql数据库里我有信心是一样的)
但mysql里头不是
 所以我感觉查询规划器里头可能还是糙了点
 我前面说过优化的基本原则就是,让笛卡尔积发生在尽可能小的集合之间
那么上面最后一种写法至少没有违反这个原则
虽然b 表因为符合条件的非常多,基本上不会用索引
但是并不应该妨碍查询优化器看到外面的join on条件,从而和前面两个SQL一样,选取主键进行join
 不过我前面说过查询规划器的作用
 理论上来讲
 遍历一遍所有可能,计算一下开销
 是合理的
 我感觉这里最后一种写法没有遍历完整所有可能
 可能的原因是子查询的实现还是比较简单?
 子查询对数据库的确是个挑战
 因为基本都是递归的东西
 所以在这个环节有点毛病并不奇怪
 其实你仔细想想,最后一种写法无非是我们第一种写法的一个变种,关键在表b的where 条件放在哪里
 放在里面,就不会用索引去join
 放在外面就会
 这个本身就是排列组合的一个可能