聊一聊MySQL的直方图

前一段时间和一个客户在优化SQL，发现其实收集了直方图以后看起来对执行计划起不到多大影响。

MySQL 8.0后开始支持直方图，我们可以通过analyze table时进行收集，

ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c2, c3 WITH 10 BUCKETS;

ANALYZE TABLE t UPDATE HISTOGRAM ON c1, c3 WITH 10 BUCKETS;

ANALYZE TABLE t DROP HISTOGRAM ON c2;

其中可选的WITH N BUCKETS子句指定直方图的buckets数。N必须为1 ~ 1024之间的整数。如果省略掉，则buckets数为100。

按照我们在ORACLE上的对直方图的理解， CBO优化器可以根据直方图收集的列值分布信息，让选择性高（返回数据行比例少）的列值使用索引，而选择性低（返回数据行比例多）的列值不使用索引。尤其对于存在数据倾斜严重的列而言，直方图很重要。

在MYSQL上我们可以测测看，比如我有个100w行的表，status这一列数据倾斜很严重。

master [localhost:22132] {msandbox} (test) > select status ,count(*) from t100w group by status;
+---------+----------+
| status  | count(*) |
+---------+----------+
| ONLINE  |   990000 |
| OFFLINE |    10000 |
+---------+----------+
2 rows in set (5.80 sec)

那么当我在查询时，没有索引，没有直方图的情况下：

我们可以看到两个条件filtered都是10%：

master [localhost:22132] {msandbox} (test) >  explain  select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) >  explain analyze select * from t100w where status='ONLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'ONLINE')  (cost=105265.82 rows=99401) (actual time=10.226..1625.808 rows=990000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.033..1382.466 rows=1000000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (2.14 sec)
master [localhost:22132] {msandbox} (test) >  explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    10.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) >  explain analyze select * from t100w where status='OFFLINE';
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                            |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'OFFLINE')  (cost=105265.82 rows=99401) (actual time=0.032..1488.844 rows=10000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.028..1333.076 rows=1000000 loops=1)
 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.49 sec)

对status字段收集直方图：

master [localhost:22132] {msandbox} (test) > analyze table t100w  UPDATE HISTOGRAM ON status;
+------------+-----------+----------+---------------------------------------------------+
| Table      | Op        | Msg_type | Msg_text                                          |
+------------+-----------+----------+---------------------------------------------------+
| test.t100w | histogram | status   | Histogram statistics created for column 'status'. |
+------------+-----------+----------+---------------------------------------------------+
1 row in set (1.39 sec)
master [localhost:22132] {msandbox} (test) > SELECT
    ->          TABLE_NAME, COLUMN_NAME,
    ->          HISTOGRAM->>'$."data-type"' AS 'data-type',
    ->          JSON_LENGTH(HISTOGRAM->>'$."buckets"') AS 'bucket-count'
    ->        FROM INFORMATION_SCHEMA.COLUMN_STATISTICS;
+------------+-------------+-----------+--------------+
| TABLE_NAME | COLUMN_NAME | data-type | bucket-count |
+------------+-------------+-----------+--------------+
| t100w      | status      | string    |            2 |
+------------+-------------+-----------+--------------+
1 row in set (0.01 sec)

再来对比执行情况：

确实有了直方图之后，我们看到了filtered 确实更准确了。执行时间也有所降低。

master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |    99.17 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                             |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'ONLINE')  (cost=105265.82 rows=985725) (actual time=9.278..1479.325 rows=990000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.028..1240.229 rows=1000000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.97 sec)
 
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | t100w | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 994008 |     0.83 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+--------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE';
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                         |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (t100w.`status` = 'OFFLINE')  (cost=105265.82 rows=8283) (actual time=0.031..920.213 rows=10000 loops=1)
    -> Table scan on t100w  (cost=105265.82 rows=994008) (actual time=0.029..778.104 rows=1000000 loops=1)
 |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.93 sec)

此时，我再给status字段创建索引，那么这种情况下，如果条件时 status='OFFLINE'，则执行效率会比较高，但如果条件是status='ONLINE'，走索引时得回表，执行效率不比全表扫描好。

master [localhost:22132] {msandbox} (test) > create index idx_t100w_s on t100w(status);
Query OK, 0 rows affected (9.07 sec)
Records: 0  Duplicates: 0  Warnings: 0
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='ONLINE';
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key         | key_len | ref   | rows   | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
|  1 | SIMPLE      | t100w | NULL       | ref  | idx_t100w_s   | idx_t100w_s | 43      | const | 497004 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+--------+----------+-------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='ONLINE';
+----------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                      |
+----------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on t100w using idx_t100w_s (status='ONLINE')  (cost=67295.45 rows=497004) (actual time=0.112..3988.694 rows=990000 loops=1)
 |
+----------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (4.55 sec) 《《《《《《====
master [localhost:22132] {msandbox} (test) > explain select * from t100w where status='OFFLINE';
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key         | key_len | ref   | rows  | filtered | Extra |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
|  1 | SIMPLE      | t100w | NULL       | ref  | idx_t100w_s   | idx_t100w_s | 43      | const | 18512 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+-------------+---------+-------+-------+----------+-------+
1 row in set, 1 warning (0.00 sec)
master [localhost:22132] {msandbox} (test) > explain analyze select * from t100w where status='OFFLINE';
+-------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                   |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| -> Index lookup on t100w using idx_t100w_s (status='OFFLINE')  (cost=17606.99 rows=18512) (actual time=0.074..80.192 rows=10000 loops=1)
 |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.09 sec)

可以看到，此时走了索引，和收集的直方图就没什么关系了。

看看官方文档怎么说吧：

Histogram statistics are useful primarily for nonindexed columns. Adding an index to a column for which histogram statistics are applicable might also help the optimizer make row estimates. The tradeoffs are:

l An index must be updated when table data is modified.

l A histogram is created or updated only on demand, so it adds no overhead when table data is modified. On the other hand, the statistics become progressively more out of date when table modifications occur, until the next time they are updated.

The optimizer prefers range optimizer row estimates to those obtained from histogram statistics. If the optimizer determines that the range optimizer applies, it does not use histogram statistics.

For columns that are indexed, row estimates can be obtained for equality comparisons using index dives (see ). In this case, histogram statistics are not necessarily useful because index dives can yield better estimates.

慢慢改进吧，我们太想任何其他数据库都一下子能像ORACLE一样强大了。

秒客网

聊一聊MySQL的直方图

相关文章