Composite index on a large table to optimize an aggregate query

We have a large table (around 160 million records) in MySQL 5.5.

MySQL is installed on a machine with 4 GB of RAM.

table schema

+---------------+---------------+------+-----+---------+-------+
| Field         | Type          | Null | Key | Default | Extra |
+---------------+---------------+------+-----+---------+-------+
| domain        | varchar(50)   | YES  | MUL | NULL    |       |
| uid           | varchar(100)  | YES  |     | NULL    |       |
| sid           | varchar(100)  | YES  | MUL | NULL    |       |
| vurl          | varchar(2500) | YES  |     | NULL    |       |
| ip            | varchar(20)   | YES  |     | NULL    |       |
| ref           | varchar(2500) | YES  |     | NULL    |       |
| stats_time    | datetime      | YES  | MUL | NULL    |       |
| country       | varchar(50)   | YES  |     | NULL    |       |
| region        | varchar(50)   | YES  |     | NULL    |       |
| place         | varchar(50)   | YES  |     | NULL    |       |
| email         | varchar(100)  | YES  | MUL | NULL    |       |
+---------------+---------------+------+-----+---------+-------+

Indexes

+------------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table      | Non_unique | Key_name         | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+------------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| visit_views |          1 | sid_index        |            1 | sid         | A         |   157531031 |     NULL | NULL   | YES  | BTREE      |         |               |
| visit_views |          1 | domain_index     |            1 | domain      | A         |          17 |     NULL | NULL   | YES  | BTREE      |         |               |
| visit_views |          1 | email_index      |            1 | email       | A         |      392845 |     NULL | NULL   | YES  | BTREE      |         |               |
| visit_views |          1 | stats_time_index |            1 | stats_time  | A         |    78765515 |     NULL | NULL   | YES  | BTREE      |         |               |
+------------+------------+------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

Example query

SELECT count(*)
  FROM visit_views
 WHERE domain ='our'
   AND email!=''
   AND stats_time BETWEEN '2010-06-21 00:00:00' AND '2015-08-21 00:00:00';

Queries like the one above are very slow, so I want to add a composite index to this table.

I ran the following command:

ALTER TABLE visit_views ADD INDEX domain_statstime_email (domain,stats_time,email);

After I ran this command the table got locked, and the server reached its connection limit (the limit is 1000). Now the table is not responding to any INSERTs or SELECTs.

Here are my questions:

1. Why did the table get locked, and why is it not releasing existing connections?

2. How long will it take to build the index? I started it 3 hours ago and the index is still not created.

3. How can I see the index creation progress?

4. Why did the connection count suddenly climb to the maximum while the index was being added?

5. Is it safe to add composite indexes to a table this large?

6. If I partition this table, will performance be any better?

I don't know much about indexes.

some stats

+---------------------------+
| @@innodb_buffer_pool_size |
+---------------------------+
|                3221225472 |
+---------------------------+

1 answer

#1

Your query has three conditions: an equality (domain), an inequality (email), and a range (stats_time).

WHERE domain ='our'
  AND email!=''
  AND stats_time BETWEEN '2010-06-21 00:00:00' AND '2015-08-21 00:00:00';

To make this work, try each of the following indexes to see which one works better.

 (email, domain, stats_time)
 (domain, email, stats_time)

Why these? MySQL indexes are B-trees; that is, their entries are kept in sorted order. To satisfy the query, MySQL finds the first index entry matching your query, based on domain, email, and the starting stats_time value. It then scans the index sequentially until it reaches the last matching entry. Along the way it counts the entries, and that satisfies your query. In other words, it does a range scan on stats_time.
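
One way to run that comparison is sketched below. The index names are made up for illustration, and on a table this size the ALTER is itself a long, blocking operation, so it is probably best tried on a copy of the table or during a quiet period.

```sql
-- Hypothetical index names; the column orders are the two candidates above.
ALTER TABLE visit_views
  ADD INDEX email_domain_time (email, domain, stats_time),
  ADD INDEX domain_email_time (domain, email, stats_time);

-- Ask the optimizer which index it would pick and how many rows it expects to read.
EXPLAIN
SELECT COUNT(*)
  FROM visit_views
 WHERE domain = 'our'
   AND email != ''
   AND stats_time BETWEEN '2010-06-21 00:00:00' AND '2015-08-21 00:00:00';
```

In the EXPLAIN output, the key column names the index MySQL chose and the rows column gives its estimate of how many entries it will examine. Timing the real query with each index in place is the final arbiter.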

Why the choice? I don't know what MySQL will do with the inequality in your email predicate. That's why I suggest trying both.

If you have not simplified the query you showed us, you might also try a compound covering index on

 (domain, stats_time, email)

This will random-access immediately to the first matching domain/stats_time combination and then scan to the last one. As it scans, it reads the email values from the index itself (that's why this is called a covering index) and picks out the matching rows. Along the way it counts them.
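
To check whether the query is really answered from the index alone, EXPLAIN can be run against it, as in this sketch. It assumes the index from the question's ALTER statement, domain_statstime_email, has finished building; FORCE INDEX is used here only to make the test deterministic.

```sql
-- Assumes the index domain_statstime_email (domain, stats_time, email) exists.
EXPLAIN
SELECT COUNT(*)
  FROM visit_views FORCE INDEX (domain_statstime_email)
 WHERE domain = 'our'
   AND email != ''
   AND stats_time BETWEEN '2010-06-21 00:00:00' AND '2015-08-21 00:00:00';
```

If the Extra column of the output contains "Using index", MySQL is reading only the index and never touching the table rows, which is what makes the index covering for this query.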

You should consider declaring your email column NOT NULL to help your inequality test use its index more efficiently. Read http://use-the-index-luke.com/ for good background information.
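
A sketch of that change follows. It assumes that treating a missing email as an empty string is acceptable for your data, which is my assumption and not something the question states. Both statements touch the whole table, so expect them to be slow and blocking at this size.

```sql
-- Assumption: NULL emails can be stored as '' without losing meaning.
-- The example query's result does not change, because email != ''
-- already excludes NULL rows.
UPDATE visit_views
   SET email = ''
 WHERE email IS NULL;

-- Redeclare the column with the same type, now NOT NULL.
ALTER TABLE visit_views
  MODIFY email VARCHAR(100) NOT NULL DEFAULT '';
```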

As to your questions:

Why did the table get locked, and why is it not releasing existing connections? Why did the connection count suddenly climb to the maximum while the index was being added?

It can take a long time to add an index to a large table. Yours, at 160 megarows, is large. While that indexing operation is going on, other users of the table must wait. So, if you're accessing the table from a web app, the connections pile up waiting for it to become available.
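
The pile-up can be observed directly with a few status queries, sketched here. If the limit is already exhausted, an ordinary account may be refused; MySQL keeps one extra connection available for an account with the SUPER privilege, so an administrative login should still get in.

```sql
-- The configured ceiling (1000 in the question).
SHOW VARIABLES LIKE 'max_connections';

-- How many client connections are open right now.
SHOW GLOBAL STATUS LIKE 'Threads_connected';

-- How many of them are actively executing (or waiting inside) a statement.
SHOW GLOBAL STATUS LIKE 'Threads_running';
```

A Threads_connected value at or near max_connections while the ALTER is running is consistent with clients queuing behind the locked table.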

How long will it take to build the index? I started it 3 hours ago and the index is still not created.

It will be much faster on a quiet system. It is also possible that you have some redundant single-column indexes you could drop. You may wish to copy the table and index the copy, then, when it's ready, rename it.
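
The copy-index-rename approach could look roughly like the sketch below. The names visit_views_new and visit_views_old are hypothetical. Rows written to the original table while the copy is running are not carried over, so this assumes writes are paused or reconciled afterwards.

```sql
-- Empty copy with the same columns and indexes (hypothetical name).
CREATE TABLE visit_views_new LIKE visit_views;

-- Copy the data; on 160 million rows this is a long-running statement.
INSERT INTO visit_views_new
SELECT * FROM visit_views;

-- Index the copy, which no application is using. domain_index becomes
-- redundant because the new index also starts with domain.
ALTER TABLE visit_views_new
  ADD INDEX domain_statstime_email (domain, stats_time, email),
  DROP INDEX domain_index;

-- Swap the tables in one atomic statement.
RENAME TABLE visit_views     TO visit_views_old,
             visit_views_new TO visit_views;
```

Once the swapped-in table has been verified, visit_views_old can be dropped to reclaim the space.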

How can I see the index creation progress?

SHOW FULL PROCESSLIST will display all the activity in your MySQL server. You'll need a command-line interface to issue this command.
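
The same information is available as a table in information_schema, which makes it easier to filter out idle connections, as in this sketch. As far as I know, MySQL 5.5 does not report a completion percentage for ALTER TABLE, so the elapsed time and the state text are the main clues.

```sql
-- Everything the server is doing, including the full statement text.
SHOW FULL PROCESSLIST;

-- The same data, filtered to non-idle threads and sorted by elapsed seconds.
SELECT id, user, time, state, info
  FROM information_schema.PROCESSLIST
 WHERE command <> 'Sleep'
 ORDER BY time DESC;
```

The ALTER TABLE thread should appear near the top with a large time value, and the queued INSERTs and SELECTs should show a waiting state beneath it.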

Is it safe to add composite indexes to a table this large?

Yes, of course, but it takes time on a production system.

If I partition this table, will performance be any better?

Probably not. What WILL help is DELETEing old rows, if you don't need them.
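
Deleting in small batches keeps each statement short so it does not hold locks for long. The sketch below uses a cutoff date and batch size that are placeholders, not recommendations; pick values that match your retention needs.

```sql
-- Hypothetical cutoff and batch size. Repeat this statement (for example
-- from a script) until it reports 0 rows affected.
DELETE FROM visit_views
 WHERE stats_time < '2012-01-01 00:00:00'
 LIMIT 10000;
```

The existing stats_time_index lets each batch find its rows without scanning the whole table.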
