MySQL index performance on large tables

TL;DR: I have a query on 2 huge tables. There are no indexes, and it is slow. So I built indexes, and it got slower. Why does this make sense? What is the correct way to optimize it?

The background:

I have 2 tables

person, a table containing information about people (id, birthdate)
works_in, a 0-N relation between person and a department; works_in contains id, person_id, department_id.

They are InnoDB tables, and it is sadly not an option to switch to MyISAM as data integrity is a requirement.

Those 2 tables are huge, and have no indexes except the PRIMARY key on their respective id.

I'm trying to get the age of the youngest person in each department, and here is the query I've come up with:

SELECT MAX(YEAR(person.birthdate)) as max_year, works_in.department as department
    FROM person
    INNER JOIN works_in
        ON works_in.person_id = person.id
    WHERE person.birthdate IS NOT NULL
    GROUP BY works_in.department

The query works, but I'm dissatisfied with its performance, as it takes ~17s to run. This is expected, as the data is huge and needs to be written to disk, and there are no indexes on the tables.

EXPLAIN for this query gives:

| id | select_type | table   | type   | possible_keys | key     | key_len | ref                      | rows     | Extra                           | 
|----|-------------|---------|--------|---------------|---------|---------|--------------------------|----------|---------------------------------| 
| 1  | SIMPLE      | works_in| ALL    | NULL          | NULL    | NULL    | NULL                     | 22496409 | Using temporary; Using filesort | 
| 1  | SIMPLE      | person  | eq_ref | PRIMARY       | PRIMARY | 4       | dbtest.works_in.person_id| 1        | Using where                     |

I built a bunch of indexes for the 2 tables:

/* For works_in */
CREATE INDEX person_id ON works_in(person_id);
CREATE INDEX department_id ON works_in(department_id);
CREATE INDEX department_id_person ON works_in(department_id, person_id);
CREATE INDEX person_department_id ON works_in(person_id, department_id);
/* For person */
CREATE INDEX birthdate ON person(birthdate);

EXPLAIN shows an improvement, at least as I understand it, since it now uses an index and scans fewer rows.

| id | select_type | table   | type  | possible_keys                                    | key                  | key_len | ref              | rows   | Extra                                                 | 
|----|-------------|---------|-------|--------------------------------------------------|----------------------|---------|------------------|--------|-------------------------------------------------------| 
| 1  | SIMPLE      | person  | range | PRIMARY,birthdate                                | birthdate            | 4       | NULL             | 267818 | Using where; Using index; Using temporary; Using f... | 
| 1  | SIMPLE      | works_in| ref   | person,department_id_person,person_department_id | person_department_id | 4       | dbtest.person.id | 3      | Using index                                           |

However, the execution time of the query has doubled (from ~17s to ~35s).

Why does this make sense, and what is the correct way to optimize this?

EDIT

Using Gordon Linoff's answer (first one), the execution time is ~9s (half of the initial time). Choosing good indexes does seem to help, but the execution time is still pretty high. Any other ideas on how to improve on this?

More information concerning the dataset:

There are about 5'000'000 records in the person table.
Of those, only 130'000 have a valid (not NULL) birthdate.
I do have a department table, which contains about 3'000'000 records (they are actually projects and not departments).

2 Solutions

#1

For this query:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department;

The best indexes are: person(birthdate, id) and works_in(person_id, department). These are covering indexes for the query and save the extra cost of reading data pages.
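
As a sketch, the two indexes could be created and checked as follows. The index names are made up for illustration, and the column is written as department_id, as in the question's table description (the queries above call it department), so adjust it to the real schema. Note that the question's existing person_department_id index already has the same definition as the second one.

```sql
-- Hypothetical index names; department_id is assumed from the question's schema.
CREATE INDEX idx_person_birthdate_id ON person (birthdate, id);
CREATE INDEX idx_works_in_person_department ON works_in (person_id, department_id);

-- Re-check the plan afterwards. "Using index" in the Extra column for both
-- tables indicates that both indexes cover the query.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department_id AS department
FROM person p
INNER JOIN works_in wi
    ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;
```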

By the way, unless a lot of persons have NULL birthdates (i.e. there are departments where everyone has a NULL birthdate), the query is basically equivalent to:

SELECT MAX(YEAR(p.birthdate)) as max_year, wi.department as department
FROM person p INNER JOIN
     works_in wi
     ON wi.person_id = p.id
GROUP BY wi.department;

For this, the best indexes are person(id, birthdate) and works_in(person_id, department).

EDIT:

I cannot think of an easy way to solve the problem. One solution is more powerful hardware.

If you really need this information quickly, then additional work is needed.

One approach is to add a maximum birth date to the departments table, and add triggers. For works_in, you need triggers for update, insert, and delete. For persons, only update (presumably the insert and delete would be handled by works_in). This saves the final group by, which should be a big savings.
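
A minimal sketch of that idea, covering only the insert case on works_in. It assumes the departments table is named department with primary key id (as the question's edit suggests), and the column name max_birthdate and the trigger name are hypothetical. The update and delete triggers are not shown; they generally have to recompute the maximum for the affected department, because the removed row may have been the one holding it.

```sql
-- Hypothetical summary column on the departments table.
ALTER TABLE department ADD COLUMN max_birthdate DATE NULL;

-- One-time backfill from the existing data.
UPDATE department d
JOIN (
    SELECT wi.department_id, MAX(p.birthdate) AS max_birthdate
    FROM works_in wi
    JOIN person p ON p.id = wi.person_id
    WHERE p.birthdate IS NOT NULL
    GROUP BY wi.department_id
) m ON m.department_id = d.id
SET d.max_birthdate = m.max_birthdate;

-- Keep the column current when a person is assigned to a department.
CREATE TRIGGER works_in_after_insert
AFTER INSERT ON works_in
FOR EACH ROW
    UPDATE department d
    JOIN person p ON p.id = NEW.person_id
    SET d.max_birthdate = p.birthdate
    WHERE d.id = NEW.department_id
      AND p.birthdate IS NOT NULL
      AND (d.max_birthdate IS NULL OR p.birthdate > d.max_birthdate);

-- The report then reads the precomputed value without a join or GROUP BY.
SELECT id AS department, YEAR(max_birthdate) AS max_year
FROM department
WHERE max_birthdate IS NOT NULL;
```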

A simpler approach is to add a maximum birth date just to works_in. However, you will still need a final aggregation, and that might be expensive.
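
One possible reading of this, sketched below, is to copy each person's birthdate into works_in so that the report no longer needs the join. The column name person_birthdate and the index name are hypothetical, and the copied value still has to be kept in sync when person.birthdate changes (for example with an update trigger on person, not shown). On a table with around 22 million rows the backfill itself will take a while.

```sql
-- Hypothetical denormalized column on works_in.
ALTER TABLE works_in ADD COLUMN person_birthdate DATE NULL;

-- One-time backfill from person.
UPDATE works_in wi
JOIN person p ON p.id = wi.person_id
SET wi.person_birthdate = p.birthdate;

-- Index that covers the aggregation below.
CREATE INDEX idx_works_in_department_birthdate
    ON works_in (department_id, person_birthdate);

-- The final aggregation is still needed, but it now reads a single index.
SELECT department_id AS department, YEAR(MAX(person_birthdate)) AS max_year
FROM works_in
WHERE person_birthdate IS NOT NULL
GROUP BY department_id;
```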

#2

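Indexes are not automatically a win on either storage engine. They speed up reads only when the optimizer can use them to narrow down or cover the query, and every extra index adds write and storage overhead. In InnoDB, a secondary index entry stores the primary key, so fetching any column that is not in the index costs a second lookup in the clustered index.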

Add indexes on the columns that you expect to query the most. The more complex the data relationships grow, especially when a table is related to itself (such as in self joins), the harder it becomes to keep each query fast.

With an index, the engine first uses the index to find the matching values, which is fast. Then it has to use those matches to look up the actual rows in the table. If the index doesn't narrow down the number of rows much, it can be faster to simply scan all the rows in the table.
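
One way to test whether an index is actually helping is to compare plans and timings with and without it, using MySQL's index hints. The sketch below hides the birthdate index created in the question from the optimizer, and then forces the original join order with STRAIGHT_JOIN. The column is written as department_id, as in the question's table description, which is an assumption about the real schema.

```sql
-- Plan with the birthdate index hidden from the optimizer.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department_id AS department
FROM person p IGNORE INDEX (birthdate)
INNER JOIN works_in wi
    ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;

-- Plan with the join order fixed as written: works_in first, then person.
EXPLAIN
SELECT MAX(YEAR(p.birthdate)) AS max_year, wi.department_id AS department
FROM works_in wi
STRAIGHT_JOIN person p
    ON wi.person_id = p.id
WHERE p.birthdate IS NOT NULL
GROUP BY wi.department_id;
```

Running each variant without EXPLAIN and timing it shows which plan is faster on the real data; the row estimates in EXPLAIN alone do not.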

When to add an index on a SQL table field (MySQL)?

When to use MyISAM and InnoDB?

https://dba.stackexchange.com/questions/1/what-are-the-main-differences-between-innodb-and-myisam
