MySQL query with a JOIN not using an index

Time: 2021-08-17 03:46:46

I have the following two tables in MySQL (simplified).

  • clicks (InnoDB)
    • Contains roughly 70,000,000 records
    • Has an index on the date_added column
    • Has a column link_id which refers to a record in the links table
  • links (MyISAM)
    • Contains far fewer records, roughly 65,000

I'm trying to run some analytical queries using these tables. I need to pull out data about clicks that occurred between two specified dates, while applying other user-selected filters that use other tables joined to the links table.

My question, however, is about the use of indexes. When I run the following query:

SELECT
    COUNT(1)
FROM
    clicks
WHERE
    date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-11-03 23:59:59';

I get a response back in 1.40 sec. Using EXPLAIN I find that MySQL uses the index on the date_added column as expected.

EXPLAIN SELECT COUNT(1) FROM clicks WHERE date_added >= '2016-11-01 00:00:00' AND date_added <= '2016-11-16 23:59:59';
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
| id | select_type | table  | type  | possible_keys | key        | key_len | ref  | rows    | Extra                    |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+
|  1 | SIMPLE      | clicks | range | date_added    | date_added | 4       | NULL | 1559288 | Using where; Using index |
+----+-------------+--------+-------+---------------+------------+---------+------+---------+--------------------------+

However, when I LEFT JOIN my links table, I find that the query takes much longer to execute:

SELECT
    COUNT(1) AS clicks
FROM
    clicks AS c
LEFT JOIN links AS l ON l.id = c.link_id
WHERE
    c.date_added >= '2016-11-01 00:00:00'
AND c.date_added <= '2016-11-16 23:59:59';

This completed in 6.50 sec. Using EXPLAIN I find that the index on the date_added column was not used the way I expected:

EXPLAIN SELECT COUNT(1) AS clicks FROM clicks AS c LEFT JOIN links AS l ON l.id = c.link_id WHERE c.date_added >= '2016-11-01 00:00:00' AND c.date_added <= '2016-11-16 23:59:59';
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
| id | select_type | table | type   | possible_keys | key        | key_len | ref           | rows    | Extra       |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+
|  1 | SIMPLE      | c     | range  | date_added    | date_added | 4       | NULL          | 6613278 | Using where |
|  1 | SIMPLE      | l     | eq_ref | PRIMARY       | PRIMARY    | 4       | c.link_id     |       1 | Using index |
+----+-------------+-------+--------+---------------+------------+---------+---------------+---------+-------------+

As you can see, the index isn't being used for the date_added column in the larger table, and the query takes far longer. This seems to get even worse when I join in other tables.

Does anyone know why this is happening, or whether there's anything I can do to get it to use the index on the date_added column in the clicks table?


Edit

I've just attempted to get my stats out of the database using a different method. The first step involves pulling out a distinct set of link_ids from the clicks table. I'm seeing the same problem here again, this time without a JOIN. The index is not being used:

My query:

SELECT
    DISTINCT(link_id) AS link_id
FROM
    clicks
WHERE
    date_added >= '2016-11-01 00:00:00'
AND date_added <= '2016-12-05 10:16:00'

This query took almost a minute to complete. I ran an EXPLAIN on it and found that the query is not using the index as I expected it would:

+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
| id | select_type | table   | type  | possible_keys | key      | key_len | ref  | rows     | Extra       |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+
|  1 | SIMPLE      | clicks  | index | date_added    | link_id  | 4       | NULL | 79786609 | Using where |
+----+-------------+---------+-------+---------------+----------+---------+------+----------+-------------+

I expected that it would use the index on date_added to filter down the result set and then pull out the distinct link_id values. Any idea why this is happening? I have an index on link_id as well as on date_added.

2 Answers

#1



Not absolutely sure, but consider moving the condition from the WHERE clause to the JOIN ON clause. Since you are performing an outer join (LEFT JOIN), the placement can make a difference in performance, unlike an inner join, where a condition in the WHERE clause and the same condition in the ON clause are equivalent.

SELECT COUNT(1) AS clicks 
FROM clicks AS c 
LEFT JOIN links AS l ON l.id = c.link_id 
AND (c.date_added >= '2016-11-01 00:00:00' 
AND c.date_added <= '2016-11-16 23:59:59');
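
One thing worth checking before adopting this rewrite: on a LEFT JOIN, a condition on the left-hand table placed in ON no longer filters rows out of the result; it only decides whether a links row gets attached. The sketch below uses two hypothetical temporary tables (clicks_demo and links_demo, which are not part of the original schema) to show that the two placements can return different counts.

```sql
-- Hypothetical toy tables; names and data are assumptions for illustration only.
DROP TEMPORARY TABLE IF EXISTS clicks_demo, links_demo;
CREATE TEMPORARY TABLE links_demo (id INT PRIMARY KEY);
CREATE TEMPORARY TABLE clicks_demo (
    id         INT PRIMARY KEY,
    link_id    INT,
    date_added DATETIME
);
INSERT INTO links_demo VALUES (1), (2);
INSERT INTO clicks_demo VALUES
    (1, 1, '2016-11-02 10:00:00'),
    (2, 2, '2016-11-10 11:00:00'),
    (3, 1, '2016-12-01 12:00:00');  -- outside the date range

-- Date filter in WHERE: only the two November clicks are counted (result: 2).
SELECT COUNT(*) AS clicks_where
  FROM clicks_demo AS c
  LEFT JOIN links_demo AS l ON l.id = c.link_id
 WHERE c.date_added >= '2016-11-01'
   AND c.date_added <  '2016-11-17';

-- Date filter in ON: every clicks_demo row survives the LEFT JOIN (result: 3).
SELECT COUNT(*) AS clicks_on
  FROM clicks_demo AS c
  LEFT JOIN links_demo AS l ON l.id = c.link_id
   AND c.date_added >= '2016-11-01'
   AND c.date_added <  '2016-11-17';
```

If the count must be restricted to the date range, the condition on clicks needs to stay in WHERE.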

#2



Do you want to use an ordinary JOIN in place of the LEFT JOIN? LEFT JOIN preserves all the rows of the left-hand table (clicks), so, as long as links.id is unique, it will yield the same value of COUNT() as the unjoined table. If you want to count only the rows from clicks that have matching rows in links, use JOIN, not LEFT JOIN.
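
A small sketch of that difference, using hypothetical temporary tables (clicks_demo and links_demo are illustrative names, not the real schema) in which one click points at a link that does not exist:

```sql
-- Hypothetical toy tables; names and data are assumptions for illustration only.
DROP TEMPORARY TABLE IF EXISTS clicks_demo, links_demo;
CREATE TEMPORARY TABLE links_demo (id INT PRIMARY KEY);
CREATE TEMPORARY TABLE clicks_demo (id INT PRIMARY KEY, link_id INT);
INSERT INTO links_demo VALUES (1), (2);
INSERT INTO clicks_demo VALUES (1, 1), (2, 2), (3, 99);  -- link 99 has no row in links_demo

-- LEFT JOIN keeps every clicks_demo row, matched or not (result: 3).
SELECT COUNT(*) AS left_join_count
  FROM clicks_demo AS c
  LEFT JOIN links_demo AS l ON l.id = c.link_id;

-- Inner JOIN keeps only clicks with a matching link (result: 2).
SELECT COUNT(*) AS inner_join_count
  FROM clicks_demo AS c
  JOIN links_demo AS l ON l.id = c.link_id;
```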

Try dropping your index on date_added and replacing it with a compound index on (date_added, link_id). For this query, that index is a covering index: when the query planner knows it can get everything it needs from an index, it doesn't have to bounce back to the table. In this case the query planner can random-access the index to the beginning of your date range, then do an index range scan to the end of the range. It's still going to have to refer to the other table, though.
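
A minimal sketch of that change, run against the question's clicks table. It assumes the existing single-column index is named date_added (as the key column of the EXPLAIN output suggests); idx_date_added_link_id is a hypothetical name for the new index. On a table of this size the ALTER may take a while, so it is probably best tried on a copy first.

```sql
-- Assumption: the existing index is named date_added.
-- idx_date_added_link_id is a hypothetical name chosen for this example.
ALTER TABLE clicks
    ADD INDEX idx_date_added_link_id (date_added, link_id),
    DROP INDEX date_added;

-- Re-check the plan. If the new index covers the query,
-- the Extra column for table c should include "Using index".
EXPLAIN
SELECT COUNT(*) AS clicks
  FROM clicks AS c
  LEFT JOIN links AS l ON l.id = c.link_id
 WHERE c.date_added >= '2016-11-01'
   AND c.date_added <  '2016-11-17';
```

The new index still begins with date_added, so queries that relied on the old single-column index can use it as well.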

(Edit) For the sake of experimentation, try a narrower date range and see if the EXPLAIN output changes. If it does, the query planner might be misjudging your date_added column's cardinality.

You might try an index hint. For example, try:

SELECT COUNT(1) AS clicks
  FROM clicks AS c USE INDEX (date_added)
  LEFT JOIN links AS l ON l.id = c.link_id
 WHERE etc

But, judging from your EXPLAIN output, you're already doing a range scan on date_added. Your next step, like it or not, is the compound covering index.

Make sure there's an index on links(id). There probably is, because it's probably the PK.

Try using COUNT(*) instead of COUNT(1). It probably won't make a difference, but it's worth a try. COUNT(*) simply counts rows rather than evaluating something for each row it counts.

(Nitpick) Your date range smells funny. Use < for the end of your range for best results, so the range doesn't depend on the column's time precision, like so:

 WHERE c.date_added >= '2016-11-01'
   AND c.date_added <  '2016-11-17';

Edit: Look, the MySQL query planner uses lots of internal knowledge about how tables are structured. And, as of late 2016, it can generally use only one index per table to satisfy a query (index-merge optimizations aside). That's a limitation.

SELECT DISTINCT column is actually a fairly complex query, because it has to de-dupe the column in question. If there's an index on that column, the query planner is likely to use it. Choosing that index means it cannot choose some other index.

Compound indexes (covering indexes) sometimes, but not always, resolve this kind of index-selection dilemma by letting one index serve both purposes. You can read about all this at http://use-the-index-luke.com/
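
To see whether the planner's choice of the link_id index is what slows the DISTINCT query, one option is to compare plans with and without an index hint. This is only a sketch against the question's clicks table; it assumes the index on date_added is named date_added, as the EXPLAIN output suggests. The hint is meant as a diagnostic here, not as a permanent fix.

```sql
-- Plan the optimizer picks on its own
-- (the question shows it scanning the link_id index).
EXPLAIN
SELECT DISTINCT link_id
  FROM clicks
 WHERE date_added >= '2016-11-01 00:00:00'
   AND date_added <= '2016-12-05 10:16:00';

-- Same query, restricted to the date_added index for comparison.
-- Assumption: the index is named date_added.
EXPLAIN
SELECT DISTINCT link_id
  FROM clicks FORCE INDEX (date_added)
 WHERE date_added >= '2016-11-01 00:00:00'
   AND date_added <= '2016-12-05 10:16:00';
```

If a compound index on (date_added, link_id) exists, naming it in the hint instead lets the range filter and the link_id values come from the same index; whether that is faster depends on how many rows fall inside the range.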

But if your operational constraints prevent you from adding compound indexes, you'll need to live with the one-second query. It isn't that bad.

Of course, saying you can't add compound indexes to get your job done is like this:

A: stuff is falling off my truck on the freeway.

B: put a tarp over the stuff and tie it down.

A: my boss won't let me put a tarp on the truck.

B: well, then, drive slow.
