Slow MySQL query on a large table

Time: 2021-10-14 16:54:32

I have a statistics table with ~600k records in it, on which I perform the following (raw SQL) query to get statistical data for a graph:

SELECT 
(UNIX_TIMESTAMP(s.date)*1000+3600000) as time,
ROUND((s.loadtime / s.loadtimeMeasurements), 3) as loadtime 
FROM mw_statistics s 
WHERE s.type = 0 
    AND s.date >= '2013-02-01 07:52:06' 
    AND s.date <= '2013-02-01 11:52:06' 
    AND s.product_id IN (1,8,9,10,11) 
GROUP BY s.date

This query takes approximately 1 second to complete. I would like it to take just a few hundred ms. Any idea how I might improve this query? I am using Symfony2/Doctrine with a MySQL database and the InnoDB engine.

Regards, Jasper

Here's a structure dump of the table:

CREATE TABLE IF NOT EXISTS `mw_statistics` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`contentErrors` smallint(6) DEFAULT NULL,
`contentMeasurements` smallint(6) DEFAULT NULL,
`thirdpartyErrors` smallint(6) DEFAULT NULL,
`thirdpartyMeasurements` smallint(6) DEFAULT NULL,
`applicationErrors` smallint(6) DEFAULT NULL,
`applicationMeasurements` smallint(6) DEFAULT NULL,
`loadtime` double NOT NULL,
`loadtimeMeasurements` smallint(6) NOT NULL,
`unavailable` smallint(6) DEFAULT NULL,
`unavailableMeasurements` smallint(6) DEFAULT NULL,
`type` smallint(6) NOT NULL,
`step` smallint(6) DEFAULT NULL,
`date` datetime NOT NULL,
`status` smallint(6) DEFAULT NULL,
`url` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`product_id` int(11) DEFAULT NULL,
`script_id` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `IDX_FC665E6F4584665A` (`product_id`),
KEY `IDX_FC665E6FA1C01850` (`script_id`),
KEY `date` (`date`) 
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=2105417 ;

Notice that the combination is unique: (type=0, product_id, date) or (type=1, script_id, step, date).
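
If you want the database to enforce that, a minimal sketch (key names are made up) would be two composite unique keys. MySQL allows repeated NULLs in a unique index, so type=1 rows (where product_id is NULL) don't collide under the first key, and type=0 rows (where script_id is NULL) don't collide under the second:

ALTER TABLE mw_statistics
    ADD UNIQUE KEY uniq_type_product_date (type, product_id, date),
    ADD UNIQUE KEY uniq_type_script_step_date (type, script_id, step, date);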

3 Answers

#1

Create an index on date & id. In the WHERE condition, put AND p.id IN (1,8,9,10,11) after s.type = 0; I hope that makes your query faster than before.
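
A hedged reading of this suggestion, assuming "id" here means the product_id column of mw_statistics (the index name is made up):

CREATE INDEX idx_stats_date_product ON mw_statistics (date, product_id);

Whether date or product_id should come first depends on selectivity; the third answer below argues for putting type and product_id ahead of date.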

#2

Do you really need to join with mw_brands? You're not using any data from it, so its only purpose right now is to make sure that mw_statistics is related (through mw_products) to a row in mw_brands?

If you don't need it, remove both joins and change out p.id in (1,8,9,10,11) for s.product_id in (1,8,9,10,11).
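
The joins are no longer visible in the query as posted, but assuming the earlier version looked roughly like the commented shape below (p.brand_id is a guess), the simplification this answer suggests would be:

-- Assumed original shape, joins used only for filtering:
--   FROM mw_statistics s
--   JOIN mw_products p ON p.id = s.product_id
--   JOIN mw_brands b ON b.id = p.brand_id
--   WHERE s.type = 0 AND p.id IN (1,8,9,10,11) ...
SELECT
    (UNIX_TIMESTAMP(s.date)*1000+3600000) AS time,
    ROUND((s.loadtime / s.loadtimeMeasurements), 3) AS loadtime
FROM mw_statistics s
WHERE s.type = 0
  AND s.date >= '2013-02-01 07:52:06'
  AND s.date <= '2013-02-01 11:52:06'
  AND s.product_id IN (1,8,9,10,11)
GROUP BY s.date;

which is the form already shown in the question.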

#3

To be completely sure of the reasons, I'd need the execution plan (obtained with EXPLAIN).
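
Getting that plan is just a matter of prefixing the query with EXPLAIN:

EXPLAIN
SELECT
    (UNIX_TIMESTAMP(s.date)*1000+3600000) AS time,
    ROUND((s.loadtime / s.loadtimeMeasurements), 3) AS loadtime
FROM mw_statistics s
WHERE s.type = 0
  AND s.date >= '2013-02-01 07:52:06'
  AND s.date <= '2013-02-01 11:52:06'
  AND s.product_id IN (1,8,9,10,11)
GROUP BY s.date;

A full table scan shows up as type: ALL with a rows estimate close to the table size; the index actually chosen, if any, appears in the key column.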

In a pinch, I'd guess there's one or more full table scans involved, due to improper/missing indexes.

You want an INDEX on mw_statistics based on type, date, product_id in this order:

 CREATE INDEX mw_ndx ON mw_statistics ( type, date, product_id )

You could also try moving the condition on p.id to s:

WHERE s.type = 0
    AND s.date >= '2013-02-01 06:12:32' AND s.date <= '2013-02-01 10:12:30'
    AND s.product_id IN (1,8,9,10,11)

...in which case your index would probably perform better like this:

 CREATE INDEX mw_ndx ON mw_statistics ( type, product_id, date )

A closer look

You have a column called date, yet you range it using a datetime, and group on it, without any aggregate functions. It might be the case that you always want to query a single day, and the GROUP BY is then superfluous. If the column held a datetime, you would have very granular (probably useless) groups of very few items, in most cases a single one.
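
If the graph really wants one point per interval rather than one per raw timestamp, a sketch that makes the GROUP BY meaningful (assuming hourly buckets, which is a guess) would aggregate explicitly:

SELECT
    (UNIX_TIMESTAMP(DATE_FORMAT(date, '%Y-%m-%d %H:00:00'))*1000+3600000) AS time,
    ROUND(SUM(loadtime) / SUM(loadtimeMeasurements), 3) AS loadtime
FROM mw_statistics
WHERE type = 0
  AND product_id IN (1,8,9,10,11)
  AND date BETWEEN '2013-02-01 07:52:06' AND '2013-02-01 11:52:06'
GROUP BY DATE_FORMAT(date, '%Y-%m-%d %H:00:00');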

Then, all the data you're loading in comes from the s table. You might be better served by implementing constraints on product_id to make sure that statistics do have a product and the latter does have a brand.
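
A sketch of such a constraint, assuming the products table is mw_products with primary key id (the constraint name is made up); the existing index on product_id already satisfies InnoDB's requirement that foreign key columns be indexed:

ALTER TABLE mw_statistics
    ADD CONSTRAINT fk_statistics_product
    FOREIGN KEY (product_id) REFERENCES mw_products (id);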

You could also check beforehand whether the product_ids are legit in this regard. When this is done, your query boils down to

SELECT 
    (UNIX_TIMESTAMP(date)*1000+3600000) as time,
    ROUND((loadtime / loadtimeMeasurements), 3) as loadtime
FROM mw_statistics
WHERE type = 0
    AND product_id IN (1,8,9,10,11)
    AND date BETWEEN '2013-02-01 06:12:32' AND '2013-02-01 10:12:30'
;

which, indexed on type, product_id and date, ought to run in tens of milliseconds.

Specific attempt

CREATE INDEX mw_ndx ON mw_statistics (
          type, product_id, date, loadtime, loadtimeMeasurements
     );

SELECT
    (UNIX_TIMESTAMP(date)*1000+3600000) as time,
    ROUND((loadtime / loadtimeMeasurements), 3) as loadtime
FROM mw_statistics
WHERE type = 0
  AND product_id IN (1,8,9,10,11)
  AND date BETWEEN '2013-02-01 06:12:32' AND '2013-02-01 10:12:30'
;

This way, the necessary records are quickly whittled down by the exact match on type and the set selection on product_id. The date range selection also ought to perform well; in another situation you might want to consider partitioning or sharding, but with fewer than a few million records it just doesn't seem worthwhile. Every index entry is widened by the extra loadtime and loadtimeMeasurements columns, but by accepting this small overhead the query never has to touch the main table at all: the index covers it entirely.

Query runtime will depend on column cardinality; but on an evenly (actually randomly) populated sample table with one million rows, I'm getting round-trip times between 8 and 90 milliseconds, depending on cache performance and the number of rows actually retrieved.
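
To get comparable server-side timings on a MySQL of that vintage, the session profiler is one option:

SET profiling = 1;

SELECT
    (UNIX_TIMESTAMP(date)*1000+3600000) as time,
    ROUND((loadtime / loadtimeMeasurements), 3) as loadtime
FROM mw_statistics
WHERE type = 0
  AND product_id IN (1,8,9,10,11)
  AND date BETWEEN '2013-02-01 06:12:32' AND '2013-02-01 10:12:30';

SHOW PROFILES;              -- lists Query_ID and duration for each statement
SHOW PROFILE FOR QUERY 1;   -- replace 1 with the Query_ID reported by SHOW PROFILES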

For a more precise tuning I'd need the output of EXPLAIN SELECT (UNIX_TIMESTAMP....
