I need an SQL guru to help me speed up my query.
I have two tables, quantities and prices. quantities records a quantity value between two timestamps, 15 minutes apart. prices records a price for a given timestamp and price type, and there is a price record for every 5 minutes.
I need to work out the total price for each period (e.g. hour or day) between two timestamps. This is calculated, within each period, as the sum of (quantity multiplied by the average of the 3 prices in that 15 minute quantity window).
For example, let's say I want to see the total price for each hour over 1 day. The total price value in each row of the result set is the sum of the total prices for each of the four 15 minute periods in that hour. And the total price for each 15 minute period is calculated by multiplying the quantity value in that period by the average of the 3 prices (one per 5 minutes) in that period.
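To make that concrete with made-up numbers: a 15 minute period with quantity 2.00 and prices 0.10, 0.20 and 0.30 contributes 2.00 * (0.10 + 0.20 + 0.30) / 3 = 0.40 to its hour, and the hour's total is the sum of four such contributions.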
Here's the query I'm using, and the results:
SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
SELECT AVG( `prices`.`price` )
FROM `prices`
WHERE `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
+---------------------+---------------------+----------+
| start | end | total |
+---------------------+---------------------+----------+
| 2010-07-01 00:00:00 | 2010-07-01 01:00:00 | 0.677733 |
| 2010-07-01 01:00:00 | 2010-07-01 02:00:00 | 0.749133 |
| 2010-07-01 02:00:00 | 2010-07-01 03:00:00 | 0.835467 |
| 2010-07-01 03:00:00 | 2010-07-01 04:00:00 | 0.692233 |
| 2010-07-01 04:00:00 | 2010-07-01 05:00:00 | 0.389533 |
| 2010-07-01 05:00:00 | 2010-07-01 06:00:00 | 0.335300 |
| 2010-07-01 06:00:00 | 2010-07-01 07:00:00 | 1.231467 |
| 2010-07-01 07:00:00 | 2010-07-01 08:00:00 | 0.352800 |
| 2010-07-01 08:00:00 | 2010-07-01 09:00:00 | 1.447200 |
| 2010-07-01 09:00:00 | 2010-07-01 10:00:00 | 0.756733 |
| 2010-07-01 10:00:00 | 2010-07-01 11:00:00 | 0.599467 |
| 2010-07-01 11:00:00 | 2010-07-01 12:00:00 | 1.056467 |
| 2010-07-01 12:00:00 | 2010-07-01 13:00:00 | 1.252600 |
| 2010-07-01 13:00:00 | 2010-07-01 14:00:00 | 1.285567 |
| 2010-07-01 14:00:00 | 2010-07-01 15:00:00 | 0.442933 |
| 2010-07-01 15:00:00 | 2010-07-01 16:00:00 | 0.692567 |
| 2010-07-01 16:00:00 | 2010-07-01 17:00:00 | 1.281067 |
| 2010-07-01 17:00:00 | 2010-07-01 18:00:00 | 0.652033 |
| 2010-07-01 18:00:00 | 2010-07-01 19:00:00 | 1.721900 |
| 2010-07-01 19:00:00 | 2010-07-01 20:00:00 | 1.362400 |
| 2010-07-01 20:00:00 | 2010-07-01 21:00:00 | 1.099300 |
| 2010-07-01 21:00:00 | 2010-07-01 22:00:00 | 0.646267 |
| 2010-07-01 22:00:00 | 2010-07-01 23:00:00 | 0.873100 |
| 2010-07-01 23:00:00 | 2010-07-02 00:00:00 | 0.546533 |
+---------------------+---------------------+----------+
24 rows in set (5.16 sec)
I need the query to run a lot faster than this, and I would have thought it would be possible. Here are the results from EXPLAIN EXTENDED ...
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| 1 | PRIMARY | quantities | range | start_timestamp | start_timestamp | 8 | NULL | 89 | Using where; Using temporary; Using filesort |
| 2 | DEPENDENT SUBQUERY | prices | ref | timestamp,type_id | type_id | 4 | const | 22930 | Using where |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
2 rows in set, 3 warnings (0.00 sec)
I noticed the dependent subquery is not using the timestamp field in its key, and the query is scanning loads of rows.
Can anyone help me get this running a hell of a lot faster?
Here are the SQL statements required to create the schema and fill it with a lot of data (2 months' worth):
# Create prices table
CREATE TABLE `prices` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`timestamp` datetime NOT NULL,
`type_id` int(11) NOT NULL,
`price` float(8,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `timestamp` (`timestamp`),
KEY `type_id` (`type_id`)
) ENGINE=MyISAM;
# Create quantities table
CREATE TABLE `quantities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_timestamp` datetime NOT NULL,
`end_timestamp` datetime NOT NULL,
`quantity` float(7,2) NOT NULL,
PRIMARY KEY (`id`),
KEY `start_timestamp` (`start_timestamp`),
KEY `end_timestamp` (`end_timestamp`)
) ENGINE=MyISAM;
# Insert first 2 rows into prices, one for each of 2 types, starting 64 days ago
INSERT INTO `prices` (`id`, `timestamp`, `type_id`, `price`) VALUES
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '1', RAND()),
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '2', RAND());
# Fill the prices table with a record for each type, for every 5 minutes, for the next 64 days
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 32 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 16 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 8 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 4 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 2 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 1 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 12 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 6 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 3 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 90 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 45 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 20 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 10 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_SUB(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices WHERE MOD( (TIME_TO_SEC( `timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 *60 ) = 0 AND `timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
# Insert first row into quantities, start timestamp is 64 days ago, end timestamp is start timestamp plus 15 minutes
INSERT INTO `quantities` (`id`, `start_timestamp`, `end_timestamp`, `quantity`) VALUES (NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), DATE_SUB(CURDATE(), INTERVAL '63 23:45' DAY_MINUTE), RAND());
# Fill the quantities table with a record for each 15 minute period for the next 64 days
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 32 DAY), DATE_ADD(`end_timestamp`, INTERVAL 32 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 16 DAY), DATE_ADD(`end_timestamp`, INTERVAL 16 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 8 DAY), DATE_ADD(`end_timestamp`, INTERVAL 8 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 4 DAY), DATE_ADD(`end_timestamp`, INTERVAL 4 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 2 DAY), DATE_ADD(`end_timestamp`, INTERVAL 2 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 1 DAY), DATE_ADD(`end_timestamp`, INTERVAL 1 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 12 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 12 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 6 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 6 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 3 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 3 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 90 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 90 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 45 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 45 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 15 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities;
INSERT INTO quantities (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_SUB(`start_timestamp`, INTERVAL 15 MINUTE), DATE_SUB(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities WHERE MOD( (TIME_TO_SEC( `start_timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 * 60 ) = 0 AND `start_timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');
4 Answers
#1
5
Here is my first attempt. This one is dirty and relies on the following properties of the data:
- there are three 5-minute prices for each quarter-hour in quantities (if the data violates this, the query will not work)
- neither the presence of a price for every slot nor the cardinality of three is guaranteed by data integrity checks, which is why I call it dirty
- it is also not flexible to changes in period lengths
Query 1:
SELECT sql_no_cache
min(q.start_timestamp) as start,
max(q.end_timestamp) as end,
sum((p1.price + p2.price + p3.price)/3*q.quantity) as total
FROM
quantities q join
prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join
prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join
prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1
WHERE
q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59'
GROUP BY hour(q.start_timestamp);
This one returns results in 0.01 sec on my slow testing machine, where the original query runs in ~6 sec and gnarf's query in ~0.85 sec (all queries were always tested with the SQL_NO_CACHE keyword, which prevents reuse of cached results, but on a warm database).
EDIT: Here is a version (Query 1a) that is not sensitive to missing rows on the prices side:
SELECT sql_no_cache
min(q.start_timestamp) as start,
max(q.end_timestamp) as end,
sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / (
3 -
COALESCE(p1.price-p1.price,1) -
COALESCE(p2.price-p2.price,1) -
COALESCE(p3.price-p3.price,1)
)
*q.quantity) as total
FROM
quantities q LEFT JOIN
prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1
WHERE
q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59'
GROUP BY hour(q.start_timestamp);
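The COALESCE(pN.price - pN.price, 1) terms above evaluate to 1 when a price is missing (NULL propagates through the subtraction) and to 0 otherwise, so the divisor ends up counting the prices actually found. As a sketch of an arguably clearer way to write the same divisor, relying on MySQL evaluating boolean expressions as 1/0 (same joins and filter as Query 1a; a quarter with no prices at all divides by zero and contributes NULL in both versions):
SELECT sql_no_cache
min(q.start_timestamp) as start,
max(q.end_timestamp) as end,
-- (expr IS NOT NULL) is 1 or 0 in MySQL, so the divisor counts found prices
sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) )
/ ( (p1.price IS NOT NULL) + (p2.price IS NOT NULL) + (p3.price IS NOT NULL) )
* q.quantity ) as total
FROM
quantities q LEFT JOIN
prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1
WHERE
q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59'
GROUP BY hour(q.start_timestamp);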
EDIT 2: Query 2: Here is a direct improvement to your query, and a different approach, with minimal changes, that brings the execution time down to ~0.22 sec on my machine:
SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
SELECT AVG( `prices`.`price` )
FROM `prices`
WHERE
`prices`.`timestamp` >= '2010-07-01 00:00:00'
AND `prices`.`timestamp` < '2010-07-02 00:00:00'
AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
That is MySQL 5.1; I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could relate start_timestamp and timestamp through a foreign key, that should allow these kinds of correlated queries to make use of indexes. For that you would need to modify the design and establish some sort of timeline table, which both quantities and prices could then reference.
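For illustration, a hypothetical shape for such a timeline table (the names are made up, and note that MyISAM does not enforce FOREIGN KEY constraints, so under this engine the relationship would hold by convention only):
CREATE TABLE `timeline` (
`ts` datetime NOT NULL,
PRIMARY KEY (`ts`)
) ENGINE=MyISAM;
quantities.start_timestamp, quantities.end_timestamp and prices.timestamp would then only ever hold values present in timeline.ts, letting correlated queries walk an indexed key instead of an open-ended range.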
Query 3: Finally, the last version, which does it in ~0.03 sec but should be as robust and flexible as Query 2:
SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND
p.timestamp < q.end_timestamp AND
p.timestamp >= '2010-07-01 00:00:00' AND
p.`timestamp` < '2010-07-02 00:00:00'
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00'
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );
NOTE: Do not forget to remove sql_no_cache keywords in production.
There are many counter-intuitive tricks applied in the above queries (sometimes conditions repeated in the join condition speed up queries, sometimes they slow them down). MySQL is a great little RDBMS, and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into scenarios like the above.
So in general, I apply the following principle to set my expectations regarding the performance of a query:
- if the base result set has < 1,000 rows, then the query should do its business in ~0.01 sec (the base result set is the number of rows that functionally determine the resulting set)

In this particular case you start with fewer than 1,000 rows (all the prices and quantities in one day at 15-minute precision: 96 quantity rows and 288 prices of the relevant type), and from that you should be able to compute the final results.
#2
2
This should return the same results and perform slightly faster:
SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * `prices`.`price` )
* COUNT(DISTINCT `quantities`.`id`)
/ COUNT(DISTINCT `prices`.`id`)
AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR( `quantities`.`start_timestamp` );
Since you can't calculate AVG() inside SUM(), I had to do some interesting COUNT(DISTINCT) arithmetic to account for the number of prices returned per quantity: the join repeats each quantity once per matching price (3 times here), so scaling SUM(quantity * price) by COUNT(DISTINCT quantities.id) / COUNT(DISTINCT prices.id) divides the sum back down by the prices-per-quantity ratio. I'm wondering if this gives you the same results with "real" data...
Using JOIN:
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys                 | key             | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8       | NULL |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ALL   | timestamp,type_id             | NULL            | NULL    | NULL | 36862 |    62.20 | Using where; Using join buffer               |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
vs. the same query only adding LEFT to the JOIN:
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys     | key             | key_len | ref   | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 |   100.00 |                                              |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
Interesting that LEFT completely removes end_timestamp as a possible key and changes the selected keys so much, making it take 15 times as long...
This reference page could help you out a little more if you want to look at specifying index hints for your JOINs.
#3
0
I don't know if it is faster, but try this one:
SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
( `quantities`.`quantity` * AVG(`prices`.`price`) * COUNT(`prices`.`price`) ) AS `total`
FROM `quantities`
LEFT JOIN `prices`
ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
AND `prices`.`timestamp` < `quantities`.`end_timestamp`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
AND `prices`.`type_id` = 1
GROUP BY HOUR( `quantities`.`start_timestamp` );
Also compare the results, because the logic is a little different:

I don't do SUM(quantity * AVG(price)).

I do AVG(price) * COUNT(price) * quantity. Note that AVG(price) * COUNT(price) is just SUM(price) over the whole hour group, and quantity comes from a single arbitrary row of that group, so the totals can differ whenever quantities vary within the hour.
#4
0
Remember, just because you have indexes on your columns doesn't necessarily mean queries will run faster. As it stands, there is a separate index for each individual column, which would return results quite fast if you were only filtering on one column.
So to try to avoid "Using filesort" (which you should avoid as much as possible), maybe try the following index:
CREATE INDEX start_timestamp_end_timestamp_id ON quantities (start_timestamp,end_timestamp,id);
And something similar for the prices table, combining the 3 individual indexes you have into 1 index for faster lookups.
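For example, something along these lines; the column order is my assumption (the type_id equality predicate first, then the timestamp range), and including price makes the index covering for the AVG subquery:
CREATE INDEX type_id_timestamp_price ON prices (type_id,timestamp,price);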
An excellent resource that explains all of this in great detail, including how to optimize your indexes, what the different EXPLAIN values mean, and what to aim for, is: http://hackmysql.com/case1