What is the 'correct' query to fetch a cumulative sum in MySQL?
I have a table where I keep information about files; one of its columns contains the size of each file in bytes. (The actual files are kept on disk somewhere.)
I would like to get the cumulative file size like this:
+------------+---------+--------+----------------+
| fileInfoId | groupId | size   | cumulativeSize |
+------------+---------+--------+----------------+
|          1 |       1 | 522120 |         522120 |
|          2 |       2 | 316042 |         316042 |
|          4 |       2 | 711084 |        1027126 |
|          5 |       2 | 697002 |        1724128 |
|          6 |       2 | 663425 |        2387553 |
|          7 |       2 | 739553 |        3127106 |
|          8 |       2 | 700938 |        3828044 |
|          9 |       2 | 695614 |        4523658 |
|         10 |       2 | 744204 |        5267862 |
|         11 |       2 | 609022 |        5876884 |
|        ... |     ... |    ... |            ... |
+------------+---------+--------+----------------+
20000 rows in set (19.2161 sec.)
Right now, I use the following query to get the above results:
SELECT
      a.fileInfoId
    , a.groupId
    , a.size
    , SUM(b.size) AS cumulativeSize
FROM fileInfo AS a
LEFT JOIN fileInfo AS b USING (groupId)
WHERE a.fileInfoId >= b.fileInfoId
GROUP BY a.fileInfoId
ORDER BY a.groupId, a.fileInfoId
My solution is, however, extremely slow (around 19 seconds without caching).
EXPLAIN gives the following execution details:
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
| id | select_type | table | type  | possible_keys     | key       | key_len | ref            | rows  | Extra       |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
|  1 | SIMPLE      | a     | index | PRIMARY,foreignId | PRIMARY   | 4       | NULL           | 14905 |             |
|  1 | SIMPLE      | b     | ref   | PRIMARY,foreignId | foreignId | 4       | db.a.foreignId |    36 | Using where |
+----+-------------+-------+-------+-------------------+-----------+---------+----------------+-------+-------------+
My question is:
How can I optimize the above query?
Update
I've updated the question to provide the table structure and a procedure to fill the table with 20,000 rows of test data.
CREATE TABLE `fileInfo` (
      `fileInfoId` int(10) unsigned NOT NULL AUTO_INCREMENT
    , `groupId` int(10) unsigned NOT NULL
    , `name` varchar(128) NOT NULL
    , `size` int(10) unsigned NOT NULL
    , PRIMARY KEY (`fileInfoId`)
    , KEY `groupId` (`groupId`)
) ENGINE=InnoDB;
DELIMITER $$
DROP PROCEDURE IF EXISTS autofill$$
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    DECLARE gid INT DEFAULT 0;
    DECLARE nam CHAR(20);
    DECLARE siz INT DEFAULT 0;
    WHILE i < 20000 DO
        SET gid = FLOOR(RAND() * 250);
        SET nam = CONV(FLOOR(RAND() * 10000000000000), 20, 36);
        SET siz = FLOOR(RAND() * 1024 * 1024);
        INSERT INTO `fileInfo` (`groupId`, `name`, `size`) VALUES (gid, nam, siz);
        SET i = i + 1;
    END WHILE;
END$$
DELIMITER ;
CALL autofill();
About the possible duplicate question
The question linked by Forgotten Semicolon is not the same question. My question has an extra column; because of this extra groupId column, the accepted answer there does not work for my problem. (Maybe it can be adapted to work, but I don't know how, hence my question.)
2 Answers
#1
18
You could use a variable - it's far quicker than any join:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM `table`, (SELECT @total := 0) AS t;
Here's a quick test case on a Pentium III with 128MB RAM running Debian 5.0:
Create the table:
DROP TABLE IF EXISTS `table1`;
CREATE TABLE `table1` (
    `id` int(11) NOT NULL auto_increment,
    `size` int(11) NOT NULL,
    PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Fill with 20,000 random numbers:
DELIMITER //
DROP PROCEDURE IF EXISTS autofill//
CREATE PROCEDURE autofill()
BEGIN
    DECLARE i INT DEFAULT 0;
    WHILE i < 20000 DO
        INSERT INTO table1 (size) VALUES (FLOOR(RAND() * 1000));
        SET i = i + 1;
    END WHILE;
END//
DELIMITER ;
CALL autofill();
Check the row count:
SELECT COUNT(*) FROM table1;
+----------+
| COUNT(*) |
+----------+
|    20000 |
+----------+
Run the cumulative total query:
SELECT
    id,
    size,
    @total := @total + size AS cumulativeSize
FROM table1, (SELECT @total := 0) AS t;
+-------+------+----------------+
| id    | size | cumulativeSize |
+-------+------+----------------+
|     1 |  226 |            226 |
|     2 |  869 |           1095 |
|     3 |  668 |           1763 |
|     4 |  733 |           2496 |
...
| 19997 |  966 |       10004741 |
| 19998 |  522 |       10005263 |
| 19999 |  713 |       10005976 |
| 20000 |    0 |       10005976 |
+-------+------+----------------+
20000 rows in set (0.07 sec)
UPDATE
I'd missed the grouping by groupId in the original question, and that certainly made things a bit trickier. I then wrote a solution which used a temporary table, but I didn't like it—it was messy and overly complicated. I went away and did some more research, and have come up with something far simpler and faster.
I can't claim all the credit for this—in fact, I can barely claim any at all, as it is just a modified version of Emulate row number from Common MySQL Queries.
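For reference, the row-number idiom being adapted looks roughly like this (a sketch, with t standing in for any table): a user variable is seeded in a derived table and incremented once per row.

SELECT
    @row := @row + 1 AS rowNumber,
    t.*
FROM t, (SELECT @row := 0) AS vars;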
It's beautifully simple, elegant, and very quick:
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
You can remove the outer SELECT ... AS tmp if you don't mind the prev_groupId column being returned. I found that it ran marginally faster without it.
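For example, the stripped-down version would look like this; the only difference in the output is that the extra prev_groupId column comes back too:

SELECT
    fileInfoId,
    groupId,
    name,
    size,
    @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
    @prev_groupId := groupId AS prev_groupId
FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
ORDER BY groupId;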
Here's a simple test case:
INSERT INTO `fileInfo` VALUES
    ( 1, 3, 'name0', 10),
    ( 5, 3, 'name1', 10),
    ( 7, 3, 'name2', 10),
    ( 8, 1, 'name3', 10),
    ( 9, 1, 'name4', 10),
    (10, 2, 'name5', 10),
    (12, 4, 'name6', 10),
    (20, 4, 'name7', 10),
    (21, 4, 'name8', 10),
    (25, 5, 'name9', 10);
SELECT fileInfoId, groupId, name, size, cumulativeSize
FROM (
    SELECT
        fileInfoId,
        groupId,
        name,
        size,
        @cs := IF(@prev_groupId = groupId, @cs + size, size) AS cumulativeSize,
        @prev_groupId := groupId AS prev_groupId
    FROM fileInfo, (SELECT @prev_groupId := 0, @cs := 0) AS vars
    ORDER BY groupId
) AS tmp;
+------------+---------+-------+------+----------------+
| fileInfoId | groupId | name  | size | cumulativeSize |
+------------+---------+-------+------+----------------+
|          8 |       1 | name3 |   10 |             10 |
|          9 |       1 | name4 |   10 |             20 |
|         10 |       2 | name5 |   10 |             10 |
|          1 |       3 | name0 |   10 |             10 |
|          5 |       3 | name1 |   10 |             20 |
|          7 |       3 | name2 |   10 |             30 |
|         12 |       4 | name6 |   10 |             10 |
|         20 |       4 | name7 |   10 |             20 |
|         21 |       4 | name8 |   10 |             30 |
|         25 |       5 | name9 |   10 |             10 |
+------------+---------+-------+------+----------------+
Here's a sample of the last few rows from a 20,000 row table:
|      19481 |     248 | 8CSLJX22RCO | 1037469 |       51270389 |
|      19486 |     248 | 1IYGJ1UVCQE |  937150 |       52207539 |
|      19817 |     248 | 3FBU3EUSE1G |  616614 |       52824153 |
|      19871 |     248 | 4N19QB7PYT  |  153031 |       52977184 |
|        132 |     249 | 3NP9UGMTRTD |  828073 |         828073 |
|        275 |     249 | 86RJM39K72K |  860323 |        1688396 |
|        802 |     249 | 16Z9XADLBFI |  623030 |        2311426 |
...
|      19661 |     249 | ADZXKQUI0O3 |  837213 |       39856277 |
|      19870 |     249 | 9AVRTI3QK6I |  331342 |       40187619 |
|      19972 |     249 | 1MTAEE3LLEM | 1027714 |       41215333 |
+------------+---------+-------------+---------+----------------+
20000 rows in set (0.31 sec)
#2
1
I think that MySQL is only using one of the indexes on the table. In this case, it's choosing the index on foreignId.
Add a covering compound index that includes both primaryId and foreignId.
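In terms of the schema from the question, where this answer's primaryId and foreignId correspond to fileInfoId and groupId, that might look something like the following sketch (the index name is just illustrative):

-- Hypothetical index name; the columns mirror the question's schema.
ALTER TABLE fileInfo ADD INDEX idx_groupId_fileInfoId (groupId, fileInfoId);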