I have a large table containing over 10 million records, and it will keep growing. I run an aggregation query (a count of particular values) over the records from the last 24 hours. The time taken by this query will keep increasing with the number of records in the table.
I could limit the time taken by keeping the last 24 hours of records in a separate table and running the aggregation on that table. Does MySQL provide any functionality to handle this kind of scenario?
Table schema and query for reference:
    CREATE TABLE purchases (
        Id int(11) NOT NULL AUTO_INCREMENT,
        ProductId int(11) NOT NULL,
        CustomerId int(11) NOT NULL,
        PurchaseDateTime datetime(3) NOT NULL,
        PRIMARY KEY (Id),
        KEY ix_purchases_PurchaseDateTime (PurchaseDateTime) USING BTREE,
        KEY ix_purchases_ProductId (ProductId) USING BTREE,
        KEY ix_purchases_CustomerId (CustomerId) USING BTREE
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    SELECT COALESCE(SUM(ProductId = v_ProductId), 0),
           COALESCE(SUM(CustomerId = v_CustomerId), 0)
      INTO v_ProductCount, v_CustomerCount
      FROM purchases
     WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
       AND (   ProductId = v_ProductId
            OR CustomerId = v_CustomerId );
2 Answers
#1
1
Build and maintain a separate summary table.
With partitioning, you might get a small improvement, or you might get no improvement. With a summary table, you might get a factor of 10 improvement.
The summary table could have a 1-day resolution, or you might need 1-hour resolution. Please provide SHOW CREATE TABLE for what you currently have, so we can discuss more specifics.
(There is no built-in mechanism for what you want.)
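For illustration, a summary table along these lines might look like the sketch below. It assumes 1-hour resolution and covers only the per-product count; the table name purchases_hourly_product and its columns are made up for this example, and a per-customer table would follow the same pattern. The variables v_ProductId and v_ProductCount are the ones from the question.

```sql
-- Hypothetical summary table: one row per product per hour.
CREATE TABLE purchases_hourly_product (
    ProductId     INT          NOT NULL,
    PurchaseHour  DATETIME     NOT NULL,  -- purchase time truncated to the hour
    PurchaseCount INT UNSIGNED NOT NULL,
    PRIMARY KEY (ProductId, PurchaseHour)
) ENGINE=InnoDB;

-- Maintain it whenever a purchase is recorded, ideally in the same
-- transaction as the INSERT into purchases. NOW() stands in for the
-- purchase's own PurchaseDateTime here.
INSERT INTO purchases_hourly_product (ProductId, PurchaseHour, PurchaseCount)
VALUES (v_ProductId,
        DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00'),
        1)
ON DUPLICATE KEY UPDATE PurchaseCount = PurchaseCount + 1;

-- Reads at most 24 small rows for one product instead of a day of purchases.
SELECT COALESCE(SUM(PurchaseCount), 0)
  INTO v_ProductCount
  FROM purchases_hourly_product
 WHERE ProductId = v_ProductId
   AND PurchaseHour > NOW() - INTERVAL 1 DAY;
```

Note that with hourly buckets the window is rounded to hour boundaries, so this query covers somewhere between 23 and 24 hours. If an exact rolling 24 hours is required, the partial hour at the old end of the window would still have to come from the purchases table.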
#2
0
Plan A
I would leave off
    AND (   ProductId = v_ProductId
         OR CustomerId = v_CustomerId )
since the SUM() expressions in the rest of the query already count only the matching rows.
Then I would add
    INDEX(PurchaseDateTime, ProductId, CustomerId)
which would be "covering"; that is, the entire SELECT can be performed in the index's BTree. It would also be "clustered" in the sense that all the data needed would be stored consecutively in the index. Yes, the datetime is deliberately first. (OR is a nuisance to optimize. I don't trust the Optimizer to do "index merge union".)
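Put together, Plan A could be sketched roughly as follows. The index name ix_purchases_dt_product_customer is an assumption chosen for this example; the query is the one from the question with the OR clause dropped.

```sql
-- Index name is hypothetical; the column order is what matters.
ALTER TABLE purchases
    ADD INDEX ix_purchases_dt_product_customer
        (PurchaseDateTime, ProductId, CustomerId);

-- The range on PurchaseDateTime limits the scan to the last day's
-- slice of the index; the SUMs count matching rows within that slice.
SELECT COALESCE(SUM(ProductId = v_ProductId), 0),
       COALESCE(SUM(CustomerId = v_CustomerId), 0)
  INTO v_ProductCount, v_CustomerCount
  FROM purchases
 WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY;
```

To check that the index really is covering, run EXPLAIN on the SELECT (with literal values in place of the variables and without the INTO clause); the Extra column should include "Using index".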
Plan B
If you expect to touch very few rows (because of v_ProductId and v_CustomerId), then the following may be faster, in spite of being more complex:
    SELECT COALESCE(SUM(ProductId = v_ProductId), 0)
      INTO v_ProductCount
      FROM purchases
     WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
       AND ProductId = v_ProductId;

    SELECT COALESCE(SUM(CustomerId = v_CustomerId), 0)
      INTO v_CustomerCount
      FROM purchases
     WHERE PurchaseDateTime > NOW() - INTERVAL 1 DAY
       AND CustomerId = v_CustomerId;
together with both:
    INDEX(ProductId, PurchaseDateTime),
    INDEX(CustomerId, PurchaseDateTime)
Yes, the order of the columns is deliberately different.
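As a sketch, the two indexes could be added in a single statement; the index names below are assumptions for this example. Because each query now has an equality test in its WHERE clause, every row that survives the filter matches, so COUNT(*) should give the same result as the COALESCE(SUM(...), 0) form and reads a little more directly.

```sql
-- Index names are hypothetical. Equality column first, range column second,
-- so each query reads one contiguous slice of its index.
ALTER TABLE purchases
    ADD INDEX ix_purchases_product_dt  (ProductId,  PurchaseDateTime),
    ADD INDEX ix_purchases_customer_dt (CustomerId, PurchaseDateTime);

-- Equivalent to the SUM(...) form: COUNT(*) never returns NULL,
-- so no COALESCE is needed.
SELECT COUNT(*)
  INTO v_ProductCount
  FROM purchases
 WHERE ProductId = v_ProductId
   AND PurchaseDateTime > NOW() - INTERVAL 1 DAY;

SELECT COUNT(*)
  INTO v_CustomerCount
  FROM purchases
 WHERE CustomerId = v_CustomerId
   AND PurchaseDateTime > NOW() - INTERVAL 1 DAY;
```

Once these composite indexes exist, the single-column indexes ix_purchases_ProductId and ix_purchases_CustomerId from the question are left-prefixes of them and are likely redundant; dropping them would save some write overhead, but verify against your other queries first.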
Original Question
Both of these approaches are better than your original suggestion of a separate table. They isolate the relevant data in one part of an index (or of two indexes), which has the same effect as "separate", and they do the task with less effort on your part.