I currently run a site which tracks up-to-the-minute scores and ratings in a list. The list has thousands of entries that are updated frequently, and it should be sortable by the score and rating columns.
My SQL for getting this data currently looks like (roughly):
SELECT e.*, sa.score, ra.rating
FROM entries e
LEFT JOIN (
    SELECT entry_id, SUM(amount) AS score
    FROM score_adjustments
    WHERE created BETWEEN ... AND ...
    GROUP BY entry_id
) sa ON sa.entry_id = e.id
LEFT JOIN (
    SELECT entry_id, AVG(rating) AS rating
    FROM rating_adjustments
    WHERE rating > 0
    GROUP BY entry_id
) ra ON ra.entry_id = e.id
ORDER BY score
LIMIT 0, 10
Where the tables are (simplified):
entries:
    id: INT(11), PRIMARY
    ...other data...

score_adjustments:
    id: INT(11), PRIMARY
    entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
    created: DATETIME
    amount: INT(4)

rating_adjustments:
    id: INT(11), PRIMARY
    entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
    rating: DOUBLE
There are approximately 300,000 score_adjustments rows, and they grow by about 5,000 a day. The rating_adjustments table is about a quarter of that size.
Now, I'm no DBA expert, but I'm guessing that calling SUM() and AVG() all the time isn't a good thing, especially when sa and ra contain hundreds of thousands of records, right?
I already do caching on the query, but I want the query itself to be fast - yet still as up to date as possible. I was wondering if anyone could share any solutions to optimise heavy join/aggregation queries like this? I'm willing to make structural changes if necessary.
EDIT 1
Added more info about the query.
2 Answers
#1
Your data is badly clustered.
InnoDB stores rows with "close" PKs physically close together. Since your child tables use surrogate PKs, their rows are in effect stored randomly with respect to entry_id. When the time comes to make calculations for a given row in the "master" table, the DBMS must jump all over the place to gather the related rows from the child tables.
Instead of surrogate keys, try using more "natural" keys, with the parent's PK at the leading edge, similar to this:
score_adjustments:
    entry_id: INT(11), FOREIGN KEY (entries.id)
    created: DATETIME
    amount: INT(4)
    PRIMARY KEY (entry_id, created)

rating_adjustments:
    entry_id: INT(11), FOREIGN KEY (entries.id)
    rating_no: INT(11)
    rating: DOUBLE
    PRIMARY KEY (entry_id, rating_no)
NOTE: This assumes that the resolution of created is fine enough; rating_no was added to allow multiple ratings per entry_id. This is just an example, and you may vary the PKs according to your needs.
This will "force" rows belonging to the same entry_id to be stored physically close together, so a SUM or AVG can be calculated with just a range scan on the PK/clustering key and with very few I/Os.
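As a rough sketch of the clustering idea above, the score_adjustments table could be declared as follows in MySQL. The NOT NULL constraints, the entry id 42, and the date window are assumptions made for illustration; they are not taken from the question.

```sql
-- Sketch only: the composite primary key makes InnoDB cluster rows by entry_id.
CREATE TABLE score_adjustments (
    entry_id INT NOT NULL,
    created  DATETIME NOT NULL,
    amount   INT NOT NULL,
    PRIMARY KEY (entry_id, created),
    FOREIGN KEY (entry_id) REFERENCES entries (id)
) ENGINE=InnoDB;

-- All adjustments for one entry sit next to each other in the clustered
-- index, so this sum reads one contiguous range of it.
SELECT SUM(amount) AS score
FROM score_adjustments
WHERE entry_id = 42  -- hypothetical entry id
  AND created BETWEEN '2013-01-01 00:00:00' AND '2013-01-31 23:59:59';  -- hypothetical window
```

Note that two adjustments for the same entry with the same created value would violate this primary key, which is why the resolution of created matters.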
Alternatively (e.g. if you are using MyISAM, which doesn't support clustering), cover the query with indexes so the child tables are not touched at all during querying.
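A covering index for this case might look like the sketch below. The index names and the date window are made up for illustration, and the column order is one plausible choice for the query in the question rather than the only one.

```sql
-- Covering index: entry_id for the join and grouping, created for the date
-- filter, amount so SUM() can be answered from the index alone.
CREATE INDEX idx_sa_entry_created_amount
    ON score_adjustments (entry_id, created, amount);

-- Same idea for the ratings: entry_id plus the aggregated column.
CREATE INDEX idx_ra_entry_rating
    ON rating_adjustments (entry_id, rating);

-- When an index covers the query, EXPLAIN should report "Using index"
-- in the Extra column for that table.
EXPLAIN
SELECT entry_id, SUM(amount) AS score
FROM score_adjustments
WHERE created BETWEEN '2013-01-01 00:00:00' AND '2013-01-31 23:59:59'  -- hypothetical window
GROUP BY entry_id;
```

Because both indexes start with entry_id, the existing single-column index on entry_id would probably become redundant.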
On top of that, you could denormalize your design, and cache the current results in the parent table:
- Store SUM(score_adjustments.amount) as a physical field and adjust it via triggers every time a row is inserted into, updated in, or deleted from score_adjustments.
- Store SUM(rating_adjustments.rating) as "S" and COUNT(rating_adjustments.rating) as "C". When a row is added to rating_adjustments, add it to S and increment C. Calculate S/C at run-time to get the average. Handle updates and deletes similarly.
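The S and C bookkeeping for ratings could be sketched like this. The column names rating_sum and rating_count and the trigger name are hypothetical, and only the insert case is shown.

```sql
-- Hypothetical cache columns on the parent table.
ALTER TABLE entries
    ADD COLUMN rating_sum   DOUBLE NOT NULL DEFAULT 0,  -- "S"
    ADD COLUMN rating_count INT    NOT NULL DEFAULT 0;  -- "C"

-- Keep S and C in step with each new rating.
-- The NEW.rating > 0 test mirrors the filter in the question's query and
-- also skips NULL ratings; drop it if every rating should count.
CREATE TRIGGER rating_adjustments_after_insert
AFTER INSERT ON rating_adjustments
FOR EACH ROW
    UPDATE entries
    SET rating_sum   = rating_sum + NEW.rating,
        rating_count = rating_count + 1
    WHERE id = NEW.entry_id
      AND NEW.rating > 0;

-- The average is computed at read time; NULLIF avoids dividing by zero
-- for entries that have no ratings yet.
SELECT id, rating_sum / NULLIF(rating_count, 0) AS rating
FROM entries;
```

Rows that already exist in rating_adjustments are not picked up by the trigger, so the two columns would need a one-time backfill before they can be trusted.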
#2
If you're worried about performance, you could add score and rating columns to the corresponding tables and update them with a trigger on insert or update of the referenced tables. This caches the new results every time they change, so you won't have to recalculate them on every read, which significantly reduces the amount of joining needed to get the results. I'm just guessing, but in most cases the results of your query are probably fetched much more often than they are updated.
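With the totals cached, the listing query no longer needs any joins. The sketch below assumes that entries has gained score and rating columns kept current by triggers; the index name is made up.

```sql
-- Assumes entries.score and entries.rating are trigger-maintained cache columns.
-- An index on the sort column may let MySQL read the first rows in order
-- instead of sorting the whole table.
CREATE INDEX idx_entries_score ON entries (score);

SELECT e.*
FROM entries e
ORDER BY e.score
LIMIT 0, 10;
```

This only returns the same numbers as the original query if the cached score covers the same date window that the original BETWEEN clause selected.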
Check out this SQL Fiddle, http://sqlfiddle.com/#!2/b7101/1, to see how to create the triggers and what effect they have. I only added triggers on insert; you can add update triggers just as easily, and if you ever delete data, add triggers for delete as well.
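For completeness, update and delete triggers for the score could look roughly like this. The sketch assumes a cache column entries.score declared NOT NULL, a non-NULL amount, and an insert trigger that already exists; the trigger names are hypothetical and are not taken from the fiddle.

```sql
-- DELIMITER is a directive of the mysql command-line client, needed here
-- because the trigger body contains more than one statement.
DELIMITER //

-- An update may change the amount, the owning entry, or both:
-- take the old amount off the old entry, then add the new amount to the new one.
CREATE TRIGGER score_adjustments_after_update
AFTER UPDATE ON score_adjustments
FOR EACH ROW
BEGIN
    UPDATE entries SET score = score - OLD.amount WHERE id = OLD.entry_id;
    UPDATE entries SET score = score + NEW.amount WHERE id = NEW.entry_id;
END//

DELIMITER ;

-- A delete simply removes the row's contribution from its entry.
CREATE TRIGGER score_adjustments_after_delete
AFTER DELETE ON score_adjustments
FOR EACH ROW
    UPDATE entries SET score = score - OLD.amount WHERE id = OLD.entry_id;
```

Adjusting by the difference keeps each write cheap; the alternative of re-running SUM() inside the trigger is simpler to reason about but rereads every adjustment for the entry.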
I didn't add the datetime field. If the BETWEEN ... AND ... parameters change often, you might still have to do that part manually every time; otherwise you can just add the BETWEEN clause to the score_update trigger.