Which SELECT method is the most efficient, and why?

Consider a site where people vote up (+1) or down (-1) on their favourite colour. I have two tables:

One lists all the colours people can vote for, and the second records each individual vote: what colour it was for and whether it was +1 or -1.

With regard to fetching the aggregate vote for a specific colour, would it be more efficient to include an aggregate score on the colours table, so that when a person votes there is both an INSERT statement and an UPDATE statement:

INSERT INTO votes (colour,vote) VALUES ('red',-1);
UPDATE colours SET score=score-1 WHERE colour='red';

SELECT score FROM colours WHERE colour='red';

Or would it be more efficient to have just a single INSERT statement when a vote is made, and then fetch the score with:

SELECT SUM(vote) AS score FROM votes WHERE colour='red';

I guess when there's a very small number of votes then option #2 is best but does option #1 become better when the votes table is very large?

Is there some tool that I can use to give a kind of ranking to certain SQL queries depending on table sizes, etc.?
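
One way to compare how a database intends to run each query is to ask it for the query plan. The sketch below uses SQLite through Python's `sqlite3` module purely as an illustration; the question does not say which database is in use, and the index name `idx_votes_colour_vote` is made up. Most other engines have a similar `EXPLAIN` facility with their own output format.

```python
import sqlite3

# Assumption: SQLite stands in for whatever database the site really uses.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL)")

QUERY = "SELECT SUM(vote) AS score FROM votes WHERE colour = 'red'"

def plan(sql):
    """Return SQLite's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail text.
    return " | ".join(row[-1] for row in rows)

# Without an index the whole votes table has to be scanned.
plan_without_index = plan(QUERY)

# Hypothetical index covering both the filter column and the summed column.
conn.execute("CREATE INDEX idx_votes_colour_vote ON votes (colour, vote)")
plan_with_index = plan(QUERY)

print("no index:  ", plan_without_index)
print("with index:", plan_with_index)
```

The plan only tells you what the engine will do, not how long it takes, so it is best read alongside timings on realistic data.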

4 Solutions

#1

Personally, I think if you want to display an aggregate score (and I imagine that you would want to display the score frequently), then as the number of rows in the voting table increases, you'll find that the aggregate SUM query will take longer and longer and not scale very well.

In addition, if you plan on implementing queries that only show colours with a score of 100 or more, then having the stored aggregate will make for simpler and quicker queries.
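
To make the "score of 100 or more" case concrete, here is a small sketch of both forms of the query. The schema, column types and the vote counts are assumptions for illustration, run on an in-memory SQLite database.

```python
import sqlite3

# Hypothetical schema based on the question; column types are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
""")

# Made-up data: red gets 120 up-votes, blue gets 30.
for colour, n in (("red", 120), ("blue", 30)):
    conn.execute("INSERT INTO colours (colour, score) VALUES (?, ?)", (colour, n))
    conn.executemany("INSERT INTO votes (colour, vote) VALUES (?, 1)", [(colour,)] * n)

# With a stored score, the filter is a plain WHERE clause on a small table.
with_score = conn.execute(
    "SELECT colour FROM colours WHERE score >= 100"
).fetchall()

# Without it, the query has to aggregate the votes table before filtering.
with_sum = conn.execute(
    "SELECT colour FROM votes GROUP BY colour HAVING SUM(vote) >= 100"
).fetchall()

print(with_score, with_sum)
```

Both queries return the same colours; the difference is how much data each has to read to get there.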

Another advantage of using the score column is that if at some future date you want to clean out the votes table (e.g. if it gets too big), then you could do that and wouldn't lose the colour scores.

I don't think this is premature optimisation; I think this is designing a system with scale in mind. What I would do is create some sample datasets with a realistic number of votes, colours and queries per minute, and run some performance tests to evaluate which approach is better. It is easier (read: cheaper) to pick the right approach now than to fix it when things start going wrong.
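
A rough sketch of such a test, assuming SQLite in memory, 200,000 random votes and four colours (all made-up figures; substitute the volumes you actually expect and your real database). It fills both tables, then times the two SELECTs from the question.

```python
import random
import sqlite3
import time

N_VOTES = 200_000                           # assumed volume, adjust to taste
COLOURS = ["red", "green", "blue", "yellow"]  # assumed colour list

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO colours (colour) VALUES (?)", [(c,) for c in COLOURS])

# Generate random +1/-1 votes spread over the colours.
sample = [(random.choice(COLOURS), random.choice((1, -1))) for _ in range(N_VOTES)]
conn.executemany("INSERT INTO votes (colour, vote) VALUES (?, ?)", sample)

# Fill the stored score once from the vote rows so both options agree.
conn.execute("""
    UPDATE colours
    SET score = (SELECT COALESCE(SUM(vote), 0)
                 FROM votes WHERE votes.colour = colours.colour)
""")

def timed(sql, repeats=20):
    """Run `sql` several times; return its value and the mean time per run."""
    start = time.perf_counter()
    for _ in range(repeats):
        value = conn.execute(sql).fetchone()[0]
    return value, (time.perf_counter() - start) / repeats

stored, t_stored = timed("SELECT score FROM colours WHERE colour = 'red'")
summed, t_summed = timed("SELECT SUM(vote) FROM votes WHERE colour = 'red'")
print(f"stored score: {t_stored * 1e6:.1f} us, SUM over votes: {t_summed * 1e6:.1f} us")
```

The absolute numbers depend heavily on the engine, the indexes and the hardware, so treat them as a relative comparison only.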

#2

The difference in performance between the two queries is trivial. You should determine the structure based on the information you want to keep.

If you only need an aggregate score, then use

UPDATE colours SET score=score-1 WHERE colour='red';

This will be very fast, because the table colours is only going to have a few rows.

On the other hand, there might be a reason to store each user's vote (such as making sure they don't vote twice). In that case insert a row for each vote.

INSERT INTO votes (colour,vote,user_id) VALUES ('red',-1,:user_id); -- :user_id is the voter's id, bound by the application
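
If the reason for keeping per-user rows is to stop double voting, the database can enforce that itself. A minimal sketch, assuming "one vote per user per colour" is the rule you want and using SQLite for illustration; the `cast_vote` helper and the user id 42 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votes (
        colour  TEXT    NOT NULL,
        vote    INTEGER NOT NULL CHECK (vote IN (1, -1)),
        user_id INTEGER NOT NULL,
        UNIQUE (user_id, colour)   -- assumption: one vote per user per colour
    )
""")

def cast_vote(colour, vote, user_id):
    """Insert a vote; return False if this user already voted on the colour."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes (colour, vote, user_id) VALUES (?, ?, ?)",
                (colour, vote, user_id),
            )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint rejected a second vote from the same user.
        return False

first = cast_vote("red", -1, 42)
second = cast_vote("red", -1, 42)
print(first, second)
```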

But don't create a structure of unnecessary rows just because you think it will be faster.

#3

Are you prematurely optimizing, or is this a real issue?

The first approach might be faster, but you change your domain model for the sake of optimization. That's okay as long as you know what you're doing and what disadvantages it brings (for instance, you will probably need to update two tables everywhere that works with votes, which can lead to the two getting out of sync).
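
One common way to reduce the risk of the two tables drifting apart is to do the INSERT and the UPDATE in a single transaction behind one function. This is only a sketch under assumptions: SQLite, the schema from the question with guessed column types, and a hypothetical `record_vote` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
    INSERT INTO colours (colour) VALUES ('red');
""")

def record_vote(colour, vote):
    """Write the vote row and adjust the stored score as one unit of work."""
    # `with conn` commits if both statements succeed and rolls back if
    # either one raises, so the tables cannot be left half-updated.
    with conn:
        conn.execute("INSERT INTO votes (colour, vote) VALUES (?, ?)", (colour, vote))
        conn.execute(
            "UPDATE colours SET score = score + ? WHERE colour = ?", (vote, colour)
        )

for v in (1, 1, -1, 1):
    record_vote("red", v)

stored = conn.execute("SELECT score FROM colours WHERE colour = 'red'").fetchone()[0]
summed = conn.execute("SELECT SUM(vote) FROM votes WHERE colour = 'red'").fetchone()[0]
print(stored, summed)
```

This only helps if every code path that writes votes goes through the same function, which is exactly the discipline the paragraph above warns about.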

But you might consider other options. For instance, if the number of colours is not that big, you might build a cache for their ratings. That keeps the model simple and the rating mechanics plain, and provides the speed you need, minus some memory ;)
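
A very small sketch of what such a cache could look like, assuming a single application process, SQLite, and hypothetical `get_score` / `add_vote` helpers. A real deployment with several processes would need a shared cache or some expiry policy instead of a plain dict.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL)")

# In-process cache: colour -> current score. Costs one int per colour.
_score_cache = {}

def get_score(colour):
    """Return the score, running the SUM query only on a cache miss."""
    if colour not in _score_cache:
        row = conn.execute(
            "SELECT COALESCE(SUM(vote), 0) FROM votes WHERE colour = ?", (colour,)
        ).fetchone()
        _score_cache[colour] = row[0]
    return _score_cache[colour]

def add_vote(colour, vote):
    """Store the vote; keep the cached score current if one exists."""
    with conn:
        conn.execute("INSERT INTO votes (colour, vote) VALUES (?, ?)", (colour, vote))
    if colour in _score_cache:
        _score_cache[colour] += vote

for v in (1, 1, -1):
    add_vote("red", v)
score_before = get_score("red")   # cache miss, computed from the table
add_vote("red", 1)
score_after = get_score("red")    # served from the cache
print(score_before, score_after)
```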

#4

The key point to this type of optimization is what you want to optimize. Storing the sum makes insertions/deletions/updates take longer. Calculating the sum affects the performance of queries on the data.

If you are doing deletes or updates on the data, you quickly see the folly of pre-calculating the sum. Any such change to the data requires modifications to multiple records, when you think you are only changing one.

Your structure, though, appears to have only inserts -- a good design choice by the way, because you see every change. In this case, the question is whether you want to take the overhead on each insert or you want the overhead on the "reporting" side. The question is easy in certain cases.

If you have 1000 votes for every time that you are going to look at the sum, calculate the sum on the fly. If you have 1000 sums for every vote, then storing the sum looks like the more efficient approach.

My guess is that the work-load is somewhere between the extremes. My natural bias is to store the data as generated, and then to have additional tables for summaries and reporting. I would recommend one of the following two approaches:

(1) Keep only the transaction data and calculate the sums on-the-fly. Arrange the indexes on the table to make the sums as efficient as possible.

(2) Keep only the transactions in one table and maintain the sums in another table (using either a trigger or a stored procedure). This gives you the up-to-date values needed for most purposes. The inserts should be more efficient than storing the sum on each record (because the table at the user level is smaller than the table at the vote level).
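
As an illustration of approach (2), here is a sketch of a trigger that maintains the summary table on every insert. It assumes SQLite syntax and uses the question's colours table as the summary table; the trigger name `votes_after_insert` is made up, and other databases spell triggers somewhat differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);

    -- Hypothetical trigger: fold each new vote into the stored score.
    CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
    BEGIN
        UPDATE colours SET score = score + NEW.vote WHERE colour = NEW.colour;
    END;

    INSERT INTO colours (colour) VALUES ('red'), ('blue');
""")

# The application now only ever issues the single INSERT.
with conn:
    conn.executemany(
        "INSERT INTO votes (colour, vote) VALUES (?, ?)",
        [("red", 1), ("red", 1), ("blue", -1), ("red", -1)],
    )

scores = dict(conn.execute("SELECT colour, score FROM colours"))
print(scores)
```

Because the trigger lives in the database, no code path can insert a vote and forget the score, at the cost of a little extra work on each insert.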

Your suggestion of calculating the sum in the votes record would not normally be an option that I would consider. It would be desirable when you need the history of incremental votes. But, if you are looking at the history, then doing the sum calculation or calculating the sum in the application layer would also be feasible alternatives.
