
Time: 2022-09-23 17:12:53

Consider a site where people vote up (+1) or down (-1) on their favourite colour and I have two tables:


One lists all the colours people can vote for, and the second records each individual vote: which colour it was for and whether it was +1 or -1.


With regard to fetching the aggregate vote for a specific colour, would it be more efficient to include an aggregate score on the colours table, so that when a person votes there is an INSERT statement and an UPDATE statement:


INSERT INTO votes (colour,vote) VALUES ('red',-1);
UPDATE colours SET score=score-1 WHERE colour='red';

SELECT score FROM colours WHERE colour='red';

Or would it be more efficient to have just a single INSERT statement when a vote is made, and then fetch the score with:


SELECT SUM(vote) AS score FROM votes WHERE colour='red';

I guess when there's a very small number of votes then option #2 is best but does option #1 become better when the votes table is very large?


Is there some tool that I can use to give a kind of ranking on certain SQL Queries depending on table sizes etc?


4 solutions



Personally, I think if you want to display an aggregate score (and I imagine that you would want to display the score frequently), then as the number of rows in the voting table increases, you'll find that the aggregate SUM query will take longer and longer and not scale very well.


In addition, if you plan on implementing queries that only show colours with a score of 100 or more, then having the aggregate will make for simpler and quicker queries.
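To illustrate the "score of 100 or more" case, here is a minimal sketch using Python's built-in sqlite3 module. The table layout follows the question, but the column types and the sample data are assumptions made for the example, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
""")

# Hypothetical sample data: 120 up-votes for red, 5 down-votes for blue.
conn.executemany("INSERT INTO colours (colour) VALUES (?)", [("red",), ("blue",)])
for colour, vote, count in [("red", 1, 120), ("blue", -1, 5)]:
    for _ in range(count):
        conn.execute("INSERT INTO votes (colour, vote) VALUES (?, ?)", (colour, vote))
        conn.execute("UPDATE colours SET score = score + ? WHERE colour = ?", (vote, colour))

# With the stored score, the filter is a plain WHERE clause on a small table.
stored = conn.execute("SELECT colour FROM colours WHERE score >= 100").fetchall()

# Without it, the vote rows have to be grouped and summed before filtering.
computed = conn.execute(
    "SELECT colour FROM votes GROUP BY colour HAVING SUM(vote) >= 100"
).fetchall()
```

Both queries return the same colours here; the difference is how many rows each one has to read.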


Another advantage of using the score column is that if at some future date you want to clean out the votes table (e.g. if it gets too big), you could do that without losing the colour scores.


I don't think this is premature optimisation; I think this is designing a system with scale in mind. What I would do is create sample datasets with the realistic number of votes, colours and queries per minute you'd expect, and run some performance tests to evaluate which approach is better. It is easier (read: cheaper) to pick the right approach now than to fix it when things start going wrong.
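A rough sketch of such a test, using Python's sqlite3 and timeit modules. The vote count, colour list and repeat count are placeholder assumptions; the numbers it prints depend entirely on your machine and database engine, so treat it as a template rather than a result.

```python
import random
import sqlite3
import timeit

N_VOTES = 20_000                    # hypothetical volume; use your own estimate
COLOURS = ["red", "green", "blue"]  # hypothetical colour list

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO colours (colour) VALUES (?)", [(c,) for c in COLOURS])

# Generate random +1/-1 votes with a fixed seed so runs are repeatable.
rng = random.Random(0)
sample = [(rng.choice(COLOURS), rng.choice([1, -1])) for _ in range(N_VOTES)]
conn.executemany("INSERT INTO votes (colour, vote) VALUES (?, ?)", sample)

# Bring the stored scores in line with the generated votes.
conn.execute("""
    UPDATE colours
    SET score = (SELECT COALESCE(SUM(vote), 0) FROM votes WHERE votes.colour = colours.colour)
""")

def read_stored():
    return conn.execute("SELECT score FROM colours WHERE colour = 'red'").fetchone()[0]

def read_summed():
    return conn.execute("SELECT SUM(vote) FROM votes WHERE colour = 'red'").fetchone()[0]

stored_seconds = timeit.timeit(read_stored, number=50)
summed_seconds = timeit.timeit(read_summed, number=50)
print(f"stored column: {stored_seconds:.4f}s, SUM over votes: {summed_seconds:.4f}s")
```

To get a fuller picture you would also time the write path (INSERT alone versus INSERT plus UPDATE) and repeat the run with an index on the votes table.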




The difference in performance between the two queries is trivial. You should determine the structure based on the information you want to keep.


If you only need an aggregate score, then use


UPDATE colours SET score=score-1 WHERE colour='red';

This will be very fast, because the colours table is only going to have a few rows.


On the other hand, there might be a reason to store each user's vote (such as making sure they don't vote twice). In that case insert a row for each vote.


INSERT INTO votes (colour,vote,user_id) VALUES ('red',-1,:user_id);

But don't create a structure of unnecessary rows just because you think it will be faster.
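If the reason for keeping individual rows is to stop double voting, the database can enforce it. The sketch below uses Python's sqlite3 module and assumes one vote per user per colour, expressed as a UNIQUE constraint; the `cast_vote` helper and the column types are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votes (
        user_id INTEGER NOT NULL,
        colour  TEXT    NOT NULL,
        vote    INTEGER NOT NULL CHECK (vote IN (1, -1)),
        UNIQUE (user_id, colour)   -- assumption: one vote per user per colour
    )
""")

def cast_vote(user_id, colour, vote):
    """Return True if the vote was recorded, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO votes (user_id, colour, vote) VALUES (?, ?, ?)",
                (user_id, colour, vote),
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = cast_vote(1, "red", -1)
second = cast_vote(1, "red", 1)   # same user, same colour: rejected
total = conn.execute("SELECT SUM(vote) FROM votes WHERE colour = 'red'").fetchone()[0]
```

The second call is rejected by the constraint, so the total for red reflects only the first vote.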




Are you prematurely optimizing or is this a real issue?


The first approach might be faster, but you change your domain model for the sake of optimization. That's okay as long as you know what you're doing and what disadvantages it brings (for instance, every place that works with votes probably has to update two tables, which can leave them out of sync).
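One common way to limit that risk is to run the INSERT and the UPDATE in a single transaction, so either both happen or neither does. A minimal sketch with Python's sqlite3 module follows; the `record_vote` helper and the column types are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colours (colour TEXT PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
    INSERT INTO colours (colour) VALUES ('red');
""")

def record_vote(colour, vote):
    # "with conn" commits if both statements succeed and rolls back if either fails.
    with conn:
        conn.execute("INSERT INTO votes (colour, vote) VALUES (?, ?)", (colour, vote))
        updated = conn.execute(
            "UPDATE colours SET score = score + ? WHERE colour = ?", (vote, colour)
        )
        if updated.rowcount != 1:
            raise ValueError(f"unknown colour: {colour}")

record_vote("red", -1)
try:
    record_vote("mauve", 1)   # not in colours, so its INSERT is rolled back too
except ValueError:
    pass

score = conn.execute("SELECT score FROM colours WHERE colour = 'red'").fetchone()[0]
vote_rows = conn.execute("SELECT COUNT(*) FROM votes").fetchone()[0]
```

This keeps the two tables consistent for code that goes through `record_vote`; it does not protect against other code paths that write to only one of the tables.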


But you might consider other options. For instance, if the number of colours is not that big, you could build a cache for their ratings. That keeps the simple model and plain rating mechanics and provides the speed you need, at the cost of some memory ;)
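A very small sketch of what such a cache could look like, in Python with sqlite3. It assumes a single application process owns all writes (with several processes you would need a shared cache or invalidation), and the `ScoreCache` class is a made-up name for the example.

```python
import sqlite3

class ScoreCache:
    """In-memory per-colour scores, kept next to an insert-only votes table."""

    def __init__(self, conn):
        self.conn = conn
        # One aggregate query at start-up fills the cache.
        self.scores = dict(
            conn.execute("SELECT colour, SUM(vote) FROM votes GROUP BY colour")
        )

    def vote(self, colour, vote):
        with self.conn:
            self.conn.execute(
                "INSERT INTO votes (colour, vote) VALUES (?, ?)", (colour, vote)
            )
        # Update the cache only after the row has been committed.
        self.scores[colour] = self.scores.get(colour, 0) + vote

    def score(self, colour):
        return self.scores.get(colour, 0)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL)")
conn.executemany("INSERT INTO votes VALUES (?, ?)", [("red", 1), ("red", 1), ("blue", -1)])

cache = ScoreCache(conn)
cache.vote("red", -1)
```

Reads via `cache.score(...)` never touch the database, and the votes table stays the single source of truth from which the cache can be rebuilt.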




The key point to this type of optimization is what you want to optimize. Storing the sum makes insertions/deletions/updates take longer. Calculating the sum affects the performance of queries on the data.


If you are doing deletes or updates on the data, you quickly see the folly of pre-calculating the sum. Any such change to the data requires modifications to multiple records, when you think you are only changing one.


Your structure, though, appears to have only inserts -- a good design choice by the way, because you see every change. In this case, the question is whether you want to take the overhead on each insert or you want the overhead on the "reporting" side. The question is easy in certain cases.


If you have 1000 votes for every time that you look at the sum, calculate the sum on the fly. If you look at the sum 1000 times for every vote, then storing the sum looks like the more efficient approach.


My guess is that the work-load is somewhere between the extremes. My natural bias is to store the data as generated, and then to have additional tables for summaries and reporting. I would recommend one of the following two approaches:


(1) Keep only the transaction data and calculate the sums on-the-fly. Arrange the indexes on the table to make the sums as efficient as possible.


(2) Keep only the transactions in one table and calculate the sums in another table (using either a trigger or a stored procedure). This gives you the up-to-date values needed for most purposes. The inserts should be more efficient than storing the sum on each record (because the table at the user level is smaller than the table at the vote level).
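The two approaches above can be sketched side by side with Python's sqlite3 module. The trigger syntax is SQLite's and will differ in other engines; the `colour_scores` table, the index name and the trigger name are assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (colour TEXT NOT NULL, vote INTEGER NOT NULL);
    CREATE TABLE colour_scores (colour TEXT PRIMARY KEY, score INTEGER NOT NULL);

    -- Approach (1): an index on (colour, vote) lets SUM(vote) for one colour
    -- be answered from the index without reading the table rows.
    CREATE INDEX votes_colour_vote ON votes (colour, vote);

    -- Approach (2): a trigger keeps the summary table current on every insert.
    CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
    BEGIN
        INSERT OR IGNORE INTO colour_scores (colour, score) VALUES (NEW.colour, 0);
        UPDATE colour_scores SET score = score + NEW.vote WHERE colour = NEW.colour;
    END;
""")

conn.executemany(
    "INSERT INTO votes (colour, vote) VALUES (?, ?)",
    [("red", 1), ("red", -1), ("red", -1), ("blue", 1)],
)

on_the_fly = conn.execute(
    "SELECT SUM(vote) FROM votes WHERE colour = 'red'"
).fetchone()[0]
from_summary = conn.execute(
    "SELECT score FROM colour_scores WHERE colour = 'red'"
).fetchone()[0]
```

Because the trigger lives in the database, application code only ever issues the single INSERT, and the summary cannot be skipped by a code path that forgets the UPDATE.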


Your suggestion of calculating the sum in the votes record would not normally be an option that I would consider. It would be desirable when you need the history of incremental votes. But, if you are looking at the history, then doing the sum calculation or calculating the sum in the application layer would also be feasible alternatives.



