I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments
I wanted to set the record straight and ask the question in a proper technical way.
Say I have 2 tables:
Posts
- id
- content
- parent_id (null for questions, question_id for answers)

Comments
- id
- body
- is_deleted
- post_id
- upvotes
- date
Note: I think this is how the schema for SO is set up: answers have a parent_id, which is the question; questions have null there. Questions and answers are stored in the same table.
How do I pull out comments Stack Overflow style in a very efficient way with minimal round trips?
The rules:
- A single query should pull out all the comments needed for a page with multiple posts to render
- Needs to pull out only 5 comments per answer, with preference for upvoted comments
- Needs to provide enough information to tell the user that there are more comments beyond the 5 shown, including the actual count (e.g. "2 more comments")
- Sorting is really hairy for comments, as you can see from the comments on this question. The rule is: display comments by date; HOWEVER, if a comment has an upvote it gets preferential treatment and is displayed as well, at the bottom of the list. (This is nasty and hard to express in SQL.)
If any denormalizations would make this better, what are they? Which indexes are critical?
3 solutions
#1
4
I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.
It's actually pretty infrequent that a given post has more than five comments, so filtering is rarely needed. In Stack Overflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.
So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.
You could do it this way:
    SELECT q.PostId, a.PostId, c.CommentId
    FROM Posts q
    LEFT OUTER JOIN Posts a
      ON (a.ParentId = q.PostId)
    LEFT OUTER JOIN Comments c
      ON (c.PostId IN (q.PostId, a.PostId))
    WHERE q.PostId = 1234
    ORDER BY q.PostId, a.PostId, c.CommentId;
But this gives you redundant copies of the q and a columns, which is significant because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the app becomes significant.
So it's probably better to do this in two queries. Fetch the posts separately; then, given that the client is viewing a question with PostId = 1234, fetch the comments like this:
    SELECT c.PostId, c.Text
    FROM Comments c
    JOIN (SELECT 1234 AS PostId UNION ALL
          SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
      ON (c.PostId = p.PostId);
And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.
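As a minimal sketch of that application-side step, the Python below groups fetched rows by post and keeps five per post. It assumes the comment query also selects CommentId and an upvote count (the query above returns only PostId and Text), and it assumes "most interesting" means most upvoted, with older comments winning ties. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from heapq import nlargest


def top_comments_by_post(rows, limit=5):
    """Group (post_id, comment_id, upvotes, text) rows and cap each group."""
    grouped = defaultdict(list)
    for post_id, comment_id, upvotes, text in rows:
        grouped[post_id].append((comment_id, upvotes, text))

    page = {}
    for post_id, comments in grouped.items():
        # Assumption: "most interesting" = most upvoted; a lower CommentId
        # (an older comment) wins ties.
        best = nlargest(limit, comments, key=lambda c: (c[1], -c[0]))
        best.sort(key=lambda c: c[0])  # show the survivors in CommentId order
        page[post_id] = {"shown": best, "more": len(comments) - len(best)}
    return page


# Hypothetical rows: six comments on question 1234, one on answer 1300.
rows = [
    (1234, 1, 0, "first"), (1234, 2, 0, "second"), (1234, 3, 1, "third"),
    (1234, 4, 0, "fourth"), (1234, 5, 0, "fifth"), (1234, 6, 4, "sixth"),
    (1300, 7, 0, "only comment on an answer"),
]
page = top_comments_by_post(rows)
print([c[0] for c in page[1234]["shown"]], page[1234]["more"])
```

The "more" value is what a page would need to render a "1 more comment" link without a second count query.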
I tested these two queries against a MySQL 5.1 database loaded with Stack Overflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).
The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)
#2
1
The real question is not the query but the schema, especially the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as 'pull 5 comments per post (answer or question) and give preference to upvoted ones, then to newer ones'. I know this is not how SO comments are shown, but you've got to define your requirements more precisely.
Here is my query:
    declare @postId int;
    set @postId = ?;
    with cteQuestionAndReponses as (
      select post_id
      from Posts
      where post_id = @postId
      union all
      select post_id
      from Posts
      where parent_id = @postId)
    select * from
      cteQuestionAndReponses p
      outer apply (
        select count(*) as CommentsCount
        from Comments c
        where is_deleted = 0
          and c.post_id = p.post_id) as cc
      outer apply (
        select top(5) *
        from Comments c
        where is_deleted = 0
          and p.post_id = c.post_id
        order by upvotes desc, date desc
      ) as c
With some 14k posts and 67k comments in my test tables, the query gets the posts in 7 ms:
    Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
      CPU time = 0 ms, elapsed time = 7 ms.
Here is the schema I tested with:
    create table Posts (
      post_id int identity (1,1) not null
      , content varchar(max) not null
      , parent_id int null -- (null for questions, question_id for answer)
      , constraint fkPostsParent_id
          foreign key (parent_id)
          references Posts(post_id)
      , constraint pkPostsId primary key nonclustered (post_id)
    );
    create clustered index cdxPosts on
      Posts(parent_id, post_id);
    go

    create table Comments (
      comment_id int identity(1,1) not null
      , body varchar(max) not null
      , is_deleted bit not null default 0
      , post_id int not null
      , upvotes int not null default 0
      , date datetime not null default getutcdate()
      , constraint pkComments primary key nonclustered (comment_id)
      , constraint fkCommentsPostId
          foreign key (post_id)
          references Posts(post_id)
    );
    create clustered index cdxComments on
      Comments (is_deleted, post_id, upvotes, date, comment_id);
    go
And here is my test data generation:
    insert into Posts (content)
    select 'Lorem Ipsum'
    from master..spt_values;

    -- up to 9 answers per question; abs() keeps the TOP argument non-negative
    insert into Posts (content, parent_id)
    select 'Ipsum Lorem', post_id
    from Posts p
    cross apply (
      select top(abs(checksum(newid(), p.post_id)) % 10) Number
      from master..spt_values) as r
    where parent_id is NULL;

    insert into Comments (body, is_deleted, post_id, upvotes, date)
    select 'Sit Amet'
      -- 5% deleted comments
      , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
      , p.post_id
      -- up to 9 upvotes
      , abs(checksum(newid(), p.post_id, r.Number)) % 10
      -- comments up to 1 year old
      , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate())
    from Posts p
    cross apply (
      select top(abs(checksum(newid(), p.post_id)) % 10) Number
      from master..spt_values) as r;
#3
1
Use:
    WITH post_hierarchy AS (
      SELECT p.id,
             p.content,
             p.parent_id,
             1 AS post_level
        FROM POSTS p
       WHERE p.parent_id IS NULL
      UNION ALL
      SELECT p.id,
             p.content,
             p.parent_id,
             ph.post_level + 1 AS post_level
        FROM POSTS p
        JOIN post_hierarchy ph ON ph.id = p.parent_id)
    SELECT ph.id,
           ph.post_level,
           c.upvotes,
           c.body
      FROM COMMENTS c
      JOIN post_hierarchy ph ON ph.id = c.post_id
    ORDER BY ph.post_level, c.date
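
To see what a query of this shape returns, here is a small sketch that runs an equivalent statement on an in-memory SQLite database through Python's sqlite3 module. SQLite is only a stand-in here, not the engine the answer was written for; it spells the recursive CTE WITH RECURSIVE, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT, parent_id INTEGER);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, post_id INTEGER,
                           upvotes INTEGER, date TEXT);
    -- Hypothetical sample data: question 1 with answers 2 and 3.
    INSERT INTO posts VALUES (1, 'question', NULL), (2, 'answer A', 1), (3, 'answer B', 1);
    INSERT INTO comments VALUES
        (10, 'on the question', 1, 0, '2009-11-01'),
        (11, 'on answer A',     2, 2, '2009-11-03'),
        (12, 'on answer B',     3, 0, '2009-11-02');
""")

rows = conn.execute("""
    WITH RECURSIVE post_hierarchy(id, post_level) AS (
        -- anchor member: questions sit at level 1
        SELECT id, 1 FROM posts WHERE parent_id IS NULL
        UNION ALL
        -- recursive member: each child is one level below its parent
        SELECT p.id, ph.post_level + 1
        FROM posts p
        JOIN post_hierarchy ph ON ph.id = p.parent_id
    )
    SELECT ph.id, ph.post_level, c.upvotes, c.body
    FROM comments c
    JOIN post_hierarchy ph ON ph.id = c.post_id
    ORDER BY ph.post_level, c.date
""").fetchall()

for row in rows:
    print(row)
```

With this data the two level-2 comments come back in date order, so comments on different answers interleave; the application would still need to group the rows by post id.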
Couple of things to be aware of:
- Stack Overflow displays the first 5 comments, regardless of whether they were upvoted. Subsequent comments that were upvoted are displayed immediately as well.
- You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted will only return the first five rows based on the ORDER BY clause.
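
One possible way around a SELECT per post, on engines that support window functions, is to number the comments within each post and keep the first five. This is a hedged sketch and not part of the answer above: it uses Python's sqlite3 module (window functions need SQLite 3.25 or later), made-up sample data, and an assumed "most upvoted first, then newest" ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT,
                           is_deleted INTEGER, upvotes INTEGER, date TEXT)
""")
# Hypothetical data: post 1 has seven live comments, post 2 has two.
conn.executemany(
    "INSERT INTO comments (post_id, body, is_deleted, upvotes, date) "
    "VALUES (?, ?, 0, ?, ?)",
    [(1, f"c{n}", n % 3, f"2009-11-{n:02d}") for n in range(1, 8)]
    + [(2, "d1", 0, "2009-11-01"), (2, "d2", 5, "2009-11-02")],
)

rows = conn.execute("""
    SELECT post_id, id, total
    FROM (
        SELECT post_id, id,
               -- rank comments inside each post
               ROW_NUMBER() OVER (PARTITION BY post_id
                                  ORDER BY upvotes DESC, date DESC) AS rn,
               -- per-post count of live comments, repeated on every row
               COUNT(*) OVER (PARTITION BY post_id) AS total
        FROM comments
        WHERE is_deleted = 0
    ) AS ranked
    WHERE rn <= 5
    ORDER BY post_id, rn
""").fetchall()

for row in rows:
    print(row)
```

The total column carries the per-post comment count, so a page could show "N more comments" as total minus the number of rows returned for that post.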