How do I query comments Stack Overflow style?

Date: 2022-03-20 09:16:56

I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments

I wanted to set the record straight and ask the question in a proper technical way.

Say I have 2 tables:

Posts
 id
 content
 parent_id           (null for questions, question_id for answer)  

Comments
 id 
 body 
 is_deleted
 post_id 
 upvotes 
 date 

Note: I think this is how the schema for SO is set up: answers have a parent_id that points to the question, and questions have null there. Questions and answers are stored in the same table.

How do I pull out comments Stack Overflow style in a very efficient way, with minimal round trips?

The rules:

  1. A single query should pull out all the comments needed to render a page with multiple posts.
  2. It needs to pull out only 5 comments per answer, with preference for upvoted ones.
  3. It needs to provide enough information to tell the user that there are more comments beyond the 5 shown, along with the actual count (e.g. "2 more comments").
  4. Sorting is really hairy for comments, as you can see on the comments in this question. The rule is to display comments by date; HOWEVER, if a comment has an upvote it gets preferential treatment and is displayed as well, at the bottom of the list. (This is nasty hard to express in SQL.)
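
To make the rules above concrete, here is a minimal Python sketch of one possible reading of them for a single post: choose up to five comments, preferring upvoted ones, then display the chosen comments by date and report how many were left out. The `Comment` type and the `visible_comments` helper are hypothetical names used only for illustration, and the tie-breaking on date is an assumption, since the rules do not spell it out.

```python
from datetime import datetime
from typing import NamedTuple


class Comment(NamedTuple):
    # Hypothetical shape of one non-deleted comment row.
    id: int
    upvotes: int
    date: datetime


def visible_comments(comments, limit=5):
    """Return (shown, hidden_count) for one post's non-deleted comments."""
    # Prefer comments with more upvotes; break ties by the earlier date.
    preferred = sorted(comments, key=lambda c: (-c.upvotes, c.date))[:limit]
    # Display the chosen comments in chronological order.
    shown = sorted(preferred, key=lambda c: c.date)
    # The remainder drives the "n more comments" link.
    return shown, len(comments) - len(shown)
```

With seven comments where only the newest has upvotes, this reading shows the four oldest plus the upvoted one and reports two hidden comments.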

If any denormalizations would make things better, what are they? Which indexes are critical?

3 answers

#1


4  

I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.

It's actually pretty infrequent for a given post to have more than five comments, so that they need to be filtered. In Stack Overflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.

So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.

You could do it this way:

SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
  ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
  ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;

But this gives you redundant copies of the q and a columns, which matters because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the app becomes significant.

So it's probably better to do this in two queries. Instead, given that the client is viewing a Question with PostId = 1234, do the following:

SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL 
    SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
  ON (c.PostId = p.PostId);

And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.
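
As a rough illustration of that application-side step, the Python sketch below groups a flat result set by post and keeps at most five comments per post. It assumes the query is widened to return `(PostId, CommentId, Upvotes, Text)` tuples, which is more than the two columns selected above, and it assumes "most interesting" means most upvotes. `top_comments_per_post` is a hypothetical helper name.

```python
import heapq
from collections import defaultdict


def top_comments_per_post(rows, limit=5):
    """Group (post_id, comment_id, upvotes, text) rows by post.

    Returns {post_id: (kept_rows, hidden_count)}.
    """
    by_post = defaultdict(list)
    for row in rows:
        by_post[row[0]].append(row)

    result = {}
    for post_id, comments in by_post.items():
        # Assumption: most upvotes wins; ties go to the lower comment id.
        kept = heapq.nlargest(limit, comments, key=lambda r: (r[2], -r[1]))
        # Show the kept comments in comment-id order.
        kept.sort(key=lambda r: r[1])
        result[post_id] = (kept, len(comments) - len(kept))
    return result
```

Because most posts have five or fewer comments, the loop usually keeps everything and the hidden count is zero.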

I tested these two queries against a MySQL 5.1 database loaded with Stack Overflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).

The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)

#2


1  

The real question is not the query but the schema, especially the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as 'pull 5 comments per post (answer or question) and give preference to upvoted ones, then to newer ones'. I know this is not how SO comments are shown, but you need to define your requirements more precisely.

Here is my query:

declare @postId int;
set @postId = ?;

with cteQuestionAndReponses as (
  select post_id
  from Posts
  where post_id = @postId
  union all
  select post_id
  from Posts
  where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
  select count(*) as CommentsCount
  from Comments c 
  where is_deleted = 0
  and c.post_id = p.post_id) as cc
outer apply (
  select top(5) *
  from Comments c 
  where is_deleted = 0
  and p.post_id = c.post_id
  order by upvotes desc, date desc
  ) as c

I have some 14k posts and 67k comments in my test tables, and the query gets the posts in 7 ms:

Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 7 ms.
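
For engines without OUTER APPLY, the same "5 per post plus a total count" shape can be sketched with window functions instead. This is a swapped-in technique, not the query above. The Python snippet runs it against an in-memory SQLite database, assuming a SQLite build with window function support (3.25 or later); the sample rows and the `demo` function are made up for illustration.

```python
import sqlite3

QUERY = """
WITH ranked AS (
  SELECT post_id, comment_id,
         ROW_NUMBER() OVER (PARTITION BY post_id
                            ORDER BY upvotes DESC, date DESC) AS rn,
         COUNT(*) OVER (PARTITION BY post_id) AS comments_count
  FROM Comments
  WHERE is_deleted = 0
)
SELECT post_id, comment_id, comments_count
FROM ranked
WHERE rn <= 5
ORDER BY post_id, rn
"""


def demo():
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE Comments (
        comment_id INTEGER PRIMARY KEY, post_id INTEGER, is_deleted INTEGER,
        upvotes INTEGER, date TEXT)""")
    # Seven live comments on post 1, one live and one deleted on post 2.
    rows = [(i, 1, 0, i % 3, f"2009-01-{i:02d}") for i in range(1, 8)]
    rows.append((8, 2, 0, 0, "2009-01-08"))
    rows.append((9, 2, 1, 5, "2009-01-09"))  # deleted, so it is ignored
    con.executemany("INSERT INTO Comments VALUES (?, ?, ?, ?, ?)", rows)
    return con.execute(QUERY).fetchall()
```

Each returned row carries the post's total live comment count, so the caller can compute the "n more comments" figure as `comments_count` minus the number of rows returned for that post.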

Here is the schema I tested with:

create table Posts (
 post_id int identity (1,1) not null
 , content varchar(max) not null
 , parent_id int null -- (null for questions, question_id for answer) 
 , constraint fkPostsParent_id 
    foreign key (parent_id)
    references Posts(post_id)
 , constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on 
  Posts(parent_id, post_id);
go

create table Comments (
 comment_id int identity(1,1) not null
 , body varchar(max) not null
 , is_deleted bit not null default 0
 , post_id int not null
 , upvotes int not null default 0
 , date datetime not null default getutcdate()
 , constraint pkComments primary key nonclustered (comment_id)
 , constraint fkCommentsPostId
    foreign key (post_id)
    references Posts(post_id)
 );
create clustered index cdxComments on 
  Comments (is_deleted, post_id,  upvotes, date, comment_id);
go
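
To see why that column order matters, the Python sketch below builds a comparable composite index in SQLite and asks the planner how it would run the per-post "top 5" lookup. SQLite is only a stand-in here: it has no clustered indexes in the SQL Server sense, so this is an ordinary secondary index. The index name `idx_comments_lookup` is hypothetical, and the exact plan wording varies between SQLite versions.

```python
import sqlite3


def plan_for_top_comments():
    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE Comments (
        comment_id INTEGER PRIMARY KEY, body TEXT, is_deleted INTEGER,
        post_id INTEGER, upvotes INTEGER, date TEXT)""")
    # Equality columns first, then the columns used for ordering.
    con.execute("""CREATE INDEX idx_comments_lookup
                   ON Comments (is_deleted, post_id, upvotes, date)""")
    plan = con.execute("""EXPLAIN QUERY PLAN
        SELECT body FROM Comments
        WHERE is_deleted = 0 AND post_id = ?
        ORDER BY upvotes DESC, date DESC
        LIMIT 5""", (1234,)).fetchall()
    # The last column of each plan row holds the human-readable detail.
    return " ".join(row[-1] for row in plan)
```

Printing the returned string should show a search on the index using the two equality columns, which is the access pattern the clustered index above is designed for.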

And here is my test data generation:

insert into Posts (content)
select 'Lorem Ipsum' 
from master..spt_values;

insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r
where parent_id is NULL  

insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
  -- roughly 5% deleted comments (values 95..99 out of 0..99)
  , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
  , p.post_id
  -- up to 10 upvotes
  , abs(checksum(newid(), p.post_id, r.Number)) % 10
  -- up to 1 year old posts
  , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate()) 
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r

#3


1  

Use:

WITH post_hierarchy AS (
  SELECT p.id,
         p.content,
         p.parent_id,
         1 AS post_level
    FROM POSTS p
   WHERE p.parent_id IS NULL
  UNION ALL
  SELECT p.id,
         p.content,
         p.parent_id,
         ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)  
SELECT ph.id, 
       ph.post_level,
       c.upvotes,
       c.body
  FROM COMMENTS c
  JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date

A couple of things to be aware of:

  1. Stack Overflow displays the first 5 comments, regardless of whether they were upvoted. Subsequent comments that were upvoted are displayed immediately.
  2. You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted will only return the first five rows overall, based on the ORDER BY clause.
