How do I query comments, Stack Overflow style?

I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments

I wanted to set the record straight and ask the question in a proper technical way.

Say I have 2 tables:

Posts
 id
 content
 parent_id           (null for questions, question_id for answer)  

Comments
 id 
 body 
 is_deleted
 post_id 
 upvotes 
 date

Note: I think this is how the schema for SO is set up: answers have a parent_id that points to the question, and questions have null there. Questions and answers are stored in the same table.

How do I pull out comments Stack Overflow style in a very efficient way with minimal round trips?

The rules:

A single query should pull out all the comments needed to render a page with multiple posts.
It needs to pull out only 5 comments per answer, with preference given to upvoted comments.
It needs to provide enough information to tell the user that there are more comments beyond the 5 shown, along with the actual count (e.g. "2 more comments").
Sorting is really hairy for comments, as you can see from the comments on this question. The rule is to display comments by date; HOWEVER, if a comment has an upvote it gets preferential treatment and is displayed as well, at the bottom of the list. (This is nasty hard to express in SQL.)

If any denormalizations make stuff better what are they? What indexes are critical?

3 solutions

#1

I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.

It's actually pretty infrequent that a given post has more than five comments and so needs filtering at all. In Stack Overflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer comments. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.

So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.

You could do it this way:

SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
  ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
  ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;

But this gives you redundant copies of q and a columns, which is significant because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the app becomes significant.

So it's probably better to do this in two queries instead. Given that the client is viewing a Question with PostId = 1234, fetch the comments with the following:

SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL 
    SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
  ON (c.PostId = p.PostId);

And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.
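
The application-side step might look roughly like the sketch below. It is only an illustration under assumptions that are not in the query above: it assumes each fetched row also carries a comment id and an upvote count, it treats "most interesting" as "most upvoted, then oldest", and the function name pick_comments is hypothetical.

```python
from collections import defaultdict


def pick_comments(rows, limit=5):
    """Group comments by post and keep at most `limit` per post.

    rows: iterable of (post_id, comment_id, upvotes, body) tuples.
    This row shape is an assumption; the query above selects only
    PostId and Text.
    """
    by_post = defaultdict(list)
    for post_id, comment_id, upvotes, body in rows:
        by_post[post_id].append((comment_id, upvotes, body))

    result = {}
    for post_id, comments in by_post.items():
        # Assumed ranking: most upvotes first, then lowest id (oldest) first.
        ranked = sorted(comments, key=lambda c: (-c[1], c[0]))
        # Display the chosen comments in id order (oldest first).
        shown = sorted(ranked[:limit], key=lambda c: c[0])
        result[post_id] = {
            "shown": shown,
            "hidden": len(comments) - len(shown),  # for an "N more comments" link
        }
    return result
```

The "hidden" count gives the page enough information to render a "2 more comments" link without a second query.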

I tested these two queries against a MySQL 5.1 database loaded with Stack Overflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).

The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)

#2

The real question is not the query but the schema, especially the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as 'pull 5 comments per post (answer or question), giving preference to upvoted ones, then to newer ones'. I know this is not how SO comments are shown, but you have to define your requirements more precisely.

Here is my query:

declare @postId int;
set @postId = ?;

with cteQuestionAndReponses as (
  select post_id
  from Posts
  where post_id = @postId
  union all
  select post_id
  from Posts
  where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
  select count(*) as CommentsCount
  from Comments c 
  where is_deleted = 0
  and c.post_id = p.post_id) as cc
outer apply (
  select top(5) *
  from Comments c 
  where is_deleted = 0
  and p.post_id = c.post_id
  order by upvotes desc, date desc
  ) as c

I have some 14k posts and 67k comments in my test tables, and the query gets the posts in 7 ms:

Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 7 ms.

Here is the schema I tested with:

create table Posts (
 post_id int identity (1,1) not null
 , content varchar(max) not null
 , parent_id int null -- (null for questions, question_id for answer) 
 , constraint fkPostsParent_id 
    foreign key (parent_id)
    references Posts(post_id)
 , constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on 
  Posts(parent_id, post_id);
go

create table Comments (
 comment_id int identity(1,1) not null
 , body varchar(max) not null
 , is_deleted bit not null default 0
 , post_id int not null
 , upvotes int not null default 0
 , date datetime not null default getutcdate()
 , constraint pkComments primary key nonclustered (comment_id)
 , constraint fkCommentsPostId
    foreign key (post_id)
    references Posts(post_id)
 );
create clustered index cdxComments on 
  Comments (is_deleted, post_id,  upvotes, date, comment_id);
go

and here is my test data generation:

insert into Posts (content)
select 'Lorem Ipsum' 
from master..spt_values;

insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number  -- abs(): TOP rejects negative values
  from master..spt_values) as r
where parent_id is NULL  

insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
  -- 5% deleted comments
  , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
  , p.post_id
  -- up to 10 upvotes
  , abs(checksum(newid(), p.post_id, r.Number)) % 10
  -- up to 1 year old posts
  , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate()) 
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r

#3

Use:

WITH post_hierarchy AS (
  SELECT p.id,
         p.content,
         p.parent_id,
         1 AS post_level
    FROM POSTS p
   WHERE p.parent_id IS NULL
  UNION ALL
  SELECT p.id,
         p.content,
         p.parent_id,
         ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)  
SELECT ph.id, 
       ph.post_level,
       c.upvotes,
       c.body
  FROM COMMENTS c
  JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date

Couple of things to be aware of:

Stack Overflow displays the first 5 comments, regardless of whether they were upvoted. Subsequent comments that were upvoted are immediately displayed.
You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted will only return the first five rows based on the ORDER BY clause.
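
On engines that support window functions, one possible alternative to a SELECT per post is to number the comments within each post and keep the first five. The sketch below is not the query from this answer. It swaps in ROW_NUMBER() and COUNT(*) OVER, and runs against an in-memory SQLite database (assuming SQLite 3.25 or later) purely so that it is self-contained. The sample rows and the upvotes-then-newest ordering are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Comments (
        id INTEGER PRIMARY KEY,
        body TEXT,
        is_deleted INTEGER NOT NULL DEFAULT 0,
        post_id INTEGER NOT NULL,
        upvotes INTEGER NOT NULL DEFAULT 0,
        date TEXT NOT NULL)
""")

# Made-up sample data: post 10 has seven comments, post 11 has two.
sample = [(10, f"c{i}", i % 3, f"2009-10-{i:02d}") for i in range(1, 8)]
sample += [(11, "x", 0, "2009-10-01"), (11, "y", 1, "2009-10-02")]
conn.executemany(
    "INSERT INTO Comments (post_id, body, upvotes, date) VALUES (?, ?, ?, ?)",
    sample,
)

# Rank comments inside each post, keep the top five, and carry the
# per-post total so the page can show how many comments are hidden.
TOP_N_SQL = """
SELECT post_id, body, upvotes, total
FROM (
    SELECT post_id, body, upvotes, date,
           ROW_NUMBER() OVER (PARTITION BY post_id
                              ORDER BY upvotes DESC, date DESC) AS rn,
           COUNT(*) OVER (PARTITION BY post_id) AS total
    FROM Comments
    WHERE is_deleted = 0
) AS ranked
WHERE rn <= 5
ORDER BY post_id, date
"""
rows = conn.execute(TOP_N_SQL).fetchall()
```

Each returned row carries the total for its post, so "total minus rows shown" gives the number for the "more comments" link. Whether this beats filtering in application code would need measuring on real data.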
