每个组的最小行数

时间:2020-12-03 12:27:58

The following is my query:

以下是我的疑问:

SELECT *
FROM (
    SELECT f.max, f.min, p.user_id, p.id, p.title, p.rating,
    RANK() OVER (
        PARTITION BY p.user_id
        ORDER BY p.rating DESC, p.id DESC
    ) AS rnk
    FROM posts AS p
    INNER JOIN friends AS f ON (p.user_id = f.friend_id)
    WHERE f.user_id=1
) AS subq
WHERE (subq.rnk <= subq.max)
LIMIT 10

It searches for posts of my friends, sorted by their rating and date. The window function implemented in this query lets me limit the number of rows returned for each friend according to the MAX field on Friends table.

它搜索我朋友的帖子,根据他们的评级和日期排序。这个查询中实现的窗口函数允许我根据Friends表上的MAX字段限制每个朋友返回的行数。

However, I also have a field MIN, which is used to specify the minimum number of posts I want from the query for a given friend. How is that possible?

但是,我也有一个字段MIN,它用于指定我希望从查询中获得的给定朋友的最少数量。这怎么可能?

I also wonder if SQL is the best option for those types of queries? I already tried Neo4j Graph database, and while it seemed as a good solution, I would rather avoid using 2 separate databases.

我还想知道SQL是否是这类查询的最佳选择?我已经尝试过Neo4j图形数据库,虽然它看起来是一个很好的解决方案,但我宁愿避免使用两个单独的数据库。

SQLFiddle

SQLFiddle

Schema:

模式:

CREATE TABLE friends(
    user_id int,
    friend_id int,
    min int,
    max int
);

CREATE TABLE posts(
   id int,
   title varchar(255),
   rating int,
   date date,
   user_id int
);

Suppose we have the following data:

假设我们有以下数据:

INSERT INTO friends VALUES
  (1,2,1,3)
, (1,3,0,5)
, (1,4,2,10);

INSERT INTO posts VALUES
  (1,  'posts1',  2,  now(), 2)
, (2,  'posts2',  1,  now(), 2)
, (3,  'posts3',  5,  now(), 2)
, (4,  'posts4',  2,  now(), 2)
, (5,  'posts5',  11, now(), 2)
, (6,  'posts6',  7,  now(), 2)
, (7,  'posts7',  3,  now(), 2)
, (8,  'posts8',  4,  now(), 3)
, (9,  'posts9',  1,  now(), 3)
, (10, 'posts10', 0,  now(), 3)
, (11, 'posts11', 7,  now(), 3)
, (12, 'posts12', 3,  now(), 3)
, (13, 'posts13', 2,  now(), 3)
, (14, 'posts14', 4,  now(), 4)
, (15, 'posts15', 9,  now(), 4)
, (16, 'posts16', 0,  now(), 4)
, (17, 'posts17', 3,  now(), 4)
, (18, 'posts18', 2,  now(), 4)
, (19, 'posts19', 1,  now(), 4)
, (20, 'posts20', 2,  now(), 4);

Hence I would like to see (post_id, title, rating, date, friend_id) combinations with the following conditions, if possible:

因此,如果可能的话,我希望看到(post_id、title、rate、date、friend_id)与以下条件的组合:

  1. between 1 and 3 posts from the friend with id=2
  2. id=2的朋友在1到3个帖子之间
  3. between 0 and 5 posts from the friend with id=3
  4. id=3的朋友在0到5个帖子之间
  5. between 2 and 10 posts from the friend with id=4
  6. id=4的朋友发送2到10个帖子

So basically, if my friend with friend_id=2 posted 1 or more articles, I want at least 2 of them. If he posted more than 3 articles, I want no more than 3.

基本上,如果我的friend_id=2的朋友发表了1篇或更多的文章,我至少想要2篇。如果他发表的文章超过3篇,我希望不超过3篇。

3 个解决方案

#1


5  

Let's say I want to have 2-5 posts from you everyday, if you post that much. If you post only one, it is alright, and I will have the only one post.

假设我想每天从你那里收到2-5个帖子,如果你发那么多的话。如果你只发布一个,它是好的,而且我将只有一个帖子。

Your explanation in the comment still does not add up. Your min number would be noise without effect according to this explanation.

你在评论中所作的解释仍不充分。根据这个解释,你的最小数量将是没有影响的噪音。

This is not what you wrote, but this would make sense:

这不是你写的,但这是有意义的:

Given a maximum of display slots for posts (the outer LIMIT), I want to get min posts from each friend first (if available). If there are free slots after that, fill up with up to max posts per friend.

给定最大的帖子显示槽(外部限制),我希望首先从每个朋友那里获得最小的帖子(如果有的话)。如果之后有免费的插槽,每个朋友最多可以填满最多的帖子。

In the example that would be 1 (min) post from friend 2 with top priority and another 2 (max - min) posts if more slots are still available.

在这个示例中,如果还有更多的插槽,那么将会有1 (min)来自好友2的优先级和另外2 (max - min)的帖子。

It would be arbitrary which posts make the cut if there are not enough slots for each priority. I went ahead and assumed that the first post from each should be selected first, etc.

如果每个优先级都没有足够的位置,那么哪个职位会被裁掉,这将是任意的。我继续往前走,假设每个人的第一篇文章应该首先被选择,等等。

The rest is still arbitrary, but can be solved easily if you manage to formulate a requirement.

其余的仍然是任意的,但是如果您设法制定一个需求,就可以很容易地解决。

SELECT *
FROM   friends f
,  LATERAL (
   SELECT *
        , row_number() OVER (ORDER  BY rating DESC NULLS LAST, id DESC) AS rn
   FROM   posts p
   WHERE  user_id = f.friend_id  -- LATERAL reference
   ORDER  BY rating DESC NULLS LAST, date DESC NULLS LAST
   LIMIT  f.max  -- LATERAL reference
   ) p
WHERE  f.user_id = 1
ORDER  BY (p.rn > f.min)  -- minimum posts from each first
        , p.rn
LIMIT  10;  -- arbitrary total maximum

SQL Fiddle.

SQL小提琴。

Notes

  • Assuming friends.user_id and posts.id to be primary keys. Your table definition is lacking there.

    假设的朋友。user_id和帖子。id应该是主键。这里缺少表定义。

  • All other columns should be defined NOT NULL to make sense.

    所有其他列都应该定义为非NULL。

  • Use a LATERAL join to select only max postings per friend in the subquery:

    使用横向连接在子查询中只选择每个朋友的最大帖子:

  • Use row_number(), not rank() in the subquery. It's a common mistake to confuse both.

    在子查询中使用row_number(),而不是rank()。混淆两者是很常见的错误。

  • You mentioned date but it did not show in your query. Maybe you really want:

    您提到了日期,但它没有显示在您的查询中。也许你真正想要的:

    , row_number() OVER (ORDER  BY rating DESC NULLS LAST
                                 , date   DESC NULLS LAST) AS rn
    
  • DESC NULLS LAST only because rating and date could be NULL:

    DESC NULLS仅因为评级和日期可以为空:

  • In Postgres, you can use a simple boolean expression in ORDER BY:

    在Postgres中,可以按以下顺序使用简单的布尔表达式:

    ORDER  BY (p.rn > f.min), p.rn
    

    That puts min posts per friend first. The second item (p.rn) gives each friend an equal chance (first post first etc.).

    这就把每个朋友的帖子放在第一位。第二项(p.rn)给了每个朋友平等的机会(第一名等)。

  • Don't use date as identifier. It's a reserved word in standard SQL and a basic type name in Postgres.

    不要使用日期作为标识符。它是标准SQL中的保留字,是Postgres中的基本类型名。

#2


0  

Am I missing something? Can't you just add and subq.rnk >= subq.min to the where clause?

我遗漏了什么东西?你就不能直接加和减q吗?rnk > = subq。最小的是where子句?

SELECT *
FROM (SELECT f.max, f.min, p.friend_id, p.id, p.title, p.rating
             row_number() over (partition by p.user_id
                                order by p.rating DESC, p.id DESC
                               ) as rnk
      FROM posts p INNER JOIN
           friends f
           ON p.friend_id = f.user_id
      WHERE f.user_id = 1
     )  subq
WHERE subq.rnk <= subq.max and subq.rnk >= subq.min

#3


0  

I think from a neo4j/cypher perspective, this is really what you want to do...

我认为从neo4j/cypher的角度来看,这才是你真正想做的……

match (u:User {id: 1})-[r:FOLLOWS]->(p:Publisher)
with u, p, r
match p-[:PUBLISHED]-(i:Item)
with u, p, r, i
order by i.name
return u.name, p.name, i.name
skip 5
limit 2

You just would need to parameterize out the min and max and bind them at run time which would involve two queries rather than one but I think this is still an elegant solution. I have tried to include properties in skip and limit before but cypher obviously does not support that (yet). It wants a parameter or an unsigned integer.

您只需要将最小和最大值参数化,并在运行时绑定它们,这将涉及两个查询,而不是一个查询,但我认为这仍然是一个优雅的解决方案。我曾经尝试过在skip和limit中包含属性,但是cypher显然不支持这个(现在)。它需要一个参数或一个无符号整数。

match (u:User {id: 1})-[r:FOLLOWS]->(p:Publisher)
with u, p, r
match p-[:PUBLISHED]-(i:Item)
with u, p, r, i
order by i.name
return u.name, p.name, i.name
skip {min}
limit {max}

#1


5  

Let's say I want to have 2-5 posts from you everyday, if you post that much. If you post only one, it is alright, and I will have the only one post.

假设我想每天从你那里收到2-5个帖子,如果你发那么多的话。如果你只发布一个,它是好的,而且我将只有一个帖子。

Your explanation in the comment still does not add up. Your min number would be noise without effect according to this explanation.

你在评论中所作的解释仍不充分。根据这个解释,你的最小数量将是没有影响的噪音。

This is not what you wrote, but this would make sense:

这不是你写的,但这是有意义的:

Given a maximum of display slots for posts (the outer LIMIT), I want to get min posts from each friend first (if available). If there are free slots after that, fill up with up to max posts per friend.

给定最大的帖子显示槽(外部限制),我希望首先从每个朋友那里获得最小的帖子(如果有的话)。如果之后有免费的插槽,每个朋友最多可以填满最多的帖子。

In the example that would be 1 (min) post from friend 2 with top priority and another 2 (max - min) posts if more slots are still available.

在这个示例中,如果还有更多的插槽,那么将会有1 (min)来自好友2的优先级和另外2 (max - min)的帖子。

It would be arbitrary which posts make the cut if there are not enough slots for each priority. I went ahead and assumed that the first post from each should be selected first, etc.

如果每个优先级都没有足够的位置,那么哪个职位会被裁掉,这将是任意的。我继续往前走,假设每个人的第一篇文章应该首先被选择,等等。

The rest is still arbitrary, but can be solved easily if you manage to formulate a requirement.

其余的仍然是任意的,但是如果您设法制定一个需求,就可以很容易地解决。

SELECT *
FROM   friends f
,  LATERAL (
   SELECT *
        , row_number() OVER (ORDER  BY rating DESC NULLS LAST, id DESC) AS rn
   FROM   posts p
   WHERE  user_id = f.friend_id  -- LATERAL reference
   ORDER  BY rating DESC NULLS LAST, date DESC NULLS LAST
   LIMIT  f.max  -- LATERAL reference
   ) p
WHERE  f.user_id = 1
ORDER  BY (p.rn > f.min)  -- minimum posts from each first
        , p.rn
LIMIT  10;  -- arbitrary total maximum

SQL Fiddle.

SQL小提琴。

Notes

  • Assuming friends.user_id and posts.id to be primary keys. Your table definition is lacking there.

    假设的朋友。user_id和帖子。id应该是主键。这里缺少表定义。

  • All other columns should be defined NOT NULL to make sense.

    所有其他列都应该定义为非NULL。

  • Use a LATERAL join to select only max postings per friend in the subquery:

    使用横向连接在子查询中只选择每个朋友的最大帖子:

  • Use row_number(), not rank() in the subquery. It's a common mistake to confuse both.

    在子查询中使用row_number(),而不是rank()。混淆两者是很常见的错误。

  • You mentioned date but it did not show in your query. Maybe you really want:

    您提到了日期,但它没有显示在您的查询中。也许你真正想要的:

    , row_number() OVER (ORDER  BY rating DESC NULLS LAST
                                 , date   DESC NULLS LAST) AS rn
    
  • DESC NULLS LAST only because rating and date could be NULL:

    DESC NULLS仅因为评级和日期可以为空:

  • In Postgres, you can use a simple boolean expression in ORDER BY:

    在Postgres中,可以按以下顺序使用简单的布尔表达式:

    ORDER  BY (p.rn > f.min), p.rn
    

    That puts min posts per friend first. The second item (p.rn) gives each friend an equal chance (first post first etc.).

    这就把每个朋友的帖子放在第一位。第二项(p.rn)给了每个朋友平等的机会(第一名等)。

  • Don't use date as identifier. It's a reserved word in standard SQL and a basic type name in Postgres.

    不要使用日期作为标识符。它是标准SQL中的保留字,是Postgres中的基本类型名。

#2


0  

Am I missing something? Can't you just add and subq.rnk >= subq.min to the where clause?

我遗漏了什么东西?你就不能直接加和减q吗?rnk > = subq。最小的是where子句?

SELECT *
FROM (SELECT f.max, f.min, p.friend_id, p.id, p.title, p.rating
             row_number() over (partition by p.user_id
                                order by p.rating DESC, p.id DESC
                               ) as rnk
      FROM posts p INNER JOIN
           friends f
           ON p.friend_id = f.user_id
      WHERE f.user_id = 1
     )  subq
WHERE subq.rnk <= subq.max and subq.rnk >= subq.min

#3


0  

I think from a neo4j/cypher perspective, this is really what you want to do...

我认为从neo4j/cypher的角度来看,这才是你真正想做的……

match (u:User {id: 1})-[r:FOLLOWS]->(p:Publisher)
with u, p, r
match p-[:PUBLISHED]-(i:Item)
with u, p, r, i
order by i.name
return u.name, p.name, i.name
skip 5
limit 2

You just would need to parameterize out the min and max and bind them at run time which would involve two queries rather than one but I think this is still an elegant solution. I have tried to include properties in skip and limit before but cypher obviously does not support that (yet). It wants a parameter or an unsigned integer.

您只需要将最小和最大值参数化,并在运行时绑定它们,这将涉及两个查询,而不是一个查询,但我认为这仍然是一个优雅的解决方案。我曾经尝试过在skip和limit中包含属性,但是cypher显然不支持这个(现在)。它需要一个参数或一个无符号整数。

match (u:User {id: 1})-[r:FOLLOWS]->(p:Publisher)
with u, p, r
match p-[:PUBLISHED]-(i:Item)
with u, p, r, i
order by i.name
return u.name, p.name, i.name
skip {min}
limit {max}