GROUP BY和聚合顺序数值

时间:2022-03-01 01:19:44

Using PostgreSQL 9.0.

使用PostgreSQL 9.0。

Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:

假设我有一个包含字段的表:公司,专业和年份。我想返回一个包含独特公司和专业的结果,但基于数字序列聚合(到一个数组中很好)年份:

Example Table:

示例表:

+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+-----------------------------+

I'm interested in a query which would output rows similar to the following:

我对一个输出类似于以下行的查询感兴趣:

+-----------------------------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+-----------------------------------------+

The essential feature is that only consecutive years shall be grouped together.

基本特征是只有连续年份才能组合在一起。

3 个解决方案

#1


9  

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

@ a_horse_with_no_name的答案有很多价值,作为一个正确的解决方案,就像我在评论中已经说过的那样,是学习如何在PostgreSQL中使用不同类型的窗口函数的好材料。

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

然而,我不禁感到在这个答案中所采用的方法对于像这样的问题来说有点过分了。基本上,在进行数组聚合年之前,您需要的是一个额外的分组标准。你已经有了公司和专业,现在你只需要一些东西来区分属于不同序列的年份。

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

这就是上面提到的答案提供的内容,而这正是我认为可以用更简单的方式完成的。就是这样:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID

#2


18  

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).

识别非连续值总是有点棘手,涉及几个嵌套的子查询(至少我不能提出更好的解决方案)。

The first step is to identify non-consecutive values for the year:

第一步是确定年份的非连续值:

Step 1) Identify non-consecutive values

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

This returns the following result:

这将返回以下结果:

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:

现在使用group_cnt值,我们可以为连续年份的每个组创建“组ID”:

Step 2) Define group IDs

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

This returns the following result:

这将返回以下结果:

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:

正如您所看到的,每个“group”都有自己的group_nr,我们最终可以通过添加另一个派生表来聚合:

Step 3) Final query

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

This returns the following result:

这将返回以下结果:

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

Which is exactly what you wanted, if I'm not mistaken.

如果我没弄错的话,这正是你想要的。

#3


4  

Procedural solution with PL/pgSQL

The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with plpgsql can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:

对于具有聚合/窗口函数的纯SQL,这个问题相当笨拙。虽然循环通常比使用普通SQL的基于集合的解决方案慢,但是使用plpgsql的过程解决方案可以对表进行单个顺序扫描(FOR循环的隐式游标),并且在这种特定情况下应该大大加快:

Test table:

测试表:

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
 ('Google',  'Programmer', 2000)
,('Google',  'Sales',      2000)
,('Google',  'Sales',      2001)
,('Google',  'Sales',      2002)
,('Google',  'Sales',      2004)
,('Mozilla', 'Sales',      2002);

Function:

功能:

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[]) AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN

FOR r IN
   SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
   IF ( r.company,  r.profession,  r.year)
   <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

      RETURN QUERY
      SELECT r0.company, r0.profession, years; -- output row

      years := ARRAY[r.year];     -- start new array
   ELSE
      years := years || r.year;   -- add to array - year can be NULL, too
   END IF;

   r0 := r;                       -- remember last row
END LOOP;

RETURN QUERY                      -- output last iteration
SELECT r0.company, r0.profession, years;

END
$func$ LANGUAGE plpgsql;

Call:

呼叫:

SELECT * FROM f_periods();

Produces the requested result.

产生请求的结果。

#1


9  

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

@ a_horse_with_no_name的答案有很多价值,作为一个正确的解决方案,就像我在评论中已经说过的那样,是学习如何在PostgreSQL中使用不同类型的窗口函数的好材料。

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

然而,我不禁感到在这个答案中所采用的方法对于像这样的问题来说有点过分了。基本上,在进行数组聚合年之前,您需要的是一个额外的分组标准。你已经有了公司和专业,现在你只需要一些东西来区分属于不同序列的年份。

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

这就是上面提到的答案提供的内容,而这正是我认为可以用更简单的方式完成的。就是这样:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID

#2


18  

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).

识别非连续值总是有点棘手,涉及几个嵌套的子查询(至少我不能提出更好的解决方案)。

The first step is to identify non-consecutive values for the year:

第一步是确定年份的非连续值:

Step 1) Identify non-consecutive values

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

This returns the following result:

这将返回以下结果:

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:

现在使用group_cnt值,我们可以为连续年份的每个组创建“组ID”:

Step 2) Define group IDs

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

This returns the following result:

这将返回以下结果:

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:

正如您所看到的,每个“group”都有自己的group_nr,我们最终可以通过添加另一个派生表来聚合:

Step 3) Final query

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

This returns the following result:

这将返回以下结果:

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

Which is exactly what you wanted, if I'm not mistaken.

如果我没弄错的话,这正是你想要的。

#3


4  

Procedural solution with PL/pgSQL

The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with plpgsql can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:

对于具有聚合/窗口函数的纯SQL,这个问题相当笨拙。虽然循环通常比使用普通SQL的基于集合的解决方案慢,但是使用plpgsql的过程解决方案可以对表进行单个顺序扫描(FOR循环的隐式游标),并且在这种特定情况下应该大大加快:

Test table:

测试表:

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
 ('Google',  'Programmer', 2000)
,('Google',  'Sales',      2000)
,('Google',  'Sales',      2001)
,('Google',  'Sales',      2002)
,('Google',  'Sales',      2004)
,('Mozilla', 'Sales',      2002);

Function:

功能:

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[]) AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN

FOR r IN
   SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
   IF ( r.company,  r.profession,  r.year)
   <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

      RETURN QUERY
      SELECT r0.company, r0.profession, years; -- output row

      years := ARRAY[r.year];     -- start new array
   ELSE
      years := years || r.year;   -- add to array - year can be NULL, too
   END IF;

   r0 := r;                       -- remember last row
END LOOP;

RETURN QUERY                      -- output last iteration
SELECT r0.company, r0.profession, years;

END
$func$ LANGUAGE plpgsql;

Call:

呼叫:

SELECT * FROM f_periods();

Produces the requested result.

产生请求的结果。