BigQuery GROUP_CONCAT and ORDER BY

Time: 2022-06-28 21:09:15

I am currently using BigQuery with GROUP_CONCAT, which works perfectly fine. However, when I try to add an ORDER BY clause to the GROUP_CONCAT call, as I would in other SQL dialects, I receive an error.

So e.g., something like

SELECT a, GROUP_CONCAT(b ORDER BY c) FROM test GROUP BY a

The same happens if I try to specify the separator.

Any ideas on how to approach this?

2 Answers

#1


4  

Since BigQuery doesn't support an ORDER BY clause inside the GROUP_CONCAT function, this functionality can be achieved with analytic (window) functions. The separator for GROUP_CONCAT in BigQuery is simply the function's second parameter. The example below illustrates both:

select key, first(grouped_value) concat_value
from (
  select
    key,
    group_concat(value, ':') over
      (partition by key
       order by value asc
       rows between unbounded preceding and unbounded following)
    grouped_value
  from (
    select key, value from
      (select 1 as key, 'b' as value),
      (select 1 as key, 'c' as value),
      (select 1 as key, 'a' as value),
      (select 2 as key, 'y' as value),
      (select 2 as key, 'x' as value)))
group by key

This will produce the following:

Row key concat_value     
1   1   a:b:c    
2   2   x:y

Note on the window specification: The query uses the "rows between unbounded preceding and unbounded following" window specification to make sure that all rows within a partition participate in the GROUP_CONCAT aggregation. Per the SQL standard, the default window specification is "rows between unbounded preceding and current row", which is good for things like a running sum but won't work correctly for this problem.
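To make the difference between the two window frames concrete, here is a minimal Python sketch that models the behaviour on the same sample rows. It is not BigQuery code: the function `windowed_concat` and its `full_frame` flag are hypothetical names invented for this illustration, and the sketch only imitates the frame semantics described above.

```python
from itertools import groupby

# Same sample data as the query: (key, value) pairs.
rows = [(1, "b"), (1, "c"), (1, "a"), (2, "y"), (2, "x")]

def windowed_concat(rows, sep=":", full_frame=True):
    """Return one (key, concatenated_value) pair per input row,
    imitating a window function partitioned by key and ordered by value."""
    out = []
    ordered = sorted(rows)  # groups rows by key and orders values ascending
    for key, group in groupby(ordered, key=lambda r: r[0]):
        values = [v for _, v in group]
        for i in range(len(values)):
            # Full frame: every row sees the whole partition.
            # Running frame: each row sees only rows up to and including itself.
            frame = values if full_frame else values[: i + 1]
            out.append((key, sep.join(frame)))
    return out

full = windowed_concat(rows, full_frame=True)
running = windowed_concat(rows, full_frame=False)
print(full)
print(running)
```

With the full frame every row of a partition carries the same complete string, so picking any one of them per key gives the desired result. With the running frame the rows of key 1 carry `a`, `a:b` and `a:b:c`, so picking the first one would return only a partial concatenation.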

Performance note: Although recomputing the aggregate function for every row looks wasteful, the BigQuery optimizer recognizes that the window does not change and the result will therefore be the same, so it computes the aggregation only once per partition.

#2


1  

Standard SQL mode in BigQuery does support an ORDER BY clause within some aggregate functions, including STRING_AGG. For example:

#standardSQL
select string_agg(t.x order by t.y) 
from unnest([struct<x STRING, y INT64>('a', 5), ('b', 1), ('c', 10)]) t

will result in

b,a,c
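As a rough illustration of what the ORDER BY inside the aggregate does, the following Python sketch models the same three (x, y) pairs: sort by y, then join the x values. It is not BigQuery code; the helper `string_agg` is a hypothetical name for this example, and its comma default simply mirrors the output shown above.

```python
# Same (x, y) pairs as in the query above.
rows = [("a", 5), ("b", 1), ("c", 10)]

def string_agg(rows, sep=","):
    """Join the x values after ordering the rows by y,
    imitating an aggregate with an ORDER BY clause."""
    ordered = sorted(rows, key=lambda r: r[1])  # order by y ascending
    return sep.join(x for x, _ in ordered)

result = string_agg(rows)
print(result)
```

Passing a different `sep` in this sketch corresponds to supplying a separator as the aggregate's second argument.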

Documentation is here: https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#using-order-by-with-aggregate-functions
