Is it possible to concatenate string fields after GROUP BY in Hive?

Date: 2021-04-09 07:57:26

I am evaluating Hive and need to do some string field concatenation after group by. I found a function named "concat_ws" but it looks like I have to explicitly list all the values to be concatenated. I am wondering if I can do something like this with concat_ws in Hive. Here is an example. So I have a table named "my_table" and it has two fields named country and city. I want to have only one record per country and each record will have two fields - country and cities:

select country, concat_ws(city, "|") as cities
from my_table
group by country

Is this possible in Hive? I am currently using Hive 0.11 from CDH5.

1 solution

#1


In database management an aggregate function is a function where the values of multiple rows are grouped together as input on certain criteria to form a single value of more significant meaning or measurement such as a set, a bag or a list.

Source: Aggregate function - Wikipedia

Hive's out-of-the-box aggregate functions are listed on the following web page:
Built-in Aggregate Functions (UDAF - user-defined aggregate function)

So, the only built-in option (for Hive 0.11; for Hive 0.13 and above you have collect_list) is:
array collect_set(col)

This will satisfy your request as long as there are no duplicate city records per country that you need to keep, because collect_set returns a set of objects with duplicate elements eliminated. Otherwise, write your own UDAF or do the aggregation outside of Hive.
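
To tie this back to the query in the question, here is a minimal sketch of how collect_set might be combined with concat_ws. It assumes the my_table(country, city) layout from the question, that city is a string column, and that the concat_ws overload taking a separator followed by an array of strings is available in your Hive version. Note that the separator is the first argument, not the last as in the question's query.

```sql
-- Sketch only: one row per country, with distinct cities joined by '|'.
-- Assumes my_table(country STRING, city STRING) as described in the question.
SELECT
  country,
  concat_ws('|', collect_set(city)) AS cities  -- separator first, then the array
FROM my_table
GROUP BY country;
```

The order of the cities inside each joined string should not be relied on. On Hive 0.13 and above, replacing collect_set with collect_list in the same query would keep duplicate cities instead of eliminating them.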

References for writing UDAF:
