Transforming data in Google BigQuery: extract text, split it into multiple columns, and pivot the data

Time: 2022-12-18 15:48:18

I have some weblog data in BigQuery that I need to transform to make it easier to use and query. The data looks like this:

[Image: sample of the weblog data]

I want to extract and transform the data inside the curly braces after Results{...} (colored blue). Each entry has the form '(\d+((PQ)|(KL))+\d+)', and the result array can hold anywhere from 1 to 20 or more entries. I am only interested in the first 16 entries.

I have been able to extract the data inside the curly braces into a new column using SUBSTR and REGEXP_EXTRACT, but I'm unable to split it into columns (sometimes there is only one result, so the delimiter "," is missing). I'm new to regex; maybe I can use something like '(\d+((PQ)|(KL))+\d+){1}' to split the data into multiple columns and then pivot it.

Ideal output in my case would be to transform it into something like:

[Image: desired output table]

In the above solution, each row of the original table is repeated 1 to 16 times, depending on the number of items in the Results array.

I'm not completely sure whether this is possible in BigQuery. I'd be grateful if anyone could help me out here.

If this is not possible, I could instead have 16 rows for every event, with NULL values in Event_details where the result array has fewer than 16 entries.

If neither of these is possible, the last option would be to transform it into something like:

[Image: alternative output table]

The reason I want to transform the data is that in most cases I need to find which result array items appear and in what order.

2 Answers

#1


2  

Check this out: "Split string into multiple columns with bigquery". In their case the string is delimited by spaces; replace the \s with ','.

Something like:

SELECT
  -- {N} skips N comma-terminated entries after the opening brace;
  -- the capture group then returns entry N (NULL if it does not match the pattern).
  REGEXP_EXTRACT(StringToParse, r'\{(?:[^,]*,){0}(\d+(?:PQ|KL)+\d+)') AS Word0,
  REGEXP_EXTRACT(StringToParse, r'\{(?:[^,]*,){1}(\d+(?:PQ|KL)+\d+)') AS Word1,
  REGEXP_EXTRACT(StringToParse, r'\{(?:[^,]*,){2}(\d+(?:PQ|KL)+\d+)') AS Word2,
  REGEXP_EXTRACT(StringToParse, r'\{(?:[^,]*,){3}(\d+(?:PQ|KL)+\d+)') AS Word3
FROM
  (SELECT 'bla{1234PQ5,6789KL0,1234PQ5,6789KL0,123' AS StringToParse)
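
As a possible alternative to repeating the regex once per column, the braces can be extracted once and the result split into an array, which is then indexed by position. This is only a sketch, assuming BigQuery standard SQL; the sample rows and the "Results" prefix in the pattern are made up for illustration. SAFE_OFFSET returns NULL instead of raising an error when an entry is missing, so a row with a single entry and no comma still works.

```sql
-- Hypothetical sample rows; StringToParse mirrors the column name used above.
WITH sample AS (
  SELECT 'bla Results{1234PQ5,6789KL0,1234PQ5,6789KL0}' AS StringToParse
  UNION ALL
  SELECT 'bla Results{42KL7}'  -- single entry, so no comma delimiter
),
parsed AS (
  SELECT
    StringToParse,
    -- Capture everything between the braces, then split it on commas.
    SPLIT(REGEXP_EXTRACT(StringToParse, r'Results\{([^}]*)\}'), ',') AS items
  FROM sample
)
SELECT
  StringToParse,
  items[SAFE_OFFSET(0)] AS Word0,  -- NULL when the array has no such position
  items[SAFE_OFFSET(1)] AS Word1,
  items[SAFE_OFFSET(2)] AS Word2,
  items[SAFE_OFFSET(3)] AS Word3
FROM parsed
```

Extending this to the first 16 entries would mean adding columns up to SAFE_OFFSET(15).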

#2


1  

Use SPLIT()

SELECT Event_ID, Event_UserID, Event_SessionID, Keyword,
  -- Capture the text between the braces after "Results", then split it on commas
  SPLIT(REGEXP_EXTRACT(Event_details, r'Results\{([^}]*)\}'), ',') AS Event_details_item
FROM mydata.mytable
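
If the query is run in BigQuery standard SQL, SPLIT returns an ARRAY, so getting one row per item (the layout the question asks for) takes an explicit UNNEST. The following is a sketch under that assumption; the table and column names are taken from the query above, while Event_details_position is a hypothetical output column added here to keep the order of the entries.

```sql
-- Standard SQL sketch: one output row per entry in the Results array.
SELECT
  t.Event_ID,
  t.Event_UserID,
  t.Event_SessionID,
  t.Keyword,
  pos + 1 AS Event_details_position,  -- hypothetical column: 1-based position in the array
  item AS Event_details_item
FROM mydata.mytable AS t
CROSS JOIN UNNEST(
  SPLIT(REGEXP_EXTRACT(t.Event_details, r'Results\{([^}]*)\}'), ',')
) AS item WITH OFFSET AS pos
WHERE pos < 16  -- keep only the first 16 entries
ORDER BY t.Event_ID, pos
```

With CROSS JOIN, events whose Event_details has no Results{...} block drop out of the output. To always emit 16 rows per event, padded with NULL, one option is to unnest GENERATE_ARRAY(0, 15) instead and look up each entry with SAFE_OFFSET.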
