I have some weblog data in BigQuery that I need to transform to make it easier to use and query. The data looks like:
I want to extract and transform the data within the curly braces after Results{…..} (colored blue). Each entry is of the form ‘(\d+((PQ)|(KL))+\d+)’, and there can be anywhere from 1 to 20+ entries in the result array. I am only interested in the first 16 entries.
I have been able to extract the data within the curly braces into a new column, using SUBSTR and REGEXP_EXTRACT. But I'm unable to SPLIT it into columns (sometimes there is only one result, so the "," delimiter is missing). I'm new to regex; maybe I can use something like ‘(\d+((PQ)|(KL))+\d+){1}’ etc. to split the data into multiple columns and then pivot it.
Ideal output in my case would be to transform it into something like:
In the above solution, each row in the original table is repeated 1 to 16 times, depending on the number of items in the Results array.
I’m not completely sure whether it’s possible to do this in BigQuery. I’d be grateful if anyone could help me out a little here.
If this is not possible, then I can have 16 rows for every event, with NULL values in Event_details for cases where there are fewer than 16 entries in the result array.
In case both of these are not possible, the last solution would be to have it transformed into something like:
The reason I want to transform the data is that, in most cases, I need to find which result array items appear and in what order.
2 Answers
#1
2
Check this out: "Split string into multiple columns with bigquery". In their case the string is delimited by spaces; replace the \s with ','.
Something like:
-- After the opening brace, skip N comma-terminated items, then capture the next one.
SELECT
  REGEXP_EXTRACT(StringToParse, r'^.*\{(?:[^,]*,){0}(\d+(?:PQ|KL)+\d+)') AS Word0,
  REGEXP_EXTRACT(StringToParse, r'^.*\{(?:[^,]*,){1}(\d+(?:PQ|KL)+\d+)') AS Word1,
  REGEXP_EXTRACT(StringToParse, r'^.*\{(?:[^,]*,){2}(\d+(?:PQ|KL)+\d+)') AS Word2,
  REGEXP_EXTRACT(StringToParse, r'^.*\{(?:[^,]*,){3}(\d+(?:PQ|KL)+\d+)') AS Word3
FROM
  (SELECT 'bla{1234PQ5,6789KL0,1234PQ5,6789KL0,123' AS StringToParse)
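If the query runs in BigQuery Standard SQL, the same "one column per position" result can probably be had without repeating the regex for every column. The following is only a sketch under that assumption: the inline sample row, the CTE names (sample, parsed) and the items alias are made up for illustration. It extracts the brace contents once, splits on commas, and reads each position with SAFE_OFFSET, which returns NULL when the array is shorter than the requested index. A single result with no comma simply becomes a one-element array.

```sql
-- Standard SQL sketch; the sample row and the names sample/parsed/items are illustrative.
WITH sample AS (
  SELECT 'bla Results{1234PQ5,6789KL0,1234PQ5}' AS StringToParse
),
parsed AS (
  -- Extract the text between the braces once, then split it on commas.
  SELECT SPLIT(REGEXP_EXTRACT(StringToParse, r'Results\{([^}]*)\}'), ',') AS items
  FROM sample
)
SELECT
  items[SAFE_OFFSET(0)] AS Word0,
  items[SAFE_OFFSET(1)] AS Word1,
  items[SAFE_OFFSET(2)] AS Word2,
  items[SAFE_OFFSET(3)] AS Word3  -- NULL for this sample, which has only three items
FROM parsed;
```

Extending this to the first 16 entries would mean listing offsets 0 through 15.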
#2
1
Use SPLIT()
SELECT Event_ID, Event_UserID, Event_SessionID, Keyword,
  SPLIT(REGEXP_EXTRACT(Event_details, r"Results\{([^}]*)\}"), ",") AS Event_details_item
FROM mydata.mytable
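In Standard SQL, SPLIT() returns an ARRAY rather than flattening into repeated rows, so getting one row per item together with its order needs an explicit UNNEST. A minimal sketch, assuming Standard SQL and using an inline sample in place of mydata.mytable (the sample values and the pos alias are invented for illustration):

```sql
-- Standard SQL sketch; the sample rows stand in for mydata.mytable.
WITH mytable AS (
  SELECT 1 AS Event_ID, 'x Results{1234PQ5,6789KL0} y' AS Event_details
  UNION ALL
  SELECT 2, 'x Results{42KL7} y'  -- single result, no comma delimiter
)
SELECT
  Event_ID,
  pos + 1 AS item_order,          -- 1-based position within the Results array
  item AS Event_details_item
FROM mytable,
  UNNEST(SPLIT(REGEXP_EXTRACT(Event_details, r'Results\{([^}]*)\}'), ',')) AS item
  WITH OFFSET AS pos
WHERE pos < 16                    -- keep only the first 16 entries
ORDER BY Event_ID, pos;
```

Rows whose Event_details has no Results{...} block drop out here, because the comma join against an empty UNNEST produces no rows; a LEFT JOIN UNNEST(...) would keep them with NULL items.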