Let's say I have a music video play stats table mydataset.stats for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String
I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
with the artist play count for the respective user
An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn’t scale for the numbers I have in my example
Can this approach be scaled for my example?
1 Answer
I tried the approach below for up to 6000 features and it worked as expected. I believe it will work for up to 10K features, which is the hard limit on the number of columns in a table
STEP 1 - Aggregate plays by user / artist
SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays
FROM [mydataset.stats] GROUP BY 1, 2
STEP 2 – Normalize uid and aid so they are consecutive numbers 1, 2, 3, … .
We need this for at least two reasons: a) to make the dynamically generated SQL later on as compact as possible, and b) to have more usable/friendly column names
Combined with the first step, it will be:
SELECT u.uid AS uid, a.aid AS aid, plays
FROM (
SELECT userGUID, artistGUID, COUNT(1) AS plays
FROM [mydataset.stats]
GROUP BY 1, 2
) AS s
JOIN (
SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID
Let’s write the output to the table mydataset.aggs
STEP 3 – Use the approach already suggested (in the above-mentioned questions) for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well for between 2000 and 3000 features. To be on the safe side I decided to use 2000 features at a time
The script below is used to dynamically generate a query that is then run to create the partitioned tables
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)
The query above produces yet another query, like the one below:
SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid
This should be run and the result written to mydataset.pivot_1_2000
Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) we get two more tables, mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000.
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
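For example, the generator for the second batch would differ only in the HAVING range (a sketch; its output query would then be written to mydataset.pivot_2001_4000):
SELECT 'SELECT uid,' +
GROUP_CONCAT_UNQUOTED(
'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid)
)
+ ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 2000 and aid < 4001)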
STEP 4 – Merge all partitioned pivot tables into the final pivot table, with all features represented as columns in one table
Same as in the steps above: first we need to generate the query and then run it. So, initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, and then that result with mydataset.pivot_4001_6000
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)
The output string from above should be run and the result written to mydataset.pivot_1_4000
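For reference, the generated merge query (abbreviated sketch) would look something like:
SELECT x.uid uid,a1,a2, . . . ,a2000,a2001, . . . ,a4000
FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid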
Then we repeat STEP 4, as below:
SELECT 'SELECT x.uid uid,' +
GROUP_CONCAT_UNQUOTED(
'a' + STRING(aid)
)
+ ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)
The result should be written to mydataset.pivot_1_6000
The resulting table has the following schema:
uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
NOTE:
a. I tried this approach only up to 6000 features and it worked as expected
b. Run time for the second/main queries in steps 3 and 4 varied from 20 to 60 minutes
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables’ sizes are relatively small (30-40 MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue. For more information, see Timing in High-Compute queries
d. The above example shows the power of large-scale data transformation with BigQuery! Still, I think (but I could be wrong) that storing a materialized feature matrix is not the best idea