I'm having trouble transposing a large BigQuery table (1.5 billion rows) from rows to columns. I could figure out how to do it with a small amount of data when the columns are hardcoded, but not with this much data. A snapshot of the table looks like this:
+------------+---------+-------+
| CustomerID | Feature | Value |
+------------+---------+-------+
| 1          | A123    | 3     |
| 1          | F213    | 7     |
| 1          | F231    | 8     |
| 1          | B789    | 9.1   |
| 2          | A123    | 4     |
| 2          | U123    | 4     |
| 2          | B789    | 12    |
| ..         | ..      | ..    |
| ..         | ..      | ..    |
| 400000     | A123    | 8     |
| 400000     | U123    | 7     |
| 400000     | R231    | 6     |
+------------+---------+-------+
So basically there are approximately 400,000 distinct CustomerIDs and 3,000 features, and not every CustomerID has the same features: some may have 2,000 features while others have 3,000. The result I would like is a table in which each row represents one distinct CustomerID, with 3,000 columns covering all the features, like this:
CustomerID Feature1 Feature2 ... Feature3000
So some of the cells may have missing values.
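To make the target shape concrete, here is a minimal sketch in plain Python that pivots the first few sample rows from the snapshot. It only illustrates the intended output on a tiny input and is not a way to process 1.5 billion rows; the variable names (`rows`, `features`, `pivoted`) are made up for this illustration, and missing cells are represented as `None`.

```python
from collections import defaultdict

# Sample rows copied from the snapshot: (CustomerID, Feature, Value).
rows = [
    (1, "A123", 3), (1, "F213", 7), (1, "F231", 8), (1, "B789", 9.1),
    (2, "A123", 4), (2, "U123", 4), (2, "B789", 12),
]

# Every distinct feature becomes one output column.
features = sorted({feature for _, feature, _ in rows})

# Collect each customer's values keyed by feature name.
by_customer = defaultdict(dict)
for customer_id, feature, value in rows:
    by_customer[customer_id][feature] = value

# One output row per customer; features the customer lacks become None.
pivoted = {
    customer_id: [values.get(feature) for feature in features]
    for customer_id, values in by_customer.items()
}

print(["CustomerID"] + features)
for customer_id, cells in sorted(pivoted.items()):
    print([customer_id] + cells)
```

Customer 1 has no U123 value and customer 2 has no F213 or F231 value, so those cells come out empty.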
Does anyone have an idea how to do this in BigQuery or SQL?
Thanks in advance.
2 Answers
#1
5
STEP #1
In the query below, replace yourTable with the real name of your table and run it:
SELECT 'SELECT CustomerID, ' +
GROUP_CONCAT_UNQUOTED(
'MAX(IF(Feature = "' + STRING(Feature) + '", Value, NULL))'
)
+ ' FROM yourTable GROUP BY CustomerID'
FROM (SELECT Feature FROM yourTable GROUP BY Feature)
The result is a string to be used in the next step.
STEP #2
Take the string you got from Step 1 and execute it as a query. The output is the pivot you asked for in the question.
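To see what the string from Step 1 looks like, here is a rough Python sketch that assembles the same kind of query text from a list of feature names. `build_pivot_query` is a hypothetical helper written for this illustration, not part of any BigQuery library, and it assumes the generated expressions are joined with commas, as the Step 1 query appears to do. It also assumes the feature names contain no double quotes.

```python
def build_pivot_query(features, table="yourTable"):
    """Build pivot query text for the given feature names (illustrative helper)."""
    # One conditional aggregate per feature: it keeps Value only on rows
    # whose Feature matches, so a customer without that feature gets NULL.
    columns = [
        f'MAX(IF(Feature = "{name}", Value, NULL))'
        for name in features
    ]
    return (
        "SELECT CustomerID, "
        + ",".join(columns)
        + f" FROM {table} GROUP BY CustomerID"
    )


# Two features from the question's snapshot, for a short readable result.
query = build_pivot_query(["A123", "B789"])
print(query)
```

With 3,000 features the generated text contains 3,000 such expressions, which is why it is produced by a query rather than typed by hand.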
#2
0
Hi @Jade, I posted a very similar question before and got a very helpful (and similar) answer from @MikhailBerlyant. For what it's worth, I had about 4,000 features to dummify in my case and also ran into the "Resources exceeded during query execution" error.
I think that this type of large-scale data transformation (rather than a query) is better left to tools more suitable for the task (such as Spark).