Basically, we want to split a big (billions of rows) BigQuery table into a large number (around 100k) of smaller tables based on the value of a particular column (not a date). I can't figure out how to do this efficiently in BigQuery itself, so I am thinking of using Dataflow.
With Dataflow, we can first load the data from BigQuery, then create a key-value pair for each record, where the key is the value of the particular column we want to split the table on. We can then group the records by key, so after this operation we have a PCollection of (key, [records]) pairs. We would then need to write each group back to a BigQuery table, where the table name could be <key>_table.
So the operation would be: p | beam.io.Read(beam.io.BigQuerySource(...)) | beam.Map(lambda record: (record['splitcol'], record)) | beam.GroupByKey() | beam.io.Write(beam.io.BigQuerySink(...))
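For concreteness, here is a minimal sketch of that pipeline in the Python SDK. The project, dataset, table, and column names are placeholders:

import apache_beam as beam

with beam.Pipeline() as p:
    grouped = (
        p
        | 'Read' >> beam.io.Read(beam.io.BigQuerySource(
            table='my-project:my_dataset.big_table'))  # placeholder table
        | 'KeyBySplitCol' >> beam.Map(lambda record: (record['splitcol'], record))
        | 'GroupByKey' >> beam.GroupByKey()
    )
    # Open question: the final write would have to route each (key, records)
    # pair to its own table named '<key>_table', which
    # beam.io.Write(beam.io.BigQuerySink(...)) cannot do, since the sink
    # takes a single fixed table name.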
The key question now is: how do I write to different tables in the last step, based on the key of each element in the PCollection?
This question is related to another question: Writing different values to different BigQuery tables in Apache Beam. But I am a Python guy, and I'm not sure whether the same solution is possible in the Python SDK.
1 Answer
#1
Currently this feature (value-dependent BigQueryIO.write()) is only supported in Beam Java. Unfortunately I can't think of an easy way to mimic it using Beam Python, short of reimplementing the respective Java code. Please feel free to open a JIRA feature request.
I guess the simplest thing that comes to mind is writing a DoFn to manually write your rows to the respective tables, using the BigQuery streaming insert API (rather than the Beam BigQuery connector). However, keep in mind that streaming inserts are more expensive and subject to stricter quota policies than bulk imports (which are used by the Java BigQuery connector when writing a bounded PCollection).
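For illustration, here is a rough sketch of such a DoFn using the google-cloud-bigquery client's streaming insert API (assuming a reasonably recent client version). The dataset name is a placeholder, the ~100k destination tables are assumed to already exist with the right schema (streaming inserts do not create tables), and the key values are assumed to be legal in table names:

import apache_beam as beam

class WriteToKeyedTable(beam.DoFn):
    """Streams each (key, records) group into its own '<key>_table'."""

    def __init__(self, dataset):
        self.dataset = dataset  # placeholder, e.g. 'my-project.my_dataset'

    def start_bundle(self):
        # Create the client lazily so it isn't pickled with the DoFn.
        from google.cloud import bigquery
        self.client = bigquery.Client()

    def process(self, element):
        key, records = element
        table_id = '%s.%s_table' % (self.dataset, key)
        # Rows are assumed to be JSON-serializable dicts, as produced by
        # reading from BigQuery.
        rows = list(records)
        # Insert in chunks to stay under the per-request row limit.
        for i in range(0, len(rows), 500):
            errors = self.client.insert_rows_json(table_id, rows[i:i + 500])
            if errors:
                raise RuntimeError('Insert failed for %s: %s' % (table_id, errors))

The last step of the pipeline would then become | 'StreamToTables' >> beam.ParDo(WriteToKeyedTable('my-project.my_dataset')) in place of beam.io.Write(beam.io.BigQuerySink(...)).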
There is also work happening in Beam on allowing reuse of transforms across languages - a design is being discussed at https://s.apache.org/beam-mixed-language-pipelines. When that work is completed, you would be able to use the Java BigQuery connector from a Python pipeline.