We have a pipeline that looks like:
BigQuery -> ParDo -> BigQuery
The table has ~2B rows, and is just under 1TB.
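For reference, a minimal sketch of that shape, assuming the Java Dataflow SDK; the project, dataset, and table names are placeholders, and the per-row logic is omitted:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class BigQueryToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(BigQueryIO.Read.from("my-project:dataset.input_table"))   // ~2B rows, just under 1TB
     .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
       @Override
       public void processElement(ProcessContext c) {
         TableRow row = c.element();
         // ... per-row transformation (omitted) ...
         c.output(row);
       }
     }))
     .apply(BigQueryIO.Write
         .to("my-project:dataset.output_table")                       // placeholder table spec
         // assumes the output table already exists
         .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
         .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}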
After running for just over 8 hours, the job failed with the following error:
May 19, 2015, 10:09:15 PM
S09: (f5a951d84007ef89): Workflow failed. Causes: (f5a951d84007e064): BigQuery job "dataflow_job_17701769799585490748" in project "gdfp-xxxx" finished with error(s): job error: Sources are too large. Limit is 5.00Ti., error: Sources are too large. Limit is 5.00Ti.
Job id is: 2015-05-18_21_04_28-9907828662358367047
It's a big table, but it's not that big, and Dataflow should easily be able to handle it. Why can't it handle this use case?
Also, even though the job failed, the diagram still shows it as successful. Why?
1 Answer
#1
I think that error means the data you are trying to write to BigQuery exceeds the 5TB limit set by BigQuery for a single import job.
One way to work around this limit might be to split your BigQuery writes into multiple jobs by having multiple Write transforms so that no Write transform receives more than 5TB.
Before your write transform, you could have a DoFn with N outputs and randomly assign each record to one of them. Each of the N outputs can then have its own BigQuery.Write transform. The write transforms can all append to the same table, so all of the data still ends up in the same table (see the sketch below).
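Here is a minimal sketch of that approach, assuming the Java Dataflow SDK and using its built-in Partition transform in place of a hand-rolled multi-output DoFn; the shard count and table spec are assumptions you would adjust:

import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.transforms.Partition;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;
import java.util.concurrent.ThreadLocalRandom;

public class ShardedBigQueryWrite {
  // Assumption: pick NUM_SHARDS so that (total output size / NUM_SHARDS) stays well under 5TB.
  private static final int NUM_SHARDS = 4;

  /** Writes `rows` to BigQuery as NUM_SHARDS separate import jobs, all appending to one table. */
  static void writeSharded(PCollection<TableRow> rows) {
    // Randomly assign each record to one of NUM_SHARDS outputs.
    PCollectionList<TableRow> shards = rows.apply(
        Partition.of(NUM_SHARDS, new Partition.PartitionFn<TableRow>() {
          @Override
          public int partitionFor(TableRow row, int numPartitions) {
            return ThreadLocalRandom.current().nextInt(numPartitions);
          }
        }));

    // One BigQuery.Write per shard; each becomes its own BigQuery load job,
    // and they all append to the same (already existing) table.
    for (int i = 0; i < shards.size(); i++) {
      shards.get(i).apply(
          BigQueryIO.Write
              .named("WriteShard" + i)
              .to("my-project:dataset.output_table")  // placeholder table spec
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    }
  }
}

Since the 5TB limit applies to the data being loaded into BigQuery, choose the number of shards based on the size of the output you are writing, not the size of the input table.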