Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)

Time: 2022-11-08 15:23:08

Has anyone run into the same problem I have, where Google Cloud Dataflow BigQueryIO.Write fails with an unknown error (HTTP code 500)?

I use Dataflow to process data for April, May, and June. The same code processed the April data (400MB) and wrote it to BigQuery successfully, but when I process the May (60MB) or June (90MB) data, the job fails.

  • The data format for April, May, and June is the same.
  • If I change the writer from BigQueryIO to TextIO, the job succeeds, so I think the data format is fine.
  • The Log Dashboard shows no error logs.
  • The system only reports the same unknown error.

The code I wrote is here: http://pastie.org/10907947
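
Simplified, the pipeline has roughly this shape (a sketch only, not the full code; the input path, windowing, and schema below are placeholders, and the table reference matches the error message further down):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.BigQueryIO;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
    import com.google.cloud.dataflow.sdk.transforms.windowing.Window;
    import java.util.Arrays;
    import org.joda.time.Duration;

    public class EventToBigQuery {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);

        // Placeholder schema; the real schema depends on the event format.
        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("raw").setType("STRING")));

        p.apply(TextIO.Read.from("gs://YOUR_BUCKET/events/*"))   // placeholder input path
            .apply(Window.<String>into(FixedWindows.of(Duration.standardDays(1))))
            .apply(ParDo.of(new DoFn<String, TableRow>() {
              @Override
              public void processElement(ProcessContext c) {
                // Turn one input line into a TableRow; the real parsing is data-specific.
                c.output(new TableRow().set("raw", c.element()));
              }
            }))
            .apply(BigQueryIO.Write
                .to("lib-ro-123:TESTSET.hi_event_m6")
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }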

Error Message after "Executing BigQuery import job":

Workflow failed. Causes: 
(cc846): S01:Read Files/Read+Window.Into()+AnonymousParDo+BigQueryIO.Write/DataflowPipelineRunner.BatchBigQueryIOWrite/DataflowPipelineRunner.BatchBigQueryIONativeWrite failed., 
(e19a27451b49ae8d): BigQuery import job "dataflow_job_631261" failed., (e19a745a666): BigQuery creation of import job for table "hi_event_m6" in dataset "TESTSET" in project "lib-ro-123" failed., 
(e19a2749ae3f): BigQuery execution failed., 
(e19a2745a618): Error: Message: An internal error occurred and the request could not be completed. HTTP Code: 500

2 Solutions

#1


3  

Sorry for the frustration. It looks like you are hitting a limit on the number of files being written to BQ. This is a known issue that we're in the process of fixing.

In the meantime, you can work around this issue by either decreasing the number of input files or resharding the data (do a GroupByKey and then ungroup the data -- semantically it's a no-op, but it forces the data to be materialized so that the parallelism of the write operation isn't constrained by the parallelism of the read).

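In the Java 1.x SDK the reshard can look roughly like this (a sketch only; rows stands in for whatever PCollection<TableRow> you currently pass to BigQueryIO.Write, and the key range of 512 is arbitrary):

    import com.google.api.services.bigquery.model.TableRow;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.KV;
    import com.google.cloud.dataflow.sdk.values.PCollection;
    import java.util.concurrent.ThreadLocalRandom;

    public class ReshardUtil {
      /** Semantically a no-op: key, group, and ungroup the rows to force materialization. */
      public static PCollection<TableRow> reshard(PCollection<TableRow> rows) {
        return rows
            // Attach an arbitrary key so GroupByKey can redistribute the elements.
            .apply(ParDo.of(new DoFn<TableRow, KV<Integer, TableRow>>() {
              @Override
              public void processElement(ProcessContext c) {
                c.output(KV.of(ThreadLocalRandom.current().nextInt(512), c.element()));
              }
            }))
            .apply(GroupByKey.<Integer, TableRow>create())
            // Ungroup: emit every buffered value again and drop the temporary key.
            .apply(ParDo.of(new DoFn<KV<Integer, Iterable<TableRow>>, TableRow>() {
              @Override
              public void processElement(ProcessContext c) {
                for (TableRow row : c.element().getValue()) {
                  c.output(row);
                }
              }
            }));
      }
    }

Then apply BigQueryIO.Write to the resharded collection instead of the original one.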

#2


2  

Dataflow SDK for Java 1.x: as a workaround, you can enable this experiment: --experiments=enable_custom_bigquery_sink

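For example, if your pipeline builds its options from the command-line arguments, the flag is picked up together with the usual Dataflow options (a sketch; the class name and staging bucket are placeholders):

    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

    public class LaunchWithExperiment {
      public static void main(String[] args) {
        // Launch with something like:
        //   --project=lib-ro-123 --runner=DataflowPipelineRunner
        //   --stagingLocation=gs://YOUR_BUCKET/staging
        //   --experiments=enable_custom_bigquery_sink
        // PipelineOptionsFactory parses --experiments along with the other options.
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        Pipeline p = Pipeline.create(options);
        // ... build the same read / transform / BigQueryIO.Write pipeline here ...
        p.run();
      }
    }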

In Dataflow SDK for Java 2.x, this behavior is default and no experiments are necessary.

Note that in both versions, temporary files in GCS may be left over if your job fails.

Hope that helps!
