I have been attempting to run an Apache Beam job on Dataflow, but I'm getting an error from GCP with the following message:
The job graph is too large. Please try again with a smaller job graph, or split your job into two or more smaller jobs.
I have run jobs with larger graphs in the past and had no problems. The job also runs fine locally with DirectRunner. There are about 12 nodes in the graph, including a read from BigQuery step, a WriteToText step, and a CoGroupByKey step.
Is there a way to increase the graph size Dataflow is willing to accept?
1 Answer
#1
With a small pipeline, the most likely cause of this is accidentally serializing extra data into your DoFns (or other serialized code). Do you have any large objects in your main class that are being automatically included in closures? If so, the easiest thing to do is to build up your pipeline in a static function.
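Here is a minimal Python SDK sketch of the problem and the suggested fix; the class, names, and data sizes (JobBuilder, big_lookup, build_pipeline) are hypothetical and chosen only for illustration:

```python
import apache_beam as beam


class JobBuilder:
    # Hypothetical anti-pattern: the object that builds the pipeline
    # holds a large in-memory table.
    def __init__(self):
        self.big_lookup = {i: str(i) for i in range(5_000_000)}

    def build(self, p):
        # The lambda closes over `self`, so the whole object,
        # big_lookup included, gets pickled into the job graph.
        return (
            p
            | "Create" >> beam.Create([1, 2, 3])
            | "Enrich" >> beam.Map(lambda x: (x, self.big_lookup.get(x)))
        )


def build_pipeline(p):
    # What the answer suggests: construct the pipeline in a standalone
    # ("static") function, so closures capture only what they need.
    small_config = {"suffix": "!"}
    return (
        p
        | "Create" >> beam.Create([1, 2, 3])
        | "Tag" >> beam.Map(lambda x, cfg=small_config: (x, cfg["suffix"]))
    )
```

The same idea applies in Java, where the "static function" advice originates: an anonymous DoFn defined in an instance method keeps a reference to the enclosing object and serializes it along with the graph, while one created inside a static method does not.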
It's not possible to raise the graph size limit.