
Time: 2020-11-25 15:27:15

I am looking into whether a data analysis algorithm can be implemented using Google Cloud Dataflow. Mind you, I have no experience with Dataflow yet. I am just doing some research on whether it can fulfill my needs.


Part of my algorithm contains some conditional iterations, that is, continue until some condition is met:


PCollection data = ...;
while (needsMoreWork(data)) {
  data = doAStep(data);
}

I have looked around in the documentation, and as far as I can see I am only able to do "iterations" if I know the exact number of iterations before the pipeline starts. In that case my pipeline construction code can just create a sequential pipeline with a fixed number of steps.
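The fixed-count case can be illustrated with a minimal plain-Java sketch that does not use the Dataflow SDK. Here a `UnaryOperator<List<Double>>` stands in for a pipeline, and `doAStep` is a hypothetical step that halves every value; both are assumptions made for illustration. The point is only that the loop runs while the chain is being built, so the number of steps has to be known before any data flows through it.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class UnrolledIterations {
    // Hypothetical step: halves every value. Stands in for one pipeline transform.
    static List<Double> doAStep(List<Double> data) {
        return data.stream().map(x -> x / 2.0).collect(Collectors.toList());
    }

    // Builds a "pipeline" by chaining the step a fixed number of times.
    // The loop executes at construction time, not while data is processed.
    static UnaryOperator<List<Double>> buildPipeline(int iterations) {
        UnaryOperator<List<Double>> pipeline = data -> data;
        for (int i = 0; i < iterations; i++) {
            UnaryOperator<List<Double>> previous = pipeline;
            pipeline = data -> doAStep(previous.apply(data));
        }
        return pipeline;
    }

    public static void main(String[] args) {
        UnaryOperator<List<Double>> pipeline = buildPipeline(3);
        System.out.println(pipeline.apply(List.of(8.0, 16.0))); // prints [1.0, 2.0]
    }
}
```

Because `iterations` is fixed when `buildPipeline` is called, the chain cannot react to what the data looks like halfway through, which is exactly the limitation described above.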


The only "solution" I can think of is to run each iteration in separate pipelines, store the intermediate data in some database, and then decide in my pipeline construction whether or not to launch a new pipeline for the next iteration. This seems to be an extremely inefficient solution!


Are there any good ways to perform this kind of conditional iteration in Google Cloud Dataflow?




1 Solution



For the time being, the two options you've mentioned are both reasonable. You could even combine the two approaches. Create a pipeline which does a few iterations (becoming a no-op if needsMoreWork is false), and then have a main Java program that submits that pipeline multiple times until needsMoreWork is false.
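The combined approach might look roughly like the following plain-Java sketch. It deliberately avoids the Dataflow SDK: `runPipelineOnce` is a hypothetical stand-in for submitting one pipeline and reading back its stored output, `STEPS_PER_RUN` is an assumed batch size, and the bodies of `needsMoreWork` and `doAStep` are invented purely so the example runs.

```java
import java.util.List;
import java.util.stream.Collectors;

public class IterationDriver {
    // Assumed number of iterations packed into each submitted pipeline.
    static final int STEPS_PER_RUN = 3;

    // Hypothetical convergence check: keep going while any value exceeds 1.0.
    static boolean needsMoreWork(List<Double> data) {
        return data.stream().anyMatch(x -> x > 1.0);
    }

    // Hypothetical step: halves every value.
    static List<Double> doAStep(List<Double> data) {
        return data.stream().map(x -> x / 2.0).collect(Collectors.toList());
    }

    // Stand-in for one submitted pipeline: a fixed number of steps,
    // each of which becomes a no-op once needsMoreWork is false.
    static List<Double> runPipelineOnce(List<Double> data) {
        for (int i = 0; i < STEPS_PER_RUN; i++) {
            if (needsMoreWork(data)) {
                data = doAStep(data);
            }
        }
        return data;
    }

    // Driver program: resubmit the same pipeline until no more work is needed.
    static List<Double> runUntilDone(List<Double> data) {
        while (needsMoreWork(data)) {
            data = runPipelineOnce(data);
        }
        return data;
    }

    public static void main(String[] args) {
        System.out.println(runUntilDone(List.of(100.0, 3.0)));
    }
}
```

Packing several steps into each run cuts the number of submissions, at the cost of a few wasted no-op steps in the final run. In a real setup the driver would wait for each job to finish and read the intermediate data from storage before checking the condition again.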


We've seen this use case a few times and hope to address it natively in the future. Native support is being tracked in




