Triggering an Apache Beam (Python) job from a GAE cron job

Time: 2021-04-21 15:34:57

In replacing my old appengine-mapreduce job, I need a way to trigger this Python Dataflow job from my cron job.


I have read https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions , but am unclear on the full translation for Python.


Cloud Functions do not have Python installed, and I'm not sure whether or how a portable Python could be installed. So I assume triggering from my Managed VM Python instance will be easier. As far as I can tell, it will be something like this:


  • I am using GAE Flexible VMs (no sandbox).
  • I can include the apache_beam libraries (to run my_dataflow.py) in my Docker image.
  • I can upload these files with my project push so that they are accessible from the VM disk: my_dataflow.py, setup.py (which installs my library dependencies), and apache-beam.tar.gz (since I'm writing against the 0.7.0 API, which is not yet released on PyPI).
  • I can call my_dataflow.run() pointing PipelineOptions at the setup.py and apache-beam.tar.gz.

Is that it, or am I missing any other steps? Hoping to avoid barking up the wrong tree here, and worried about running into known impassable roadblocks after spending a few hours pushing-and-repushing trying to get this working.
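For concreteness, here is a rough sketch of the last step in the list above. It assumes that my_dataflow.run() accepts an argv-style list and hands it to PipelineOptions, and that the files sit side by side in one application directory; the project, bucket, directory, and job names are placeholders, not values from my setup.

```python
import os


def build_pipeline_args(project, bucket, app_dir, job_name):
    """Build the argv-style flags that PipelineOptions would parse.

    All four parameters are hypothetical placeholders for illustration.
    """
    return [
        "--runner=DataflowRunner",
        "--project={}".format(project),
        "--job_name={}".format(job_name),
        "--staging_location=gs://{}/staging".format(bucket),
        "--temp_location=gs://{}/temp".format(bucket),
        # setup.py installs the pipeline's library dependencies on the workers.
        "--setup_file={}".format(os.path.join(app_dir, "setup.py")),
        # Point the workers at the uploaded SDK tarball instead of PyPI.
        "--sdk_location={}".format(os.path.join(app_dir, "apache-beam.tar.gz")),
    ]


def launch(project, bucket, app_dir, job_name):
    # Imported lazily so this module still loads where the pipeline code
    # is not installed. Assumption: my_dataflow exposes run(argv).
    import my_dataflow

    my_dataflow.run(build_pipeline_args(project, bucket, app_dir, job_name))
```

Keeping the flag construction separate from the launch call means the flags can be checked without apache_beam installed.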


1 solution



Yes, templates are currently Java only.


You may be able to use this technique to invoke your pipeline periodically instead. It doesn't use a template pipeline; it launches a normal pipeline. You can set up a Cloud Function that launches the pipeline by running a subprocess. There are various ways of invoking the Cloud Function; this one uses the App Engine cron service.
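The same subprocess idea can be sketched for a Python web handler on the App Engine side. This is only an illustration under assumptions: it uses a plain WSGI callable from the standard library, the command list is a placeholder, and make_app is a hypothetical helper name. It relies on App Engine cron requests carrying the `X-Appengine-Cron: true` header.

```python
import subprocess


def is_cron_request(environ):
    # WSGI exposes the X-Appengine-Cron header under this key.
    return environ.get("HTTP_X_APPENGINE_CRON") == "true"


def make_app(command, launcher=subprocess.Popen):
    """Return a WSGI app that starts `command` when hit by cron.

    `command` is a placeholder argv list, e.g.
    ["python", "my_dataflow.py", "--runner=DataflowRunner"].
    `launcher` is injectable so the handler can be exercised without
    spawning a real process.
    """

    def app(environ, start_response):
        headers = [("Content-Type", "text/plain")]
        if not is_cron_request(environ):
            start_response("403 Forbidden", headers)
            return [b"cron only\n"]
        # Popen returns without waiting for the child, so the HTTP
        # request does not block on the pipeline launch.
        launcher(command)
        start_response("202 Accepted", headers)
        return [b"pipeline launch started\n"]

    return app
```

A cron entry would then point at whatever URL this handler is mounted on. The handler only starts the launch and does not report whether the job itself succeeded.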








