Triggering Apache Beam (Python) from a GAE cron job

Date: 2021-04-21 15:34:57

In replacing my old appengine-mapreduce job, I need a way to trigger this python dataflow job from my cron.

I have read https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions but am unclear on how it fully translates to Python.

Cloud Functions do not have Python installed, and I'm not sure whether or how a portable Python could be installed there. So I assume triggering from my Managed VM Python instance will be easier. As far as I can tell, it will be something like this:

  • I am using GAE Flexible VMs (no sandbox).
  • I can include the apache_beam libraries (needed to run my_dataflow.py) in my Docker image.
  • I can upload these files with my project push so they are accessible from the VM disk: my_dataflow.py, setup.py (which installs my library dependencies), and apache-beam.tar.gz (since I'm writing against the 0.7.0 API, which is not yet released on PyPI).
  • I can call my_dataflow.run(), pointing PipelineOptions at the setup.py and apache-beam.tar.gz.
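
The last step might look roughly like the sketch below. It is only an illustration under assumptions: that my_dataflow.run() accepts a list of command-line style flags, that the runner is named DataflowRunner in the SDK version in use, and that the project, bucket, job name and application directory are placeholders to be replaced. The --setup_file and --sdk_location flags are the Beam Python options for shipping dependencies and a local SDK tarball to the workers.

```python
import os


def build_pipeline_args(project, bucket, app_dir, job_name="my-dataflow-job"):
    """Return command-line style flags for a Dataflow run.

    project, bucket, job_name and app_dir are placeholders; app_dir is
    wherever the pushed files end up on the VM disk.
    """
    return [
        "--runner=DataflowRunner",
        "--project=" + project,
        "--job_name=" + job_name,
        "--staging_location=gs://%s/staging" % bucket,
        "--temp_location=gs://%s/temp" % bucket,
        # setup.py installs the pipeline's library dependencies on the workers.
        "--setup_file=" + os.path.join(app_dir, "setup.py"),
        # Stage the local SDK tarball rather than a released SDK.
        "--sdk_location=" + os.path.join(app_dir, "apache-beam.tar.gz"),
    ]


def trigger(project, bucket, app_dir):
    """Start the job; assumes my_dataflow.run(argv) exists with this shape."""
    # Imported lazily so this module still loads where Beam is not installed.
    import my_dataflow  # hypothetical signature: run(argv)

    my_dataflow.run(build_pipeline_args(project, bucket, app_dir))
```

A cron handler on the VM would then call trigger() with the real project and bucket names.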

Is that it, or am I missing any other steps? Hoping to avoid barking up the wrong tree here, and worried about running into known impassable roadblocks after spending a few hours pushing-and-repushing trying to get this working.

1 answer

#1


-1  

Yes, templates are currently Java-only.

You may be able to use this technique to invoke your pipeline periodically instead. It doesn't use a template pipeline; it launches a normal pipeline. You can set up a Cloud Function that launches the pipeline by running it as a subprocess. There are various ways of invoking the Cloud Function; this one uses the App Engine cron service.

https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
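
As a rough Python sketch of the subprocess idea, the handler below launches the pipeline script as a child process when the request comes from the cron service. It is not the code from the linked post, which targets Cloud Functions; it assumes a Python handler (for example on a Flexible VM), and the function names, the default command and the plain-dict headers are all hypothetical. The one platform detail relied on is that App Engine cron requests carry the header X-Appengine-Cron: true.

```python
import subprocess
import sys

# Hypothetical launch command: run the pipeline script with the current
# interpreter; real flags (project, staging location, etc.) would be added.
DEFAULT_COMMAND = [sys.executable, "my_dataflow.py", "--runner=DataflowRunner"]


def is_cron_request(headers):
    """True if the request carries the header App Engine cron adds.

    headers is treated as a plain mapping here; a web framework's header
    object would normally make this lookup case-insensitive.
    """
    return headers.get("X-Appengine-Cron") == "true"


def launch_pipeline(command, timeout=None):
    """Run the launcher as a child process and return its exit code."""
    completed = subprocess.run(command, timeout=timeout)
    return completed.returncode


def handle_cron(headers, command=None):
    """Return an HTTP-style status code for a cron-triggered launch."""
    if not is_cron_request(headers):
        return 403  # refuse requests that did not come from cron
    exit_code = launch_pipeline(command or DEFAULT_COMMAND)
    return 200 if exit_code == 0 else 500
```

Running the launcher as a subprocess keeps the pipeline's dependencies and any crash isolated from the web process; the handler only needs the exit code to decide what status to return to cron.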
