Is it possible to provision Dataflow workers with custom packages? I'd like to shell out to a Debian-packaged binary from inside a computation.
Edit: To be clear, the package configuration is sufficiently complex that it's not feasible to just bundle the files in --filesToStage. The solution should involve installing the Debian package at some point.
1 solution
#1
4
This is not something Dataflow explicitly supports. However, here are some suggestions for how you could accomplish it. Keep in mind that the service could change in ways that break this approach in the future.
There are two separate problems:
- Getting the Debian package onto the worker.
- Installing the Debian package.
For the first problem, you can use --filesToStage and specify the path to your Debian package. This causes the package to be uploaded to GCS and then downloaded to the worker on startup. If you set --filesToStage explicitly, you must also list all of your jars in its value, because they are no longer included by default.
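Since setting --filesToStage replaces the default list, one way to build its value is to start from the JVM classpath and append the package. This is only a minimal sketch: it assumes the jars you need are exactly the entries on java.class.path and that the flag accepts a comma-separated list. The class name, method names, and the example package path are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: builds a value for --filesToStage that keeps the
// classpath entries and adds one extra file (the Debian package).
class StagingList {

    // Splits a classpath string on the platform separator and appends extraFile.
    static List<String> build(String classpath, String extraFile) {
        List<String> files = new ArrayList<>();
        for (String entry : classpath.split(Pattern.quote(File.pathSeparator))) {
            if (!entry.isEmpty()) {
                files.add(entry);
            }
        }
        files.add(extraFile);
        return files;
    }

    // Assumption: the flag takes a comma-separated list of paths.
    static String asFlag(List<String> files) {
        return "--filesToStage=" + String.join(",", files);
    }
}
```

A caller might pass System.getProperty("java.class.path") as the first argument and the local path of the .deb file as the second.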
On the Java worker, any files passed in --filesToStage will be available in one of the following directories (or a subdirectory of one of them):
/var/opt/google/dataflow
or
/dataflow/packages
You would need to check both locations to be sure of finding the file.
We provide no guarantee that these directories won't change in the future. These are simply the locations used today.
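Checking both locations, including subdirectories, could look roughly like the sketch below. The class and method names are hypothetical, and matching on a file-name suffix is an assumption; adjust it to however the file is actually named on your worker.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper: searches the staging directories for a staged file.
class StagedFileFinder {

    // The two locations named above; they may change without notice.
    static final List<Path> DEFAULT_ROOTS = Arrays.asList(
            Paths.get("/var/opt/google/dataflow"),
            Paths.get("/dataflow/packages"));

    // Returns the first regular file under any root whose name ends with suffix.
    static Optional<Path> find(List<Path> roots, String suffix) throws IOException {
        for (Path root : roots) {
            if (!Files.isDirectory(root)) {
                continue; // this location does not exist on this worker
            }
            try (Stream<Path> walk = Files.walk(root)) {
                Optional<Path> hit = walk
                        .filter(Files::isRegularFile)
                        .filter(p -> p.getFileName().toString().endsWith(suffix))
                        .findFirst();
                if (hit.isPresent()) {
                    return hit;
                }
            }
        }
        return Optional.empty();
    }
}
```

For example, StagedFileFinder.find(StagedFileFinder.DEFAULT_ROOTS, ".deb") returns an empty Optional when neither directory contains a matching file, which the caller should treat as an error.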
To solve the second problem, you can override startBundle in your DoFn. From there you can shell out to the command line and install your Debian package after finding it in one of the directories above (for example, /dataflow/packages).
There could be multiple instances of your DoFn running side by side, so you could get contention if two processes try to install your package simultaneously. I'm not sure whether the Debian package system handles this or whether you need to handle it explicitly in your code.
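One way to sidestep the contention question is to serialize the install yourself and record that it has been done. The sketch below is an illustration, not something Dataflow provides: the class, the lock file, and the marker file are all hypothetical, and it assumes the worker process is allowed to run dpkg.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Callable;

// Hypothetical helper: runs an install action at most once per machine.
class InstallOnce {

    private static final Object JVM_LOCK = new Object();

    // Runs action only if marker does not exist yet. The monitor serializes
    // threads in this JVM; the file lock serializes separate processes.
    // Returns true if this call performed the install.
    static boolean runOnce(Path lockFile, Path marker, Callable<Integer> action)
            throws Exception {
        synchronized (JVM_LOCK) {
            try (FileChannel channel = FileChannel.open(lockFile,
                         StandardOpenOption.CREATE, StandardOpenOption.WRITE);
                 FileLock lock = channel.lock()) {
                if (Files.exists(marker)) {
                    return false; // an earlier caller already installed it
                }
                int exit = action.call();
                if (exit != 0) {
                    throw new IOException("install command exited with " + exit);
                }
                Files.createFile(marker);
                return true;
            }
        }
    }

    // Shells out to dpkg and returns its exit code.
    // Assumption: dpkg is on the PATH and the process has permission to install.
    static int dpkgInstall(Path deb) throws IOException, InterruptedException {
        return new ProcessBuilder("dpkg", "-i", deb.toString())
                .inheritIO()
                .start()
                .waitFor();
    }
}
```

From startBundle you could then call InstallOnce.runOnce(lockFile, marker, () -> InstallOnce.dpkgInstall(deb)), where lockFile and marker are paths of your choosing in a writable directory.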
A slight variant of this approach is to skip --filesToStage for distributing the package to your workers and instead add code to your startBundle that fetches it from some location.
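As a rough sketch of that variant, the fetch step could be a plain stream copy. This assumes the package is reachable through a URL scheme that java.net.URL understands (such as https or file); fetching from GCS directly would need a GCS client, which is not shown here. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: downloads a package to a local path on the worker.
class PackageFetcher {

    // Copies the bytes at source to target, replacing any existing file.
    static Path fetch(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```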