According to Google's Dataflow documentation, Dataflow job template creation is "currently limited to Java and Maven." However, the documentation for Java across GCP's Dataflow site is... messy, to say the least. The 1.x and 2.x versions of Dataflow are pretty far apart in their details, and I have some specific code requirements that lock me into the 2.0.0r3 codebase, so I'm pretty much required to use Apache Beam. Apache is -- understandably -- quite dedicated to Maven, but institutionally my company has thrown the bulk of its weight behind Gradle, so much so that it migrated all of its Java projects over to Gradle last year and has pushed back against re-introducing Maven.
However, now we seem to be at an impasse: we have a specific goal of centralizing a lot of our back-end data gathering in GCP's Dataflow, and GCP Dataflow doesn't appear to have formal support for Gradle. If it does, it's not in the official documentation.
Is there a sufficient technical basis to actually build Dataflow templates with Gradle, and is the issue simply that Google's docs haven't been updated to reflect it? Or is there a technical reason why Gradle can't do what's being done with Maven? Is there a better guide for working with GCP Dataflow than the docs on Google's and Apache's websites? I haven't worked with Maven archetypes before, and all the searches I've done for "gradle archetypes" turn up results that are, at best, over a year old. Most of the information points to forum discussions from 2014 and version 1.7rc3, but we're on 3.5. This feels like it ought to be a solved problem, but for the life of me I can't find any current information on it online.
2 Answers
#1
There's absolutely nothing stopping you writing your Dataflow application/pipeline in Java, and using Gradle to build it.
Gradle will simply produce an application distribution (e.g. ./gradlew clean distTar), which you then extract and run with the --runner=TemplatingDataflowPipelineRunner --dataflowJobFile=gs://... parameters.
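For example, a minimal sketch of that sequence, assuming the Gradle application plugin and a distribution named my-pipeline (the distribution name, project ID, and bucket paths are all placeholders):

$ ./gradlew clean distTar
$ tar -xf build/distributions/my-pipeline.tar
$ ./my-pipeline/bin/my-pipeline \
    --runner=TemplatingDataflowPipelineRunner \
    --project=my-gcp-project \
    --stagingLocation=gs://my-bucket/staging \
    --dataflowJobFile=gs://my-bucket/templates/my-template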
It's just a runnable Java application.
The template and all the binaries will then be uploaded to GCS, and you can execute the pipeline through the console, CLI or even Cloud Functions.
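For instance, a staged template can later be launched from the CLI with something like this (the job name and GCS path are placeholders):

$ gcloud dataflow jobs run my-job \
    --gcs-location=gs://my-bucket/templates/my-template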
You don't even need to use Gradle. You could just run it locally and the template/binaries will be uploaded. But I'd imagine you are using a build server like Jenkins.
Maybe the Dataflow docs should read "Note: Template creation is currently limited to Java", because this feature is not available in the Python SDK yet.
#2
Commandline to Run Cloud Dataflow Job With Gradle
Generic Execution
$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MyPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow" -Pdataflow-runner
Specific Example
$ gradle clean execute -DmainClass=com.foo.bar.myfolder.MySpannerPipeline -Dexec.args="--runner=DataflowRunner --gcpTempLocation=gs://my-bucket/tmpdataflow --spannerInstanceId=fooInstance --spannerDatabaseId=barDatabase" -Pdataflow-runner
Explanation of Commandline
- gradle clean execute uses the execute task, which allows us to easily pass command-line flags to the Dataflow pipeline. The clean task removes previous build outputs.
- -DmainClass= specifies the Java main class, since we have multiple pipelines in a single folder. Without this, Gradle doesn't know what the main class is or where to pass the args. Note: your build.gradle file must include the execute task (see below).
- -Dexec.args= specifies the execution arguments, which will be passed to the pipeline. Note: your build.gradle file must include the execute task (see below).
- --runner=DataflowRunner and -Pdataflow-runner ensure that the Google Cloud Dataflow runner is used and not the local DirectRunner.
- --spannerInstanceId= and --spannerDatabaseId= are just pipeline-specific flags; your pipeline will have its own (or none at all). See the sketch after this list for how such flags are typically defined.
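For illustration only, pipeline-specific flags like these usually map onto a custom PipelineOptions interface; here's a minimal sketch, assuming names matching the example above (the interface and descriptions are hypothetical, not part of the answer itself):

import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

// Hypothetical interface backing --spannerInstanceId= and --spannerDatabaseId=
public interface SpannerPipelineOptions extends PipelineOptions {
    @Description("Cloud Spanner instance to read from")
    @Validation.Required
    String getSpannerInstanceId();
    void setSpannerInstanceId(String value);

    @Description("Cloud Spanner database to read from")
    @Validation.Required
    String getSpannerDatabaseId();
    void setSpannerDatabaseId(String value);
}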
build.gradle contents (NOTE: you need to populate your own project-specific dependencies):
apply plugin: 'java'
apply plugin: 'maven'
apply plugin: 'application'

group = 'com.foo.bar'
version = '0.3'

// Resolved from the -DmainClass= flag passed on the command line
mainClassName = System.getProperty("mainClass")

sourceCompatibility = 1.8
targetCompatibility = 1.8

repositories {
    maven { url "https://repository.apache.org/content/repositories/snapshots/" }
    maven { url "https://repo.maven.apache.org/maven2" }
}

dependencies {
    compile group: 'org.apache.beam', name: 'beam-sdks-java-core', version: '2.5.0'
    // Insert your build deps for your Beam Dataflow project here
    runtime group: 'org.apache.beam', name: 'beam-runners-direct-java', version: '2.5.0'
    runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0'
}

// Runs the pipeline class named by -DmainClass=, forwarding -Dexec.args= as program args.
// Note: -Dexec.args must be provided on the command line, or the split() below will fail.
task execute(type: JavaExec) {
    main = System.getProperty("mainClass")
    classpath = sourceSets.main.runtimeClasspath
    systemProperties System.getProperties()
    args System.getProperty("exec.args").split()
}
Explanation of build.gradle
- We use task execute(type: JavaExec) in order to easily pass runtime flags into the Java Dataflow pipeline program. For example, we can specify which class is the main class (since we have more than one pipeline in the same folder), and we can pass specific Dataflow arguments (i.e., specific PipelineOptions).
- The line of build.gradle that reads runtime group: 'org.apache.beam', name: 'beam-runners-google-cloud-dataflow-java', version: '2.5.0' is very important. It provides the Cloud Dataflow runner that allows you to execute pipelines on Google Cloud Platform.
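To show how the pieces fit together, here is a minimal sketch of a pipeline class that could be named by -DmainClass= (the class name, transform, and bucket path are placeholders, not part of the answer above):

package com.foo.bar.myfolder;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class MyPipeline {
    public static void main(String[] args) {
        // Parses flags like --runner=DataflowRunner and --gcpTempLocation=,
        // which the execute task forwards from -Dexec.args=
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();

        Pipeline p = Pipeline.create(options);
        p.apply("CreateData", Create.of("hello", "dataflow"))
         .apply("WriteData", TextIO.write().to("gs://my-bucket/output/out"));
        p.run();
    }
}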