How to run a Cloud Dataflow pipeline using the Spark runner?

Time: 2022-11-01 15:35:59

I have read that Google Cloud Dataflow pipelines, which are based on Apache Beam SDK, can be run with Spark or Flink.

I have some Dataflow pipelines currently running on GCP using the default Cloud Dataflow runner, and I want to run them using the Spark runner, but I don't know how.

Is there any documentation or guide about how to do this? Any pointers will help.

Thanks.

2 Answers

#1

I'll assume you're using Java, but the equivalent process applies to Python.

You need to migrate your pipeline to use the Apache Beam SDK, replacing your Google Dataflow SDK dependency with:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-core</artifactId>
  <version>2.4.0</version>
</dependency>
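
Note that migrating off the Dataflow SDK 1.x also means updating your imports, since the Dataflow SDK classes live under com.google.cloud.dataflow.sdk while their Beam equivalents live under org.apache.beam.sdk. As a rough illustration:

// Before: Google Cloud Dataflow SDK 1.x
// import com.google.cloud.dataflow.sdk.Pipeline;
// import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;

// After: Apache Beam SDK
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;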

Then add the dependency for the runner you wish to use:

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-spark</artifactId>
  <version>2.4.0</version>
</dependency>
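
One caveat: the Spark runner treats Spark itself as a provided dependency, so if you run against a local or embedded Spark you will likely also need the Spark artifacts on your classpath. The exact artifact names and versions depend on which Spark your Beam release was built against, so treat the coordinates below as placeholders and check the Spark runner page for your Beam version:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>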

Then add --runner=SparkRunner to your pipeline options to specify that this runner should be used when submitting the pipeline.

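As a minimal sketch (the class name and paths are just placeholders), a pipeline entry point that picks the runner up from the command line via PipelineOptionsFactory might look like this:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CopyPipeline {
  public static void main(String[] args) {
    // Parses --runner=SparkRunner (and any other flags) from the command line.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();

    Pipeline p = Pipeline.create(options);

    // A trivial read/write just to show the shape; replace the paths with your own.
    p.apply("ReadLines", TextIO.read().from("/path/to/input.txt"))
     .apply("WriteLines", TextIO.write().to("/path/to/output"));

    p.run().waitUntilFinish();
  }
}

It can then be submitted with something like mvn compile exec:java -Dexec.mainClass=CopyPipeline -Dexec.args="--runner=SparkRunner --sparkMaster=local[4]", where --sparkMaster is the Spark runner's option for the Spark master URL; adjust or drop it to match your cluster.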

See https://beam.apache.org/documentation/runners/capability-matrix/ for the full list of runners and comparison of their capabilities.

#2

Thanks to multiple tutorials and pieces of documentation scattered all over the web, I was finally able to form a coherent picture of how to use the Spark runner with any Beam SDK based pipeline.

I have documented the entire process here for future reference: http://opreview.blogspot.com/2018/07/running-apache-beam-pipeline-using.html.
