What is the difference between Google Cloud Dataflow and Google Cloud Dataproc?

Date: 2022-03-14 15:25:10

I am using Google Cloud Dataflow to implement an ETL data warehouse solution.

Looking into the Google Cloud offerings, it seems Dataproc can also do the same thing.

It also seems Dataproc is a little cheaper than Dataflow.

Does anybody know the pros and cons of Dataflow over Dataproc?

Why does Google offer both?

2 Solutions

#1 (17 votes)

Yes, Cloud Dataflow and Cloud Dataproc can both be used to implement ETL data warehousing solutions.

An overview of why each of these products exists can be found in the Google Cloud Platform Big Data Solutions articles.

Quick takeaways:


  • Cloud Dataproc provides you with a Hadoop cluster on GCP, and access to Hadoop-ecosystem tools (e.g. Apache Pig, Hive, and Spark); this has strong appeal if you are already familiar with Hadoop tools and have Hadoop jobs.
  • Cloud Dataflow provides you with a place to run Apache Beam-based jobs on GCP, and you do not need to address common aspects of running jobs on a cluster (e.g. balancing work, or scaling the number of workers for a job; by default, this is automatically managed for you, and applies to both batch and streaming) -- this can be very time-consuming on other systems.
    • Apache Beam is an important consideration; Beam jobs are intended to be portable across "runners," which include Cloud Dataflow, and enable you to focus on your logical computation rather than on how a "runner" works. In comparison, when authoring a Spark job, your code is bound to the runner, Spark, and to how that runner works.
    • Cloud Dataflow also offers the ability to create jobs based on "templates," which can help simplify common tasks where the differences are parameter values.

#2 (4 votes)

Same reason as why Dataproc offers both Hadoop and Spark: sometimes one programming model is the best fit for the job, sometimes the other. Likewise, in some cases the best fit for the job is the Apache Beam programming model, offered by Dataflow.


In many cases, a big consideration is that one already has a codebase written against a particular framework, and one just wants to deploy it on the Google Cloud, so even if, say, the Beam programming model is superior to Hadoop, someone with a lot of Hadoop code might still choose Dataproc for the time being, rather than rewriting their code on Beam to run on Dataflow.

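To illustrate why that migration path is attractive: an existing Spark jar can be submitted to a Dataproc cluster essentially unchanged (cluster, class, bucket, and argument names below are placeholders).

```shell
# Submit an existing Spark job to a Dataproc cluster (placeholder names).
gcloud dataproc jobs submit spark \
    --cluster=my-cluster \
    --region=us-central1 \
    --class=com.example.MyEtlJob \
    --jars=gs://my-bucket/my-etl-job.jar \
    -- arg1 arg2
```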

The differences between Spark and Beam programming models are quite large, and there are a lot of use cases where each one has a big advantage over the other. See https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison .

