I want to read incoming data on a Google PubSub topic, handle the data and transform it into a unified data structure and then insert it into a dataset in Google BigQuery. From what I understand, it is possible to use some kind of pipeline that streams the data. However, I'm having trouble finding any good and concise examples that achieve this.
My project is written in Scala, so I would prefer examples written in that language. Otherwise something concise in Java works too.
Thanks!
1 Answer
#1
3
I would say Google Cloud Dataflow is the correct product for your use case. It is used precisely for what you described: read input data from different sources (Pub/Sub in your case), transform it, and write it to a sink (BigQuery here).
Dataflow supports both batch and streaming pipelines. In a batch pipeline, all the data is available when the pipeline is created. A streaming pipeline, which is what you need, reads continuously from an unbounded source (a Pub/Sub subscription, for example) and processes each element as soon as it arrives in the pipeline.
In addition, the Dataflow team has recently released a beta version of some templates that make it easier to get started with Dataflow. There is even a Cloud Pub/Sub to BigQuery template, which you can use as is, or whose source code (available in the official GitHub repository) you can modify to add any transformation you want to apply between the Pub/Sub read and the BigQuery write.
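To make the "transformation between the Pub/Sub read and the BigQuery write" a bit more concrete, here is a minimal sketch of that middle step in plain Java, with no Beam dependency so it can be run on its own. It relies on the fact that a Pub/Sub message carries a byte payload plus string attributes. Everything else is an assumption for illustration: the `UnifiedEvent` class, its field names, and the attribute keys `source`, `event_id` and `ts` are hypothetical and should be replaced by whatever your messages and your BigQuery table actually contain.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class UnifiedEventMapper {

    /** Unified structure; the field names are hypothetical and would mirror the BigQuery columns. */
    public static final class UnifiedEvent {
        public final String source;
        public final String eventId;
        public final long timestampMillis;
        public final String body;

        UnifiedEvent(String source, String eventId, long timestampMillis, String body) {
            this.source = source;
            this.eventId = eventId;
            this.timestampMillis = timestampMillis;
            this.body = body;
        }

        /** Flattens the event into column name -> value pairs, one entry per table column. */
        public Map<String, Object> toRow() {
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("source", source);
            row.put("event_id", eventId);
            row.put("timestamp_millis", timestampMillis);
            row.put("body", body);
            return row;
        }
    }

    /**
     * Maps one message (payload bytes plus string attributes) to the unified structure.
     * The attribute keys "source", "event_id" and "ts" are assumptions made for this sketch.
     */
    public static UnifiedEvent fromMessage(byte[] payload,
                                           Map<String, String> attributes,
                                           long receivedAtMillis) {
        String source = attributes.getOrDefault("source", "unknown");
        String eventId = attributes.getOrDefault("event_id", "");
        // Fall back to the receive time when the producer sent no usable timestamp.
        long timestamp = parseMillisOrDefault(attributes.get("ts"), receivedAtMillis);
        String body = new String(payload, StandardCharsets.UTF_8).trim();
        return new UnifiedEvent(source, eventId, timestamp, body);
    }

    private static long parseMillisOrDefault(String raw, long fallback) {
        if (raw == null) {
            return fallback;
        }
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        Map<String, String> attributes =
                Map.of("source", "sensor-a", "event_id", "42", "ts", "1500000000000");
        UnifiedEvent event = fromMessage(
                "hello".getBytes(StandardCharsets.UTF_8), attributes, System.currentTimeMillis());
        System.out.println(event.toRow());
    }
}
```

In a Beam pipeline, logic like `fromMessage` would typically sit inside a `DoFn` (or a `MapElements` step) placed between the Pub/Sub read and the BigQuery write. Keeping the mapping in a plain static method like this also means you can unit test it without running a pipeline.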
Note that the latest Dataflow Java SDK is based on Apache Beam, which has plenty of documentation and code references that you may find useful:
- Built-in I/O Transforms (for reading from and writing to Pub/Sub, BigQuery, and many other systems)
- Java SDK Reference (where you will find details of all the classes available in the SDK)
- Apache Beam Programming Guide (a complete description of the fundamentals of Apache Beam, and everything you should take into consideration)
- Comparison between the Dataflow (Apache Beam) and Spark (using Scala, for example) programming models