Data format in Cloud Storage when streaming Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow?

Time: 2021-11-12 15:21:46

We are looking to stream Pub/Sub messages (JSON strings) from Pub/Sub using Dataflow and then write them to Cloud Storage. I am wondering what the best data format would be when writing the data to Cloud Storage. My further use case might also involve using Dataflow to read from Cloud Storage again, for further operations that persist to a data lake as needed. A few of the options I was considering:

a) Use Dataflow to write the JSON strings directly to Cloud Storage? I assume every line in the file in Cloud Storage would then be treated as a single message when reading back from Cloud Storage and processing further into the data lake, right? (A sketch of this option follows below.)

b) Transform the JSON to a text file format using Dataflow and save it in Cloud Storage.

c) Any other options?
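For illustration of option a), a minimal Apache Beam (Python) sketch of such a streaming pipeline might look like the following. The topic name, bucket path, and 60-second window are hypothetical placeholders, and the pipeline options are kept minimal; a real Dataflow job would also need project, region, and runner settings.

```python
import apache_beam as beam
from apache_beam.io import fileio
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.window import FixedWindows

# Hypothetical resource names, for illustration only.
TOPIC = "projects/my-project/topics/my-topic"
OUTPUT_PATH = "gs://my-bucket/messages/"

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Each Pub/Sub message arrives as raw bytes.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic=TOPIC)
        | "DecodeUtf8" >> beam.Map(lambda msg: msg.decode("utf-8"))
        # An unbounded source must be windowed before writing to files.
        | "Window" >> beam.WindowInto(FixedWindows(60))
        # TextSink writes one element per line, so each line in the
        # output file reads back as exactly one JSON message.
        | "WriteToGCS" >> fileio.WriteToFiles(path=OUTPUT_PATH,
                                              sink=fileio.TextSink())
    )
```

Written this way, the files contain newline-delimited JSON, which a later batch pipeline can read back line by line (for example with beam.io.ReadFromText), with each line treated as one message.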


1 Answer

#1



You could store your data in JSON format for further use in BigQuery if you need to analyze it later. The Dataflow solution that you mention in option a) would be a good way to handle your scenario. Additionally, you could use Cloud Functions with a Pub/Sub trigger and then write the content to Cloud Storage. You could use the code shown in this tutorial as a base for this scenario, as it publishes the information to a topic, then gathers the message from the topic and creates a Cloud Storage object with the message as its content.
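As a rough illustration of that Cloud Functions approach (not the tutorial's exact code), a background function with a Pub/Sub trigger could look like the sketch below; the bucket name and object naming scheme are hypothetical.

```python
import base64

from google.cloud import storage

BUCKET_NAME = "my-bucket"  # hypothetical bucket name, for illustration only


def pubsub_to_gcs(event, context):
    """Background Cloud Function triggered by a Pub/Sub message.

    Decodes the base64-encoded payload and stores it as one
    Cloud Storage object per message, keyed by the event ID.
    """
    payload = base64.b64decode(event["data"]).decode("utf-8")

    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    blob = bucket.blob(f"messages/{context.event_id}.json")
    blob.upload_from_string(payload, content_type="application/json")
```

Deployed with a Pub/Sub trigger (for example, gcloud functions deploy pubsub_to_gcs --runtime=python39 --trigger-topic=my-topic, with a placeholder topic name), each published message becomes its own JSON object in the bucket.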

