What are the pros and cons of streaming data directly to BigQuery versus publishing data to Pub/Sub and then using Dataflow to insert it into BigQuery?

Time: 2021-07-01 15:35:50

As far as I know, streaming data directly to BigQuery can cause duplicate rows, as mentioned here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#real-time_dashboards_and_queries

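The docs linked above describe best-effort de-duplication for streaming inserts: if each row carries an `insertId`, BigQuery drops retried rows with the same id within a short window. A minimal sketch using the `google-cloud-bigquery` client library; the table name and row shape are placeholders, not details from this question:

```python
import hashlib
import json

def row_id(row):
    # Deterministic insertId: a content hash, so a retried copy of the same
    # row carries the same id and BigQuery's best-effort de-duplication
    # can drop it.
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def stream_rows(table, rows):
    # Performs the actual streaming insert. Requires GCP credentials and an
    # existing table, so it is only defined here, never called at import time.
    from google.cloud import bigquery

    client = bigquery.Client()
    errors = client.insert_rows_json(table, rows,
                                     row_ids=[row_id(r) for r in rows])
    if errors:
        raise RuntimeError(f"streaming insert errors: {errors}")

# Usage (needs credentials and a real table):
# stream_rows("my_dataset.events", [{"user": "a", "latency_ms": 120}])
```

Note this only suppresses duplicates from client retries within a short window; it is not a guarantee, which is why the answer below reaches for Dataflow when duplicates must be eliminated.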

On the other hand, will uploading data to Pub/Sub and then using Dataflow to insert it into BigQuery prevent duplicate rows? There is also a tutorial on real-time data analysis here: https://cloud.google.com/solutions/real-time/fluentd-bigquery


So what are the other pros and cons, and in which cases should I use Dataflow to stream data from Pub/Sub?


1 solution

#1



With Google Dataflow and Pub/Sub you have full control over your streaming data: you can slice and dice the data in real time, implement your own business logic, and finally write the results to a BigQuery table. On the other hand, if you stream data directly into BigQuery with streaming inserts, you give up that control over your data.

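A skeleton of the pipeline described above: read from Pub/Sub, apply your own business logic in flight, and write to BigQuery. The topic, table, and the `enrich` transform are illustrative assumptions, not details from the answer; running it requires `apache-beam[gcp]` and real GCP resources.

```python
import json

def enrich(message):
    # Placeholder business logic applied in-flight: parse the Pub/Sub
    # payload and derive an extra field before the row reaches BigQuery.
    row = json.loads(message)
    row["ok"] = row.get("status", 0) < 400
    return row

def build_pipeline(argv=None):
    # Pipeline construction only; the Beam import lives here so the pure
    # `enrich` function above stays usable without the SDK installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline = beam.Pipeline(options=PipelineOptions(argv, streaming=True))
    (pipeline
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "Enrich" >> beam.Map(enrich)
     | "Write" >> beam.io.WriteToBigQuery(
         "my-project:my_dataset.events",
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
    return pipeline
```

Any per-element transform (filtering, enrichment, validation) slots in between the read and the write; that is the control you lose with direct streaming inserts.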

The pros and cons really depend on what you need to do with your streaming data. If you are doing flat insertion, there is no need for Dataflow; but if you need serious computation over the stream, such as group-by-key, merge, partition, or sum, then Dataflow is probably the best approach. One thing to keep in mind is cost: once you start pushing a serious volume of data through Pub/Sub and manipulating it with Dataflow, it starts getting costly.


To answer your question: yes, you can eliminate duplicate rows with Dataflow. Since Dataflow has full control of the data, you can use pipeline filters to check for any condition that marks a row as a duplicate. My current Dataflow use case is manipulating customer log records in real time, with heavy pre-aggregation done in Dataflow and the log stream delivered through Pub/Sub. Dataflow is very powerful for both batch and streaming data manipulation. Hope this helps.

