Window functions in Dataflow and BigQuery

Time: 2022-08-13 15:25:58

I am looking at analysing streaming data (web events).

Is there a good rule of thumb to help me determine if I should

  1. Perform Grouping and Aggregation in Dataflow and write the output

or

  1. Use Dataflow to stream into BigQuery, then possibly use a range decorator to limit the data or a window function over partitions, and aggregate via SQL.

Looking at the examples in the documentation and in this article (https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison), Classic Batch Programming, Hourly Team Scores, All-time User Scores, and User Behaviour Analysis feel like they are straightforward to create via SQL (given that "created" and "write" timestamps are recorded).
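
To make the "straightforward via SQL" point concrete, here is a minimal sketch of an hourly team score aggregation. It uses Python's built-in sqlite3 module purely as a stand-in for BigQuery so that it runs anywhere; the table name, column names, and sample rows are all hypothetical, and BigQuery's own timestamp functions differ from SQLite's `strftime`.

```python
import sqlite3

# Hypothetical sample events: (team, score, created), with "created"
# stored as a UTC timestamp string.
events = [
    ("red", 5, "2022-08-13 15:05:00"),
    ("red", 7, "2022-08-13 15:40:00"),
    ("blue", 3, "2022-08-13 15:59:59"),
    ("red", 2, "2022-08-13 16:00:00"),
]


def hourly_team_scores(rows):
    """Sum scores per team per hour using a plain GROUP BY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (team TEXT, score INTEGER, created TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # Truncate each timestamp to the start of its hour, then aggregate.
    query = """
        SELECT team,
               strftime('%Y-%m-%d %H:00:00', created) AS hour,
               SUM(score) AS total
        FROM events
        GROUP BY team, hour
        ORDER BY hour, team
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in hourly_team_scores(events):
        print(row)
```

Fixed, non-overlapping hourly buckets like these map directly onto GROUP BY, which is why this style of aggregation feels natural in SQL.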

For the Spam Filtering example, I can see the limitations of using BQ if this is applied on a per-event streaming basis.

The semantics of Dataflow and BQ seem to overlap in terms of GroupBy, Join, Combine, and Windowing, and BQ supports streaming inserts with availability within seconds, which is easily short enough for hour-level aggregation.

Is there something fundamental I have not understood? Or is there a case where streaming into BigQuery and then querying will start to become unreliable?

Thank you

Chris

(Apologies if this question is a bit vague - happy to be redirected to a better place to ask)

2 answers

#1


3  

Whether to perform grouping and aggregation in Dataflow or with BigQuery operations (after ingesting the data using Dataflow) depends on the application logic and on what consumes the output. For example, sessions and sliding windows are both hard to express in SQL, while Dataflow supports arbitrary processing such as triggered estimates. Another thing to consider is that it may be easier to express the computation logic in an imperative programming language than in SQL.
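
As an illustration of why sessions tend to be easier to express imperatively, here is a minimal sketch of gap-based sessionization in plain Python. It is not Dataflow code: it runs over a finished batch, ignores late data and triggers, and the function name, the 30-minute default gap, and the (user, timestamp) event shape are all assumptions made for the example.

```python
from collections import defaultdict


def sessionize(events, gap_seconds=1800):
    """Group (user, unix_timestamp) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event is at least gap_seconds. Returns, per user, a list of
    (session_start, session_end, event_count) tuples.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        user_sessions = []
        for ts in stamps[1:]:
            if ts - current[-1] >= gap_seconds:
                # Gap reached: close the current session, start a new one.
                user_sessions.append(current)
                current = [ts]
            else:
                current.append(ts)
        user_sessions.append(current)
        sessions[user] = [(s[0], s[-1], len(s)) for s in user_sessions]
    return sessions


if __name__ == "__main__":
    sample = [("a", 0), ("a", 600), ("a", 4000), ("b", 100)]
    print(sessionize(sample))
```

The session boundaries here depend on the data itself rather than on fixed clock intervals, which is what makes the same logic awkward as a simple GROUP BY.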

#2


1  

The below does not necessarily answer your exact question, but rather adds another aspect to consider:
1. If you are building a process that is supposed to power your infrastructure, Dataflow might be a good choice. Of course, you are then bound by your tech team's resources.
2. If you plan for ad hoc and self-serve activity by non-technical personnel (technical personnel are of course not excluded either), you can focus on BigQuery's query features (including window functions) and make sure you have good, real working examples that the rest of your company can use as templates to start leveraging the power of BigQuery and GCP in general. This has proved to work great! Domain experts can now answer their own questions (like those you listed in your question) without tech people in between. Quality and timing are much better in this scenario!
