Why doesn't Dataflow support SortByKey?

Time: 2021-08-04 15:34:25

I was wondering why Dataflow does not support 'SortByKey' the way Apache Spark does.

I have a huge table in BigQuery that I cannot sort, because "Order By" is not scalable. So I was thinking of moving the BigQuery output to Dataflow and sorting it there. But there is no SortByKey, and it seems I would have to write a combiner.

Any suggestions will be appreciated.

1 answer

#1


Sorting (especially by key) requires globally serial processing, which is not a scalable operation. Apache Beam / Dataflow does not provide such support, as it is frequently unnecessary.

There are a variety of alternatives that generally address the need more scalably. For instance, you can sort the values within each key, which allows each key to be processed in parallel. Another common use case is TopN either globally or per-key. Again, this can be supported much more efficiently than actually sorting.
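To make the two alternatives concrete, here is a minimal sketch in plain Python (standard library only, not the Beam API). The function names `sort_values_per_key` and `top_n` and the sample data are made up for illustration. The first function sorts the values of each key independently, so no ordering across keys is ever needed. The second shows why TopN is cheaper than a full sort: each partition keeps only its own N candidates, and only those candidates are merged. If I remember correctly, the Beam Python SDK ships combiners along these lines (`beam.combiners.Top.Of` and `beam.combiners.Top.PerKey`), but check the SDK docs before relying on that.

```python
import heapq
from collections import defaultdict


def sort_values_per_key(pairs):
    """Group (key, value) pairs and sort the values of each key on their own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Each key's list is sorted independently of every other key,
    # so in a distributed runner the keys could be handled in parallel.
    return {key: sorted(values) for key, values in grouped.items()}


def top_n(partitions, n):
    """Global top-N: keep at most n candidates per partition, then merge them."""
    # Step 1: every partition reduces itself to at most n elements.
    partial = [heapq.nlargest(n, part) for part in partitions]
    # Step 2: only the small candidate lists are combined, never the full data.
    return heapq.nlargest(n, (x for candidates in partial for x in candidates))


# Hypothetical sample data.
pairs = [("a", 3), ("b", 9), ("a", 1), ("b", 4), ("a", 2)]
per_key = sort_values_per_key(pairs)

partitions = [[5, 1, 8], [7, 2], [9, 3, 6]]
top3 = top_n(partitions, 3)

print(per_key)
print(top3)
```

Neither function ever needs a single worker to see all of the data in order, which is the property that lets these operations scale where a total sort does not.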

Could you elaborate on what you need to sort by and why? It would make it possible to identify options for implementing this within the Beam and Dataflow SDKs.
