In a Beam custom combine function, does serialization happen even when the objects are on the "same" machine?

Time: 2022-05-28 15:34:53

We have a custom combine function (on Beam SDK 2.0) in which millions of objects get accumulated but do NOT necessarily get reduced... that is, they sometimes get added to a List such that, eventually, the List might get quite large (hundreds of megabytes, even gigabytes).

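For concreteness, here is a minimal sketch of the shape of such a CombineFn (the class name ListAccumulatingFn and the method bodies are illustrative only, not our actual code):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;

// Illustrative only: a CombineFn whose accumulator is a plain List, so it
// grows with the input instead of reducing it.
public class ListAccumulatingFn<T> extends Combine.CombineFn<T, List<T>, List<T>> {

  @Override
  public List<T> createAccumulator() {
    return new ArrayList<>();
  }

  @Override
  public List<T> addInput(List<T> accumulator, T input) {
    accumulator.add(input); // appended, not reduced: the accumulator keeps growing
    return accumulator;
  }

  @Override
  public List<T> mergeAccumulators(Iterable<List<T>> accumulators) {
    // This is the "merge accumulator" step the question is about: any
    // accumulator arriving from another worker must first be deserialized.
    List<T> merged = new ArrayList<>();
    for (List<T> accumulator : accumulators) {
      merged.addAll(accumulator);
    }
    return merged;
  }

  @Override
  public List<T> extractOutput(List<T> accumulator) {
    return accumulator;
  }
}
```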

To minimize the problem of having to "pass around" these objects (during merging of accumulators) between nodes, we've created a SINGLE giant node (64 cores, tonnes of RAM).


So, in "theory", dataflow does not need to serialize the List object (and any of these big objects in the List) even during "merge accumulator" operations, since all the objects are on the same node. But, does dataflow still serialize even if all the objects of interest are on the same node or is it smart enough to know that an object is on the same node vs separate nodes?


Ideally, when objects are on the same node, we could just pass around references to the objects (rather than serializing/deserializing the contents of these objects, which can be very, very large). (I understand, of course, that when dealing with multiple nodes there's no choice but to serialize/deserialize, since the data has to be passed around somehow; but within a node, is Beam SDK 2.0 smart enough not to serialize/deserialize during these combine functions, group-bys, etc.?)


1 Solution

#1



The Dataflow service aggressively optimizes your pipeline to avoid needless serialization. The optimization you are interested in is fusion, described here in the Dataflow documentation. When data moves through a fused "stage" (a sequence of low-level instructions roughly corresponding to steps in your input pipeline), it is not serialized and deserialized.


However, if your CombineFn builds a list, and that list grows large, you should try to rephrase your pipeline to use a raw GroupByKey. Another important optimization is "combiner lifting" or "mapper-side combine" where your CombineFn is applied per-key locally prior to shuffling your data between machines, based on the assumption that the accumulator will be smaller than just a list of elements. So the whole list will be serialized, shuffled, and deserialized prior to completing the Combine transform. If, instead, you use a GroupByKey directly, your elements would be much more efficiently streamed, without serializing an entire list.

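To make that rephrasing concrete, here is a sketch assuming a keyed input PCollection<KV<String, BigObject>>, where BigObject is a hypothetical stand-in for the question's large objects and ListAccumulatingFn is the illustrative list-building CombineFn sketched in the question:

```java
import java.util.List;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class GroupInsteadOfCombine {

  // Hypothetical stand-in for the large objects described in the question.
  public static class BigObject implements java.io.Serializable {}

  // Combine path: combiner lifting applies the CombineFn per key on each
  // worker BEFORE the shuffle, so the List accumulator itself (which here
  // is no smaller than the raw elements) is what gets serialized, shuffled,
  // and deserialized.
  static PCollection<KV<String, List<BigObject>>> viaCombine(
      PCollection<KV<String, BigObject>> input) {
    return input.apply(
        Combine.<String, BigObject, List<BigObject>>perKey(
            new ListAccumulatingFn<BigObject>()));
  }

  // GroupByKey path: elements are shuffled individually, and the grouped
  // values arrive as a lazily streamed Iterable rather than one giant
  // serialized List.
  static PCollection<KV<String, Iterable<BigObject>>> viaGroupByKey(
      PCollection<KV<String, BigObject>> input) {
    return input.apply(GroupByKey.<String, BigObject>create());
  }
}
```

With the GroupByKey version, a downstream ParDo can typically consume the grouped Iterable lazily, so no worker ever has to materialize or serialize the full list at once.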

I should note that Beam's other runners also perform standard fusion optimization, among others. These all generally come from functional programming work in the late '80s / early '90s and were applied to distributed data processing in FlumeJava, circa 2010, so it is a baseline expectation now.

