我可以修改apache beam变换中的元素吗?

时间:2021-09-17 15:35:37

The Apache Beam programming guide contains the following rule:

Apache Beam编程指南包含以下规则:

3.2.2. Immutability

3.2.2。不变性

A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. A Beam Transform might process each element of a PCollection and generate new pipeline data (as a new PCollection), but it does not consume or modify the original input collection.

PCollection是不可变的。创建后,您无法添加,删除或更改单个元素。 Beam Transform可以处理PCollection的每个元素并生成新的管道数据(作为新的PCollection),但它不会消耗或修改原始输入集合。

Does this mean I cannot, must not, or should not modify individual elements in a custom transform? Specifically, I am using the python SDK and considering the case of a transform that takes a dict {key: "data"} as input, does some processing and adds further fields {other_key: "some more data"}. My interpretation of rule 3.2.2 above is that I should so something like

这是否意味着我不能,不能或不应该修改自定义转换中的单个元素?具体来说,我正在使用python SDK并考虑转换的情况,该转换采用dict {key:“data”}作为输入,进行一些处理并添加更多字段{other_key:“更多数据”}。我对上述规则3.2.2的解释是,我应该这样

def process(self,element):
    import copy
    output = copy.deepcopy(element)
    output[other_key] = some_data
    yield output

but I am wondering if this may be a bit overkill.

但我想知道这是否有点矫枉过正。

Using a TestPipeline, I found that the elements of the input collection are also modified if I act on them in the process() method (unless the elements are basic types such as int, float, bool...).

使用TestPipeline,我发现如果我在process()方法中对它们进行操作,那么输入集合的元素也会被修改(除非元素是基本类型,如int,float,bool ......)。

Is mutating elements considered an absolute no-go, or just a practice one has to be careful with ?

变异元素被认为是绝对禁止的,还是仅仅是一个必须小心的练习?

1 个解决方案

#1


3  

Mutating elements is an absolute no-go, and it can and will lead to violations of the Beam model semantics, i.e. to incorrect and unpredictable results. Beam Java direct runner intentionally detects mutations and fails pipelines that do that - this is not yet implemented in Python runner, but it should be.

变异元素是绝对禁止的,它可以并且将导致违反Beam模型语义,即不正确和不可预测的结果。 Beam Java direct runner有意识地检测到突变和管道失败 - 这在Python运行器中尚未实现,但它应该是。

The reason for that is, primarily, fusion. Eg. imagine that two DoFn's are applied to the same PCollection "C" (f(C) and g(C) - not f(g(C))), and a runner schedules them to run in the same shard. Imagine the first DoFn modifies the element, then by the time the second DoFn runs, the element has been changed - i.e. the second DoFn is not really being applied to "C". There are a number of other scenarios where mutations will lead to incorrect results.

其原因主要是融合。例如。想象两个DoFn被应用于相同的PCollection“C”(f(C)和g(C) - 而不是f(g(C))),并且跑步者将它们安排在相同的碎片中运行。想象一下,第一个DoFn修改元素,然后到第二个DoFn运行时,元素已被更改 - 即第二个DoFn实际上并未应用于“C”。还有许多其他情况,突变会导致错误的结果。

#1


3  

Mutating elements is an absolute no-go, and it can and will lead to violations of the Beam model semantics, i.e. to incorrect and unpredictable results. Beam Java direct runner intentionally detects mutations and fails pipelines that do that - this is not yet implemented in Python runner, but it should be.

变异元素是绝对禁止的,它可以并且将导致违反Beam模型语义,即不正确和不可预测的结果。 Beam Java direct runner有意识地检测到突变和管道失败 - 这在Python运行器中尚未实现,但它应该是。

The reason for that is, primarily, fusion. Eg. imagine that two DoFn's are applied to the same PCollection "C" (f(C) and g(C) - not f(g(C))), and a runner schedules them to run in the same shard. Imagine the first DoFn modifies the element, then by the time the second DoFn runs, the element has been changed - i.e. the second DoFn is not really being applied to "C". There are a number of other scenarios where mutations will lead to incorrect results.

其原因主要是融合。例如。想象两个DoFn被应用于相同的PCollection“C”(f(C)和g(C) - 而不是f(g(C))),并且跑步者将它们安排在相同的碎片中运行。想象一下,第一个DoFn修改元素,然后到第二个DoFn运行时,元素已被更改 - 即第二个DoFn实际上并未应用于“C”。还有许多其他情况,突变会导致错误的结果。