Dataflow: write to a file in the order of the PCollection

Date: 2022-03-24 15:37:24

I have a PCollection of KVs that contains a single element. Its key has no meaning, and its value is an Iterable of KVs. Each inner KV has a number as its key and an Iterable of Strings as its value. The PCollection is declared like this:

PCollection<KV<String, Iterable<KV<Long, Iterable<String>>>>>

I want to write this to a single file on a single machine, sorted by number: one row in the file for each string under each number.

With this PCollection I can write a ParDo whose processElement method receives all the numbers and their strings at once. There I can sort by number, iterate over the numbers, iterate over each number's strings, and output each number and string to the output collection.
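The sort-and-flatten step described above can be sketched in plain Java. This is only an illustration under assumptions: Map.Entry stands in for the SDK's KV type, the class and method names (SortedRows, toRows) are hypothetical, and in a real pipeline this logic would sit inside the ParDo's processElement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper; Map.Entry is a stand-in for KV.
final class SortedRows {
    /** Returns one "number : string" row per string, ordered by number. */
    static List<String> toRows(Iterable<Map.Entry<Long, Iterable<String>>> groups) {
        // Copy into a list first, because an Iterable cannot be sorted directly.
        List<Map.Entry<Long, Iterable<String>>> sorted = new ArrayList<>();
        for (Map.Entry<Long, Iterable<String>> group : groups) {
            sorted.add(group);
        }
        sorted.sort(Map.Entry.comparingByKey());

        // Strings keep the order in which their Iterable yields them.
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Long, Iterable<String>> group : sorted) {
            for (String value : group.getValue()) {
                rows.add(group.getKey() + " : " + value);
            }
        }
        return rows;
    }
}
```

The returned list is ordered in memory only; emitting its entries one by one into a PCollection does not carry that order along.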

However, when I write this collection to a file like this:

outputCollection.apply(TextIO.Write.withoutSharding().to(options.getOutputFilePath()));

The strings are not written in order of their number. The write appears to happen in parallel even though it runs locally on a single machine: although I added the "number : string" elements to the output collection in sorted order, the numbers are mixed up in the file.

How can I control the order in which TextIO.Write writes the records? Can I tell it to run in a single thread and keep the order in which the elements were added to the PCollection?

Thanks

1 answer

#1


The elements in a PCollection are unordered. This is closely related to the fact that the elements of a PCollection may be processed on different machines, and maintaining an order across machines would be difficult.

If you know that all of the data for a specific key fits on one machine, you could output a single element containing all the values, and then create a custom sink that writes that to a file.
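As a rough sketch of that idea, assuming the rows have already been sorted in memory: the rows are joined into one String (the single element), and one thread writes that String to the file. The names OrderedFileWriter, toSingleElement and write are hypothetical, and the code uses only the Java standard library; it is not the SDK's sink API, which would still have to wrap something like this.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of any Dataflow SDK.
final class OrderedFileWriter {
    /** Joins already-sorted rows into one newline-terminated value. */
    static String toSingleElement(List<String> rows) {
        StringBuilder joined = new StringBuilder();
        for (String row : rows) {
            joined.append(row).append('\n');
        }
        return joined.toString();
    }

    /** Writes the whole element in one call, so row order is preserved. */
    static void write(Path target, String element) {
        try {
            Files.write(target, element.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the order lives inside a single element, it no longer depends on how the elements of the PCollection are distributed or written.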
