I have a PCollection of KVs that contains a single key-value pair. The key has no meaning, and the value holds an Iterable of KVs. In each inner KV, the key is a number and the value is an Iterable of Strings. The PCollection is defined like this:
PCollection<KV<String, Iterable<KV<Long, Iterable<String>>>>>
I want to write a file on a single machine, sorted by number: one row in the file for each string under each number.
With this PCollection I can have a ParDo that receives all the numbers and their strings in its processElement method. There I can sort by number, iterate over the numbers, iterate over each number's strings, and output each number and string to the output collection.
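A minimal sketch of that sort-and-flatten step in plain Java, without the Dataflow SDK so it can run on its own. The class and method names (`SortAndFlatten`, `toSortedLines`) are hypothetical, and `Map.Entry` stands in for the SDK's `KV`; inside a real DoFn the same logic would sit in processElement and emit each line to the output instead of collecting it in a list.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class SortAndFlatten {
    // Hypothetical helper: Map.Entry<Long, Iterable<String>> stands in for
    // KV<Long, Iterable<String>>. Returns one "number : string" line per string,
    // ordered by number.
    static List<String> toSortedLines(Iterable<Map.Entry<Long, Iterable<String>>> groups) {
        // Copy into a list so the groups can be sorted by their numeric key.
        List<Map.Entry<Long, Iterable<String>>> sorted = new ArrayList<>();
        for (Map.Entry<Long, Iterable<String>> group : groups) {
            sorted.add(group);
        }
        sorted.sort(Map.Entry.comparingByKey());

        // Flatten: keep the strings of each number in their original order.
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Iterable<String>> group : sorted) {
            for (String s : group.getValue()) {
                lines.add(group.getKey() + " : " + s);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Map.Entry<Long, Iterable<String>>> groups = new ArrayList<>();
        groups.add(new AbstractMap.SimpleEntry<Long, Iterable<String>>(2L, Arrays.asList("c")));
        groups.add(new AbstractMap.SimpleEntry<Long, Iterable<String>>(1L, Arrays.asList("a", "b")));
        for (String line : toSortedLines(groups)) {
            System.out.println(line);
        }
    }
}
```

The list this produces is ordered, but that order is only guaranteed inside the one call that builds it.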
However, when I write this collection to a file like this:
outputCollection.apply(TextIO.Write.withoutSharding().to(options.getOutputFilePath()));
The strings are not written in number order. The write seems to happen in parallel, even though it runs locally on a single machine. Although I inserted the "number : string" entries into the output collection sorted by number, the numbers are mixed up in the file.
How can I control the order in which TextIO.Write writes the records? Can I tell it to run in a single thread and use the order in which the elements were inserted into the PCollection?
thanks
1 solution
#1
1
The elements in a PCollection are unordered. This is closely related to the fact that the elements of a PCollection may be processed on different machines, and maintaining an order across machines would be difficult.
If you know that all of the data for a specific key fits on one machine, you could output a single element containing all the values, and then create a custom sink that writes that to a file.
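As a rough illustration of the "single element" idea, the sketch below joins already-sorted lines into one string and writes it in one call. It uses only the Java standard library, not the Dataflow sink API; `SingleElementWriter`, `toSingleElement`, and `writeElement` are hypothetical names, and a real custom sink would put the write step behind the SDK's own sink interfaces.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class SingleElementWriter {
    // Hypothetical helper: packs all sorted lines into ONE value, so the order
    // lives inside the element rather than between elements.
    static String toSingleElement(List<String> sortedLines) {
        if (sortedLines.isEmpty()) {
            return "";
        }
        return String.join("\n", sortedLines) + "\n";
    }

    // Hypothetical stand-in for the write step of a custom sink: the whole
    // element is written by a single call, so its internal order is preserved.
    static void writeElement(String element, Path target) throws IOException {
        Files.write(target, element.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("sorted-output", ".txt");
        String element = toSingleElement(Arrays.asList("1 : a", "1 : b", "2 : c"));
        writeElement(element, out);
        System.out.println("Wrote " + out);
    }
}
```

This only works under the assumption stated above, that all the data fits in memory on one machine.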