Batching elements in a Google Cloud Dataflow pipeline

Date: 2022-04-15 15:26:44

I'm looking into grouping elements in the pipeline into batches of a fixed size.

In pseudocode:

PCollection[String].apply(Grouped.size(10))

Basically, this converts a PCollection[String] into a PCollection[List[String]] where each list contains 10 elements. Since this is a batch pipeline, if the element count doesn't divide evenly, the last batch contains the leftover elements.
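The desired semantics can be sketched in plain Python (this is only an illustration of the expected output, not a Beam transform):

```python
def batch(elements, batch_size):
    """Split a sequence into lists of batch_size elements; the final
    list holds any leftover elements when the count doesn't divide evenly."""
    batches = []
    for i in range(0, len(elements), batch_size):
        batches.append(elements[i:i + batch_size])
    return batches

print(batch(["a", "b", "c", "d", "e"], 2))  # → [['a', 'b'], ['c', 'd'], ['e']]
```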

I have two ugly ideas: windows with fake timestamps, or a GroupByKey using keys based on a random index to distribute elements evenly. But this seems like too complex a solution for such a simple problem.
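The synthetic-key idea can be sketched outside Beam as follows (hypothetical illustration; in a real pipeline the key assignment would be a ParDo followed by a GroupByKey):

```python
def batch_by_key(elements, batch_size):
    """Tag each element with a synthetic key (index // batch_size),
    group by that key, and emit the groups in key order."""
    groups = {}
    for i, e in enumerate(elements):
        groups.setdefault(i // batch_size, []).append(e)
    return [groups[k] for k in sorted(groups)]

print(batch_by_key(["a", "b", "c", "d", "e"], 2))  # → [['a', 'b'], ['c', 'd'], ['e']]
```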

1 answer

#1


This question is similar to a variety of questions on how to batch elements. Take a look at these to get started:

Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?

Partition data coming from CSV so I can process larger patches rather then individual lines
