I'm looking for a way to group the elements flowing through a pipeline into batches of a fixed size.
In pseudocode:
PCollection[String].apply(Grouped.size(10))
Basically, this converts a PCollection[String] into a PCollection[List[String]], where each list now contains 10 elements. Since this is a batch pipeline, if the element count doesn't divide evenly, the last batch would contain the leftover elements.
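In plain Python, the intended semantics can be sketched like this (a hypothetical `chunks` helper, not part of any SDK, just to pin down the expected behavior):

```python
def chunks(elements, size):
    # Split a list into consecutive batches of `size`; the last batch
    # holds the remainder when the length doesn't divide evenly.
    return [elements[i:i + size] for i in range(0, len(elements), size)]

# 25 elements with size 10 -> batches of 10, 10, and 5
batches = chunks(list(range(25)), 10)
```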
I have two ugly ideas: using windows with fake timestamps, or a GroupByKey with keys based on a random index to distribute elements evenly. But both seem too complex a solution for such a simple problem.
1 Answer
This question is similar to a variety of questions on how to batch elements. Take a look at these to get you started:

Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?

Partition data coming from CSV so I can process larger patches rather then individual lines
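The linked answers describe the common approach of buffering elements inside a DoFn and emitting them in lists. A minimal sketch of that logic, written here without the Beam dependency so it stays self-contained (in a real pipeline this class would subclass beam.DoFn and be applied with beam.ParDo; the class name `BatchElements` is illustrative):

```python
class BatchElements:
    """Buffers incoming elements and emits them as lists of batch_size."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.buffer = []

    def process(self, element):
        # Accumulate elements; once a full batch is ready, emit it.
        self.buffer.append(element)
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            yield batch

    def finish_bundle(self):
        # Emit whatever is left over as a final, possibly smaller batch.
        if self.buffer:
            batch, self.buffer = self.buffer, []
            yield batch
```

Note that bundle boundaries are runner-determined, so batches buffered this way may be flushed early at the end of a bundle; newer Apache Beam releases also provide a built-in GroupIntoBatches transform for keyed input.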