I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2)
and ("hello", "world")
, their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world"))
.
我对使用Google Cloud Dataflow非常陌生。我想获得两个PCollections的笛卡尔积。例如,如果我有两个PCollections(1,2)和(“hello”,“world”),他们的笛卡尔积是((1,“hello”),(1,“world”),(2,“hello” “),(2,”世界“))。
Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.
我有什么想法可以做到这一点?此外,由于笛卡尔积可能很大,我希望该解决方案可以懒得创建产品,从而避免大量内存消耗。
Thanks!
谢谢!
1 个解决方案
#1
4
In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String>
into a side input, and then you would have a ParDo
that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String>
of all the values, and you'd output the pairs (or you could in this DoFn
choose to output only the pairs that line up).
通常,计算笛卡尔积将是昂贵的。如果集合中的任何一个(或两个)适合内存,则可以使用侧输入将数据广播给所有工作者。因此,对于您的示例,您将PCollection
This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.
这将重复遍历整个单词集 - 如果它适合内存,这应该没问题。如果每次出现问题时都必须重新获取侧输入数据。
Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection
of values keyed by the 3-letter prefixes. You can also create the similar PCollection
keyed by 3-letter suffixes. Then you can GroupByKey
(or CoGroupByKey
). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.
另一种方法是依靠改组和密钥。假设您要查找具有3个字母重叠的单词。您可以处理字典并生成由3个字母前缀键入的PCollection值。您还可以创建由3个字母后缀键入的类似PCollection。然后你可以使用GroupByKey(或CoGroupByKey)。在那之后,你有每个3个字母的键,所有的单词作为前缀,并作为后缀。
#1
4
In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String>
into a side input, and then you would have a ParDo
that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String>
of all the values, and you'd output the pairs (or you could in this DoFn
choose to output only the pairs that line up).
通常,计算笛卡尔积将是昂贵的。如果集合中的任何一个(或两个)适合内存,则可以使用侧输入将数据广播给所有工作者。因此,对于您的示例,您将PCollection
This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.
这将重复遍历整个单词集 - 如果它适合内存,这应该没问题。如果每次出现问题时都必须重新获取侧输入数据。
Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection
of values keyed by the 3-letter prefixes. You can also create the similar PCollection
keyed by 3-letter suffixes. Then you can GroupByKey
(or CoGroupByKey
). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.
另一种方法是依靠改组和密钥。假设您要查找具有3个字母重叠的单词。您可以处理字典并生成由3个字母前缀键入的PCollection值。您还可以创建由3个字母后缀键入的类似PCollection。然后你可以使用GroupByKey(或CoGroupByKey)。在那之后,你有每个3个字母的键,所有的单词作为前缀,并作为后缀。