How to get the Cartesian product of two PCollections

Date: 2022-01-05 15:34:45

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).


Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.


Thanks!


1 Answer

#1

Score: 4

In general, computing the Cartesian product will be expensive. If either (or both) of the collections fits in memory, you can use side inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String> into a side input, and then you'd have a ParDo that takes the other collection as its main input. For each string on the main input, you'd access the side input, which holds an Iterable<String> of all the values, and output the pairs (or, in this DoFn, you could choose to output only the pairs that line up).

This will re-iterate over the entire set of words for each main-input element; if the side input fits in memory, that should be fine. If the side-input data has to be re-fetched every time, it could be problematic.

Another approach is to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produce a PCollection of words keyed by their 3-letter prefixes, and a similar PCollection keyed by their 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, for each 3-letter key you have all the words with that prefix and all the words with that suffix.
