如何获得两个PCollections的笛卡尔积

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).

我对使用Google Cloud Dataflow非常陌生。我想获得两个PCollections的笛卡尔积。例如，如果我有两个PCollections（1,2）和（“hello”，“world”），他们的笛卡尔积是（（1，“hello”），（1，“world”），（2，“hello” “），（2，”世界“））。

Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.

我有什么想法可以做到这一点？此外，由于笛卡尔积可能很大，我希望该解决方案可以懒得创建产品，从而避免大量内存消耗。

Thanks!

谢谢！

1 个解决方案

#1

In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String> into a side input, and then you would have a ParDo that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String> of all the values, and you'd output the pairs (or you could in this DoFn choose to output only the pairs that line up).

通常，计算笛卡尔积将是昂贵的。如果集合中的任何一个（或两个）适合内存，则可以使用侧输入将数据广播给所有工作者。因此，对于您的示例，您将PCollection 转换为侧输入，然后您将使用ParDo将其作为主输入。对于主输入上的每个字符串，您可以访问具有所有值的Iterable 的side-input，并输出对（或者您可以在此DoFn中选择仅输出排列的对）。

This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.

这将重复遍历整个单词集 - 如果它适合内存，这应该没问题。如果每次出现问题时都必须重新获取侧输入数据。

Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection of values keyed by the 3-letter prefixes. You can also create the similar PCollection keyed by 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.

另一种方法是依靠改组和密钥。假设您要查找具有3个字母重叠的单词。您可以处理字典并生成由3个字母前缀键入的PCollection值。您还可以创建由3个字母后缀键入的类似PCollection。然后你可以使用GroupByKey（或CoGroupByKey）。在那之后，你有每个3个字母的键，所有的单词作为前缀，并作为后缀。

#1

This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.

这将重复遍历整个单词集 - 如果它适合内存，这应该没问题。如果每次出现问题时都必须重新获取侧输入数据。

秒客网

如何获得两个PCollections的笛卡尔积

1 个解决方案

#1

#1

相关文章