Output multiple text files from PCollection<KV<String, String>>

Time: 2021-03-27 15:35:19

How do I output to multiple files from PCollection<KV<String, String>>?

The key in each entry is the file name. The GroupByKey transform gives me a PCollection<KV<String, Iterable<String>>>, but how can I write its elements to multiple files?

For example, given the following input

<file1, value1>
<file2, value2>
<file1, value3>

I'd like to output two files

file1:
  value1
  value3

file2:
  value2

1 solution

#1


2  

Dataflow currently does not have a transform that can do this for you. As a workaround, you can use a simple DoFn that extracts the file name from each KV, opens that file using IOChannelFactory, and writes the Iterable<String> to it.
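
The write step described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Dataflow API: `java.nio.file` stands in for IOChannelFactory, the hypothetical `writeGroup` method stands in for the body of the DoFn's element-processing method, and the hypothetical `groupByKey` helper stands in for the GroupByKey transform. In a real pipeline you would keep the same loop but obtain the output channel from IOChannelFactory instead.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WritePerKeyFiles {

    // Stand-in for the DoFn body (hypothetical name): receives one grouped
    // element, i.e. a file name plus its values, and writes one value per line.
    // java.nio.file is used here in place of IOChannelFactory.
    static void writeGroup(Path outputDir, String fileName, Iterable<String> values)
            throws IOException {
        Path target = outputDir.resolve(fileName);
        // The default open options create the file or truncate an existing one,
        // so running the same write twice leaves the same content.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String value : values) {
                writer.write(value);
                writer.newLine();
            }
        }
    }

    // Stand-in for GroupByKey (hypothetical helper): collects the values for
    // each key, preserving the order in which they appear in the input.
    static Map<String, List<String>> groupByKey(List<Map.Entry<String, String>> input) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, String> kv : input) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) throws IOException {
        // The example input from the question.
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("file1", "value1"),
                Map.entry("file2", "value2"),
                Map.entry("file1", "value3"));

        Path outputDir = Files.createTempDirectory("per-key-output");
        for (Map.Entry<String, List<String>> group : groupByKey(input).entrySet()) {
            writeGroup(outputDir, group.getKey(), group.getValue());
        }
        System.out.println("Wrote files to " + outputDir);
    }
}
```

Because a distributed runner may process the same element more than once after a failure, it helps that the write truncates the target file rather than appending to it: a repeated attempt then overwrites the earlier output instead of duplicating lines.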

See also a similar question and another one.

We plan to address this (see https://issues.apache.org/jira/browse/BEAM-92), but there is no concrete timeline yet.
