Multiple files generated when writing XML with Apache Beam

Time: 2022-11-01 15:35:41

I'm trying to write an XML file whose source is a text file stored in GCS. The code runs fine, but instead of a single XML file it generates multiple XML files. (The number of XML files seems to match the total number of records in the source text file.) I observed this while using 'DataflowRunner'.

When I run the same code locally, two files are generated. The first contains all the records with the proper elements, and the second contains only the opening and closing root element.

Any idea what causes this unexpected behaviour? Please find below the code snippet I'm using:

// Read the source text file from GCS, one record per line.
PCollection<String> input_records = p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));

// Convert each comma-separated line into an XMLFormatter object.
PCollection<XMLFormatter> input_object = input_records.apply(ParDo.of(new DoFn<String, XMLFormatter>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] elements = c.element().split(",");
        c.output(new XMLFormatter(elements[0], elements[1], elements[2], elements[3], elements[4]));
        System.out.println("Values to be written have been provided to constructor");
    }
})).setCoder(AvroCoder.of(XMLFormatter.class));

// Write the records as XML under the given path prefix.
input_object.apply(XmlIO.<XMLFormatter>write()
    .withRecordClass(XMLFormatter.class)
    .withRootElement("library")
    .to("gs://balajee_test/book_output"));

Please let me know how to generate a single XML file (book_output.xml) as output.

1 solution

#1


XmlIO.write().to() is documented as follows:

/**
 * Writes to files with the given path prefix.
 *
 * <p>Output files will have the name {@literal {filenamePrefix}-0000i-of-0000n.xml} where n is
 * the number of output bundles.
 */

In other words, multiple output files are expected: for example, if the runner chooses to parallelize processing of your data into 3 tasks ("bundles"), you'll get 3 files. Some of the parts may turn out empty in some cases, but together the files will always contain exactly the expected data.
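
To make the naming pattern concrete, here is a minimal plain-Java sketch that prints the file names you would expect for a given prefix and bundle count. It is not Beam code: the class and method names (ShardNames, shardName, allShardNames) are hypothetical, and the five-digit zero padding and zero-based index are assumptions read off the `0000i-of-0000n` pattern in the Javadoc above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Apache Beam: it only illustrates the
// documented "{filenamePrefix}-0000i-of-0000n.xml" naming pattern.
final class ShardNames {

    // Builds the name of one output file.
    // Assumption: index is zero-based and both numbers are padded to 5 digits.
    static String shardName(String prefix, int index, int total) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d.xml", prefix, index, total);
    }

    // Lists the names of all output files for the given number of bundles.
    static List<String> allShardNames(String prefix, int total) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            names.add(shardName(prefix, i, total));
        }
        return names;
    }

    public static void main(String[] args) {
        // With 3 bundles, three file names are printed, one per bundle.
        for (String name : allShardNames("gs://balajee_test/book_output", 3)) {
            System.out.println(name);
        }
    }
}
```

With the prefix from the question and 3 bundles, this lists three names under gs://balajee_test/, which is why a single book_output.xml never appears: the runner, not the pipeline code, decides how many bundles there are.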

Asking the IO to produce exactly one file is a reasonable request if your data is not particularly big. It is supported in TextIO and AvroIO via .withoutSharding(), but not yet supported in XmlIO. Please feel free to file a JIRA with the feature request.
