Reading complex XML with the Beam Python SDK

Date: 2021-10-26 15:35:36

How do I best go about writing a Source for the Python SDK that reads a nested XML file and splits its content into multiple rows? The existing sources all work at the line level, which is not what I need for my XML.

I have a bunch of XML files, and each file represents one transaction that has to be broken down into multiple records (order lines, payments, etc.).

1 Solution

#1


You can use the source for reading TensorFlow records as a model for writing your own source: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/tfrecordio.py

You can use Python to parse the XML into elements.
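
As a minimal sketch of that parsing step, the snippet below uses the standard library's `xml.etree.ElementTree` to turn one transaction document into a single nested record. The XML layout (`transaction`, `orderLine`, `payment` and their attributes) and the function name `parse_transaction` are assumptions made up for illustration, since the question does not show the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical transaction layout; the real schema is not given in the question.
SAMPLE_XML = """\
<transaction id="T1">
  <orderLines>
    <orderLine sku="A-100" quantity="2" price="9.50"/>
    <orderLine sku="B-200" quantity="1" price="20.00"/>
  </orderLines>
  <payments>
    <payment method="card" amount="39.00"/>
  </payments>
</transaction>
"""


def parse_transaction(xml_text):
    """Parse one transaction document into a single nested dict."""
    root = ET.fromstring(xml_text)
    return {
        "transaction_id": root.get("id"),
        # iter() walks the whole tree, so nesting depth does not matter here.
        "order_lines": [
            {
                "sku": line.get("sku"),
                "quantity": int(line.get("quantity")),
                "price": float(line.get("price")),
            }
            for line in root.iter("orderLine")
        ],
        "payments": [
            {"method": p.get("method"), "amount": float(p.get("amount"))}
            for p in root.iter("payment")
        ],
    }


transaction = parse_transaction(SAMPLE_XML)
```

A function like this could be called from whatever code reads each file's contents, since every file holds exactly one transaction.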

Please keep in mind that a source will write to a PCollection that must contain only one type of element, so your source cannot emit some payment records and some order records. You'll need to either emit a single transaction record or create a wrapper around each record sub-type and filter on the contents later.
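
The wrapper option could look roughly like the sketch below: every emitted element has the same type (`Record`), and a `kind` field says which sub-type it carries so it can be filtered later. The `Record` type, the `iter_records` function, and the XML layout are all hypothetical names chosen for this example, not part of Beam or of the question.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, NamedTuple

# Hypothetical transaction layout, assumed only for this example.
SAMPLE_XML = """\
<transaction id="T1">
  <orderLine sku="A-100" quantity="2"/>
  <orderLine sku="B-200" quantity="1"/>
  <payment method="card" amount="39.00"/>
</transaction>
"""


class Record(NamedTuple):
    """Uniform wrapper so all emitted elements share one type."""
    kind: str            # "order_line" or "payment"
    transaction_id: str  # links the record back to its transaction
    fields: dict         # raw attributes of the wrapped XML element


def iter_records(xml_text) -> Iterator[Record]:
    """Yield one wrapped record per order line and per payment."""
    root = ET.fromstring(xml_text)
    tx_id = root.get("id")
    for line in root.iter("orderLine"):
        yield Record("order_line", tx_id, dict(line.attrib))
    for payment in root.iter("payment"):
        yield Record("payment", tx_id, dict(payment.attrib))


records = list(iter_records(SAMPLE_XML))

# Filtering on the wrapper's kind separates the sub-types again.
order_lines = [r for r in records if r.kind == "order_line"]
payments = [r for r in records if r.kind == "payment"]
```

In a pipeline, the list comprehensions at the end would presumably become a `beam.Filter` on `kind` applied to the PCollection the source produces.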
