Reading the files inside a zip one by one

Time: 2022-09-21 15:33:23

Suppose I have a zip containing N files. I want to process each file one by one using Dataflow. Is this possible?

I need to process each file in the zip and dump its data into a BigQuery table, so each file will be dumped into a separate BigQuery table.

I tried reading a zip file using Dataflow, but it reads everything in it at once. I need to be able to differentiate between the individual files in the zip.

Thank You

1 solution

#1

I think you can write one DoFn that reads the catalog of files and outputs (filename, zipfile) pairs, or (offset, zipfile) pairs. The downstream step will then receive the pairs sharded across different workers, allowing you to load the individual files from the zip in parallel.
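
A minimal sketch of that approach with the Apache Beam Python SDK (the SDK Dataflow runs) could look like the following. The bucket path, the DoFn names, and the choice to read the whole archive into memory are my own assumptions, not part of the answer; very large archives would need a more careful strategy.

```python
import io
import zipfile

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems


def _open_zip(zip_path):
    # Read the whole archive into memory so ZipFile gets a seekable stream.
    # Fine for modest archives; huge zips would need ranged reads instead.
    with FileSystems.open(zip_path, compression_type=CompressionTypes.UNCOMPRESSED) as f:
        return zipfile.ZipFile(io.BytesIO(f.read()))


class ListZipEntries(beam.DoFn):
    """Reads the zip catalog and emits one (entry_name, zip_path) pair per file."""
    def process(self, zip_path):
        with _open_zip(zip_path) as zf:
            for name in zf.namelist():
                yield (name, zip_path)


class ReadZipEntry(beam.DoFn):
    """Decompresses just one entry and emits its lines, keyed by the entry name."""
    def process(self, element):
        name, zip_path = element
        with _open_zip(zip_path) as zf, zf.open(name) as entry:
            for line in io.TextIOWrapper(entry, encoding="utf-8"):
                yield (name, line.rstrip("\n"))


with beam.Pipeline() as p:
    (
        p
        | "ZipPath" >> beam.Create(["gs://my-bucket/archive.zip"])  # placeholder path
        | "ListEntries" >> beam.ParDo(ListZipEntries())
        | "Fanout" >> beam.Reshuffle()   # spread the (name, path) pairs across workers
        | "ReadEntry" >> beam.ParDo(ReadZipEntry())
        | "Debug" >> beam.Map(print)
    )
```

The Reshuffle step is what actually lets the pairs land on different workers; without it the runner may fuse the listing and reading steps together on a single worker.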

I assume there is an API to (1) list the files in the zip and (2) unzip just the specific file that you want. Hopefully this approach will work.
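
For what it's worth, Python's standard zipfile module provides exactly those two operations, so no extra dependency is needed. A quick local illustration (the archive and entry names are made up):

```python
import zipfile

# (1) list the files in the zip, then (2) decompress only the entry you need.
with zipfile.ZipFile("archive.zip") as zf:          # hypothetical local path
    print(zf.namelist())                            # e.g. ['orders.csv', 'users.csv']
    with zf.open("orders.csv") as entry:            # extracts just this one entry
        print(entry.readline())
```

For an archive sitting in Cloud Storage you would first open it through Beam's FileSystems, as in the sketch above, since ZipFile needs a seekable, file-like object.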

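As for the "separate BigQuery table per file" part of the question, which the answer above does not address: one possible approach (my assumption, not something stated in the answer) is to keep the entry name on every row and pass a callable as the table argument of WriteToBigQuery, so that each element is routed to a table derived from the file it came from. The project, dataset, and schema below are placeholders.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # Stand-in for the output of the zip-reading step, reshaped into dicts
    # that carry the name of the file each row came from.
    rows = p | beam.Create([
        {"source_file": "orders_csv", "line": "a,b,c"},
        {"source_file": "users_csv", "line": "x,y,z"},
    ])
    _ = rows | beam.io.WriteToBigQuery(
        # Route each row to a table named after its source file.
        table=lambda row: "my-project:my_dataset." + row["source_file"],
        schema="source_file:STRING,line:STRING",
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    )
```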