Is there a way to read all files except a defined list of files in Python Apache Beam?

Asked: 2021-06-30 15:35:44

My use case is that I am batch processing files in a bucket that is constantly being updated with new files. I don't want to process CSV files that have already been processed.


Is there a way to do that?


One potential solution I thought of is to have a text file that maintains a list of processed files, and then read all CSV files excluding the files in that processed list. Is that possible?

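A minimal sketch of that bookkeeping idea, assuming the ledger is a plain text file with one already-processed filename per line (all bucket paths below are hypothetical):

```python
# Sketch of the processed-files ledger idea (hypothetical paths).
# The ledger holds one already-processed filename per line; subtract it
# from the current bucket listing to get the files still to be read.

def unprocessed_files(all_files, ledger_lines):
    """Return the files that do not yet appear in the ledger."""
    processed = {line.strip() for line in ledger_lines if line.strip()}
    return sorted(f for f in all_files if f not in processed)

listing = ["gs://bucket/a.csv", "gs://bucket/b.csv", "gs://bucket/c.csv"]
ledger = ["gs://bucket/a.csv\n"]
print(unprocessed_files(listing, ledger))  # only b.csv and c.csv remain
```

After a successful run, the pipeline (or a wrapper script) would append the newly processed filenames to the ledger so the next batch skips them.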

Or is it possible to read a list of specific files?


1 Answer

#1



There's not a good built-in way to do this, but you can have one stage of your pipeline compute the list of files to read, as you suggested, then use a DoFn that maps each filename to the contents of that file. See "Reading multiple .gz file and identifying which row belongs to which file" for information about how to write this DoFn.

