TextIO.Write - does it append to or replace output files? (Google Cloud Dataflow)

Time: 2022-03-14 15:25:16

I cannot find any documentation on it, so I wonder what is the behavior if the output files already exist (in a gs:// bucket)?

Thanks, G

1 solution

#1 (6 votes)

The files will be overwritten. There are several motivations for this:

  • The "report-like" use case (compute a summary of the input data and put the results on GCS) seems to be a lot more frequent than the use case where you are producing data incrementally and putting more of it onto GCS with each execution of the pipeline.
  • “类似报告”的用例(计算输入数据的摘要并将结果放在GCS上)似乎比用于逐步生成数据并在每次执行时将更多数据放入GCS的用例更频繁管道。
  • It is good if rerunning a pipeline is idempotent(-ish?). E.g. if you find a bug in your pipeline, you can just fix it and rerun it, and enjoy the overwritten correct results. A pipeline that appends to files would be very difficult to work with in this matter.
  • 如果重新运行管道是幂等的(-ish?),那就好了。例如。如果你在管道中发现了一个错误,你可以修复它并重新运行它,并享受覆盖正确的结果。在这个问题上,附加到文件的管道很难处理。
  • It is not required to specify the number of output shards for TextIO.Write; it can slightly differ between different executions, even for exactly the same pipeline and the same input data. The semantics of appending in that case would be very confusing.
  • 不需要为TextIO.Write指定输出分片的数量;它可以在不同的执行之间略有不同,即使对于完全相同的管道和相同的输入数据也是如此。在这种情况下附加的语义会非常混乱。
  • Appending is, as far as I know, impossible to implement efficiently using any filesystem I'm aware of, while preserving the atomicity and fault tolerance guarantees (e.g. that you produce all output or none of it, even in the face of bundle re-executions due to failures).
  • 据我所知,追加是不可能使用我所知道的任何文件系统有效地实现,同时保留原子性和容错保证(例如,即使面对捆绑重新生成所有输出也不产生任何输出由于失败而执行。
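To make the sharding point concrete, here is a minimal Python sketch (it does not use Beam; the `output-00000-of-00002`-style file names merely mimic TextIO.Write's default shard naming, which is an assumption of this illustration). It shows why appending across runs with a varying shard count would leave stale data behind, while overwrite-on-rerun keeps results idempotent:

```python
import os
import tempfile

def write_sharded(out_dir, prefix, lines, num_shards):
    """Mimic TextIO.Write: split `lines` across shards, replacing prior output."""
    # Delete shards left by a previous run first, so a rerun that happens
    # to use fewer shards does not leave stale shard files behind.
    for name in os.listdir(out_dir):
        if name.startswith(prefix + "-"):
            os.remove(os.path.join(out_dir, name))
    for shard in range(num_shards):
        path = os.path.join(out_dir, f"{prefix}-{shard:05d}-of-{num_shards:05d}")
        with open(path, "w") as f:  # mode "w" truncates: overwrite, never append
            f.write("\n".join(lines[shard::num_shards]) + "\n")

def read_all(out_dir, prefix):
    """Read every shard back and return the combined records, sorted."""
    lines = []
    for name in sorted(os.listdir(out_dir)):
        if name.startswith(prefix + "-"):
            with open(os.path.join(out_dir, name)) as f:
                lines.extend(f.read().splitlines())
    return sorted(lines)

out_dir = tempfile.mkdtemp()
# First run: the runner picks 3 shards.
write_sharded(out_dir, "output", ["a", "b", "c", "d"], num_shards=3)
# Rerun of the same pipeline on the same input, but with 2 shards this time.
write_sharded(out_dir, "output", ["a", "b", "c", "d"], num_shards=2)
print(read_all(out_dir, "output"))  # each record appears exactly once
```

If the second run had appended (or merely overwritten shards by name without clearing old ones), the leftover `...-of-00003` files would duplicate records, which is exactly the confusing semantics the answer describes.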

This behavior will be documented in the next version of the SDK published on GitHub.
