Google Cloud Dataflow: can the mixed zero-based and one-based indexing in output file names be fixed?

My team and I are starting to use Google Cloud Dataflow to run our jobs remotely rather than locally on our computers. We started with the juliaset example in Python to make sure that a deployment worked.

It did complete on Google Cloud Dataflow even though it took longer than it did on my local machine.

The issue is that the output file names mix zero-based and one-based indexing in the same name, which did not make sense to us.

We think ending at 00008-of-00008 or 00009-of-00009 makes more sense than ending at 00008-of-00009. Is there any way to fix this so that the left and right numbers match on the last file?

1 Answer

#1

With the 0000X-of-0000Y format, Beam is expressing an index-of-count. The number on the right is the total number of shards, while the number on the left is a zero-based index. With nine shards, the files therefore run from 00000-of-00009 through 00008-of-00009.
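
To make the naming concrete, here is a minimal sketch in plain Python that imitates the index-of-count pattern. It does not call Beam at all; the function name `shard_file_names` and the `output` prefix are made up for illustration.

```python
def shard_file_names(prefix, num_shards, suffix=""):
    """Imitate index-of-count naming: zero-based shard index, then total shard count."""
    return [
        f"{prefix}-{index:05d}-of-{num_shards:05d}{suffix}"
        for index in range(num_shards)
    ]


# Hypothetical prefix; nine shards give indices 0 through 8.
names = shard_file_names("output", 9)
print(names[0])   # output-00000-of-00009
print(names[-1])  # output-00008-of-00009
```

The left number never reaches the right one because it counts from zero, in the same way that the last valid index of a nine-element list is 8.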

Changing this behavior is not currently (easily) supported by the sinks in Apache Beam. To add it yourself, you would have to modify the code in Apache Beam itself (specifically, around here).
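
If patching Beam is not appealing, one possible workaround is to rename the files after the job finishes. The sketch below only computes the new name from the old one; `to_one_based` and `SHARD_PATTERN` are hypothetical helpers, not part of Beam, and the actual rename or copy (for example in Cloud Storage) is left out.

```python
import re

# Matches the default-looking shard part of a name, e.g. "00008-of-00009".
SHARD_PATTERN = re.compile(r"(\d{5})-of-(\d{5})")


def to_one_based(file_name):
    """Return file_name with its zero-based shard index shifted to one-based."""
    def bump(match):
        index = int(match.group(1))
        total = int(match.group(2))
        return f"{index + 1:05d}-of-{total:05d}"

    # Only rewrite the first shard marker found in the name.
    return SHARD_PATTERN.sub(bump, file_name, count=1)


print(to_one_based("output-00008-of-00009"))  # output-00009-of-00009
```

If you rename in place, process the files from the highest index down (or copy them to a new location), because the new name of one shard is the old name of the next.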

Hope this helps.
