Is Google Dataflow's write to BQ / BT atomic per job?

Date: 2021-07-09 15:22:41

Maybe I am just bad at searching, but I couldn't find an answer in the documentation, so I want to try my luck here.

So my question is: say I have a Dataflow job that writes to BigQuery or Bigtable, and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?

I know that writing to GCS does not seem to be atomic: partial output partitions are produced along the way while the job is running.

However, I have tried dumping data into BQ with Dataflow, and it seems that the output table is not exposed to users until the job reports success.

3 Answers

#1 (1 vote)

I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails partway through will write partial data into Bigtable.

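Because atomicity stops at the row, one common mitigation (an approach you add yourself, not something Dataflow or Bigtable does for you) is to make the writes idempotent, so rerunning a failed job overwrites what the first attempt left behind instead of duplicating it. A minimal sketch with the Cloud Bigtable Java client, assuming hypothetical project, instance, table, and row-key names: a deterministic row key plus an explicit cell timestamp makes a replayed mutation byte-identical to the original.

    import com.google.cloud.bigtable.data.v2.BigtableDataClient;
    import com.google.cloud.bigtable.data.v2.models.RowMutation;

    public class IdempotentBigtableWrite {
      public static void main(String[] args) throws Exception {
        // "my-project", "my-instance", "my-table" are placeholders.
        try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {
          // Deterministic row key derived from the record, and an explicit
          // timestamp (microseconds), so a rerun writes the exact same cell
          // instead of stacking a new version on top.
          long timestampMicros = 1_625_000_000_000_000L;
          RowMutation mutation =
              RowMutation.create("my-table", "event#12345")
                  .setCell("cf", "payload", timestampMicros, "value-for-event-12345");
          // Each mutateRow call is atomic, but only for this single row.
          client.mutateRow(mutation);
        }
      }
    }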

#2 (3 votes)

In batch mode, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):

  1. Write all data to a temporary directory on GCS.
  2. Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.

If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.

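For illustration, step 2 can be reproduced outside Dataflow with the standalone BigQuery Java client. This is only a sketch with hypothetical bucket, dataset, and table names, not what Dataflow runs internally, but it shows the atomicity being relied on: the load job either commits the rows from every listed temp file or commits nothing.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    public class AtomicLoadJob {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder dataset/table and temp-file location.
        LoadJobConfiguration config =
            LoadJobConfiguration.newBuilder(
                    TableId.of("my_dataset", "my_table"),
                    "gs://my-temp-bucket/temp-files/*.json")
                .setFormatOptions(FormatOptions.json())
                .build();
        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor();
        // The load commits as a unit: on failure, the table is unchanged.
        if (job.getStatus().getError() != null) {
          System.err.println("Load failed, no rows written: " + job.getStatus().getError());
        }
      }
    }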

As noted in William V's answer, each BigQuery load job is atomic. The load job will either succeed or fail, and if it fails no data will be written to BigQuery.

For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried, we will still have exactly-once write semantics to BigQuery.

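The deduplication this enables can be sketched with the same client. The helper below is hypothetical (the real logic lives inside the Dataflow service), but the mechanism is real: BigQuery will not create a second job under an id that has already been used, so a retry can fall back to looking up the job the earlier attempt created.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryException;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobId;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;

    public class DeterministicJobId {
      // Submit a load job under a fixed, deterministic id; if a previous
      // attempt already created it, reuse that job instead of loading twice.
      static Job submitOnce(BigQuery bigquery, LoadJobConfiguration config, String idString) {
        JobId jobId = JobId.of(idString); // e.g. "dataflow_job_12423423"
        try {
          return bigquery.create(JobInfo.newBuilder(config).setJobId(jobId).build());
        } catch (BigQueryException alreadyExists) {
          // The duplicate create is rejected, so the first job stands;
          // this is what gives retries exactly-once semantics.
          return bigquery.getJob(jobId);
        }
      }
    }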

Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In the common case you have only one such write in your job, so if the job succeeds the data will be in BigQuery, and if the job fails no data will be written.

However: note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails. This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks, to ensure correctness in the presence of committed writes from the earlier failed run; one mitigation is sketched below.

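One way to make such a rerun safe, if it fits your data model, is to give each sink a truncating write disposition so the rerun atomically replaces whatever the earlier failed run already committed to that table. A sketch in the current Apache Beam Java API (the question's BigQueryIO.Write.to(...) is the older SDK spelling; the table name is a placeholder):

    import com.google.api.services.bigquery.model.TableFieldSchema;
    import com.google.api.services.bigquery.model.TableRow;
    import com.google.api.services.bigquery.model.TableSchema;
    import java.util.Arrays;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;

    public class RerunSafeSink {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        TableSchema schema = new TableSchema().setFields(Arrays.asList(
            new TableFieldSchema().setName("id").setType("INTEGER"),
            new TableFieldSchema().setName("name").setType("STRING")));

        p.apply("MakeRows", Create.of(
                new TableRow().set("id", 1).set("name", "a"),
                new TableRow().set("id", 2).set("name", "b"))
            .withCoder(TableRowJsonCoder.of()))
         .apply("WriteSink", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table") // placeholder
            .withSchema(schema)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            // WRITE_TRUNCATE replaces the table contents in one atomic load,
            // so leftovers from a failed earlier run are overwritten.
            .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));

        p.run();
      }
    }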

#3 (1 vote)

BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs:

Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.

Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of a Dataflow job that might have created the BigQuery job. For example, if your Dataflow job fails but it has written to BigQuery before failing (and that BigQuery job is complete), then the data will remain in BigQuery.

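As a practical consequence, after a failed Dataflow run it can be worth checking what the completed BigQuery jobs already committed before you rerun. A minimal sketch, assuming a hypothetical dataset and table:

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.QueryJobConfiguration;

    public class CheckCommittedRows {
      public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder dataset/table: count what earlier, completed load jobs left behind.
        bigquery.query(QueryJobConfiguration.of(
                "SELECT COUNT(*) AS n FROM `my_dataset.my_table`"))
            .iterateAll()
            .forEach(row -> System.out.println("rows committed: " + row.get("n").getLongValue()));
      }
    }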
