I have a question about the WRITE_TRUNCATE behaviour in BigQuery.
I have a BigQuery table (T1) to which I periodically append log data (one row per log line). I want a Dataflow job (D1) that reads from this table, removes any duplicate rows, performs other data-cleansing operations, and then outputs the result to another BigQuery table (T2), replacing any data that may already be present in that table. I believe I can do this by using the WRITE_TRUNCATE write disposition on the BigQuery.IO sink within the Dataflow job.
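For reference, configuring the sink this way might look like the following sketch, assuming the Beam/Dataflow Java SDK; the project, dataset, and table names are placeholders:

```java
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.values.PCollection;
import com.google.api.services.bigquery.model.TableRow;

// cleansedRows is the PCollection produced by the de-duplication /
// cleansing transforms in job D1 (names here are illustrative).
PCollection<TableRow> cleansedRows = /* ... upstream transforms ... */ null;

cleansedRows.apply("WriteToT2",
    BigQueryIO.writeTableRows()
        // Destination table T2 (placeholder identifiers).
        .to("my-project:my_dataset.T2")
        // Replace any existing data in T2 on each run.
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
```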
The question is: if I have another Dataflow job (D2) reading from table T2 while job D1 is in the middle of a write-truncate to that table, what data does D2 see? That is, does it see the table either in the state it was in before the truncate or after the truncate has finished? Or can it see the table at some intermediate step of the truncate (e.g. part way through appending the new data)?
The javadoc linked above suggests that the truncate may not be atomic, while the REST documentation for BigQuery suggests that it is.
1 solution
#1
The REST API is actually the source of truth here, i.e. the change is atomic upon the BigQuery job's successful completion.
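In other words, when a BigQuery job runs with `writeDisposition: "WRITE_TRUNCATE"`, readers of T2 see either the data as it was before the job completed or the full replacement data afterwards, never a partially written table. At the REST level this disposition is set in the job's configuration; an illustrative `jobs.insert` request body (project, dataset, table, and query are placeholders) might look like:

```json
{
  "configuration": {
    "query": {
      "query": "SELECT DISTINCT * FROM my_dataset.T1",
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "my_dataset",
        "tableId": "T2"
      },
      "writeDisposition": "WRITE_TRUNCATE"
    }
  }
}
```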