Google Dataflow pipeline is stalling

Time: 2021-07-01 15:35:44

Yesterday I started a job on Google Dataflow that usually runs for about 10-30 minutes. It was still running this morning. Looking into Stackdriver, I saw the same sequence of log entries repeating for the job:

I  Refused to split GroupingShuffleReader <at position ShufflePosition(base64:AAAABOA3nVgAAQ) of shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAABOA3nVkAAQ))> at ShufflePosition(base64:AAAABOA3nVkAAQ) 
E  Refusing to split <at position ShufflePosition(base64:AAAABOA3nVgAAQ) of shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAABOA3nVkAAQ))> at ShufflePosition(base64:AAAABOA3nVkAAQ): proposed split position out of range 
I  Proposing dynamic split of work unit our-project-id;2017-09-26_09_29_26-14666853265610614017;1268593085087986642 at {"fractionConsumed":1.0,"position":{"shufflePosition":"AAAABOA3nVkAAQ"}} 
I  Setting node annotation to enable volume controller attach/detach 

I have now cancelled the job. Before this job started, I had reduced the worker disk size to 40 GB, because our quota of 10,240 GB (!!!) was being exceeded with about 15 jobs. I will increase the disk size to around 100 GB, but more than that shouldn't be necessary.

Any suggestions on how else to fix this, or on how this can happen? It would also be interesting to know what this error really means...

The JobID: 2017-09-26_09_29_26-14666853265610614017

The step "ToElasticsearch" was showing 16 hours before I cancelled the job. This step only performs an HTTP POST to Elasticsearch for each article.

1 solution

#1


0  

The most likely cause of this is that you have a hot key. Specifically, one of the keys produces the majority of the output. In such a case, the work doesn't distribute well across the available workers. You could try inserting a Reshuffle transform after any step that may emit many outputs for a single input. It sounds like right after the ReadArticlesFromDatastore step may be the right place.
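
To make the hot-key effect concrete, here is a toy model in plain Python. It is only a sketch and not Dataflow's actual scheduler: the function names, the key names, the fan-out numbers, and the round-robin placement are all assumptions made up for illustration. It compares the per-worker load when every output stays on the worker that owns its input key with the load when the outputs are redistributed individually, which is roughly what a Reshuffle is meant to allow.

```python
def expand(key, fanout):
    """Stand-in for a step that emits many outputs for a single input key."""
    return [(key, i) for i in range(fanout[key])]


def load_without_reshuffle(fanout, num_workers):
    """Each key is assigned to one worker; all of its outputs stay there."""
    load = {w: 0 for w in range(num_workers)}
    for idx, key in enumerate(sorted(fanout)):
        load[idx % num_workers] += len(expand(key, fanout))
    return load


def load_with_reshuffle(fanout, num_workers):
    """Outputs are spread over workers one by one before the slow step."""
    load = {w: 0 for w in range(num_workers)}
    outputs = [out for key in sorted(fanout) for out in expand(key, fanout)]
    for idx, _ in enumerate(outputs):
        load[idx % num_workers] += 1
    return load


# Hypothetical fan-out: one "hot" key produces almost all of the outputs.
fanout = {"a": 10, "b": 10, "c": 10, "hot": 970}

before = load_without_reshuffle(fanout, num_workers=4)
after = load_with_reshuffle(fanout, num_workers=4)

print("busiest worker without reshuffle:", max(before.values()))
print("busiest worker with reshuffle:", max(after.values()))
```

In this model the busiest worker handles 970 of the 1000 elements without redistribution and 250 with it. In a real pipeline the corresponding step would be the SDK's Reshuffle transform (`beam.Reshuffle()` in the Beam Python SDK, `Reshuffle.viaRandomKey()` in the Java SDK) placed before the slow step; whether it helps here depends on where the fan-out actually occurs.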
