What are the consequences of not reaching the target number of workers in a Dataflow job?

My Apache Beam (Scio) Dataflow job is asking for more workers than my current quota allows. The job completes successfully, but it is limited to 575 workers. What are the consequences of not giving it the workers, and therefore the RAM, it is asking for? More disk IO for intermediate steps? Slower sink IO? Does it depend on what the job is doing? In particular, my job is pretty simple and really has only 2 steps:

- aggregateByKey
- do IO per key

I can run my own experiments, but I'm also interested in the cost of the job, since it isn't an extremely time-sensitive operation (i.e., I'm okay with letting it run longer if that is cheaper).

1 solution

#1

In this case, your job will have a longer runtime than it would with a higher quota, but the aggregate amount of time that all workers spend performing work should be about the same.
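
As a rough illustration of that trade-off, here is a minimal sketch. It assumes the work parallelizes perfectly and that the job would otherwise have scaled to 1000 workers; both the 1000-worker target and the total of 1000 worker-hours are made-up numbers, and only the 575 cap comes from the question. Real jobs have stragglers and startup overhead, so treat this as an idealized model rather than a prediction.

```python
def wall_clock_hours(total_worker_hours: float, workers: int) -> float:
    """Idealized runtime: total work divided evenly across all workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return total_worker_hours / workers


# Hypothetical amount of work the job needs, in worker-hours.
TOTAL_WORKER_HOURS = 1000.0

# Hypothetical autoscaling target vs. the quota cap from the question.
TARGET_WORKERS = 1000
CAPPED_WORKERS = 575

uncapped = wall_clock_hours(TOTAL_WORKER_HOURS, TARGET_WORKERS)
capped = wall_clock_hours(TOTAL_WORKER_HOURS, CAPPED_WORKERS)

# The runtime grows, but workers * hours stays (approximately) constant.
print(f"uncapped: {uncapped:.2f} h on {TARGET_WORKERS} workers")
print(f"capped:   {capped:.2f} h on {CAPPED_WORKERS} workers")
```

Under these assumptions the capped run simply takes longer in wall-clock time while consuming the same number of worker-hours.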

Dataflow bills you for the amount of time each CPU, memory, and storage unit is allocated. If the total CPU-hours, RAM GB-hours, and storage GB-hours are about the same, your job should cost about the same.
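
To make the billing argument concrete, here is a small sketch of that cost model. The hourly rates and the worker shape below are placeholders invented for illustration, not actual Dataflow prices, and the 1000-worker run is a hypothetical comparison point; substitute the real rates for your region and machine type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical hourly prices (NOT real Dataflow pricing)."""
    per_vcpu_hour: float
    per_gb_ram_hour: float
    per_gb_disk_hour: float


@dataclass(frozen=True)
class WorkerShape:
    """Resources allocated to a single worker (assumed values)."""
    vcpus: int
    ram_gb: float
    disk_gb: float


def job_cost(workers: int, hours: float, shape: WorkerShape, rates: Rates) -> float:
    """Cost = workers * hours * (price of one worker's resources per hour)."""
    per_worker_hour = (
        shape.vcpus * rates.per_vcpu_hour
        + shape.ram_gb * rates.per_gb_ram_hour
        + shape.disk_gb * rates.per_gb_disk_hour
    )
    return workers * hours * per_worker_hour


# Placeholder numbers, chosen only to make the arithmetic easy to follow.
RATES = Rates(per_vcpu_hour=0.05, per_gb_ram_hour=0.004, per_gb_disk_hour=0.0001)
SHAPE = WorkerShape(vcpus=4, ram_gb=15.0, disk_gb=250.0)

# Same total work: 1000 workers for 1 hour vs. 575 workers for 1000/575 hours.
cost_uncapped = job_cost(1000, 1.0, SHAPE, RATES)
cost_capped = job_cost(575, 1000 / 575, SHAPE, RATES)

print(f"uncapped: {cost_uncapped:.2f}, capped: {cost_capped:.2f}")
```

As long as the total resource-hours are unchanged, the two costs match; any real difference would come from effects this model ignores, such as per-worker startup time or idle workers at the tail of the job.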

Note: Dataflow also charges for the number of bytes shuffled if you use the shuffle service. This should not be affected by the number of workers either.
