When reading data from Datastore in my Dataflow pipeline, it seems the job is not being distributed across the number of workers I have set for the job. Does Dataflow parallelize the read of Datastore data, or is it done by a single worker?
1 Answer
#1
Typically, reads made by DatastoreIO use multiple workers to read in parallel. However, according to the documentation, not all queries can be parallelized: for instance, queries that specify a limit or use an inequality filter. These queries must run on a single worker to ensure correctness.
https://cloud.google.com/dataflow/model/datastore-io#reading-from-datastore
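To illustrate, here is a minimal sketch of a parallelizable Datastore read using the Dataflow Java SDK's `DatastoreIO` connector covered by the documentation above. The project ID and kind name (`"MyKind"`) are placeholders, not values from the question.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.datastore.DatastoreIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.datastore.v1.Query;

public class DatastoreReadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // A plain kind query with no limit and no inequality filter:
    // DatastoreIO can split it into sub-queries and read in parallel
    // across workers.
    Query.Builder query = Query.newBuilder();
    query.addKindBuilder().setName("MyKind"); // placeholder kind

    p.apply(DatastoreIO.v1().read()
        .withProjectId("my-project") // placeholder project ID
        .withQuery(query.build()));

    // Note: setting a limit or adding an inequality filter on the query
    // would, per the documentation, force the read onto a single worker.
    p.run();
  }
}
```

If your read is not parallelizing, check the query for a limit or inequality filter first, since either one forces the single-worker path described in the answer.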