BigQueryIO.read() performance is very slow -- Apache Beam

Time: 2020-12-24 15:26:47

I am trying to read records from a BigQuery table which has 2,410,957,408 records, and it is taking forever just to read them using BigQueryIO.readTableRows() in Apache Beam.


I am using the default machine type "n1-standard-1" and Autoscaling.


What can be done to improve the performance significantly without having a large impact on cost? Would a high-mem or high-cpu machine type help?


2 solutions

#1


3  

BigQueryIO.readTableRows() will first export the table data into a GCS bucket, and the Beam workers will consume the export from there. The export stage uses a BigQuery API; it is not very performant and is not part of the Beam implementation.


#2


0  

I looked at the job you quoted, and it seems that the majority of the time is spent on Beam ingesting the data exported by BigQuery, specifically, it seems, on converting the BigQuery export results to TableRow. TableRow is a very bulky and inefficient object - for better performance I recommend using BigQueryIO.read(SerializableFunction) for reading into your custom type directly.

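A minimal sketch of the approach suggested above: reading into a lean custom type instead of TableRow. The project, dataset, and field names (`my_project:my_dataset.events`, `user_id`, `event_count`) are hypothetical placeholders, not from the question. The Beam-specific wiring is shown in the comment; the value type and its conversion helper are plain Java:

```java
import java.io.Serializable;

// With the Beam Java SDK on the classpath, the read transform would look
// roughly like this (sketch, not verbatim from the original thread):
//
//   PCollection<Event> events = pipeline.apply(
//       BigQueryIO.read(
//               (SchemaAndRecord sar) -> Event.fromAvro(
//                   sar.getRecord().get("user_id").toString(),
//                   (Long) sar.getRecord().get("event_count")))
//           .from("my_project:my_dataset.events")
//           .withCoder(SerializableCoder.of(Event.class)));
//
// The SerializableFunction receives the Avro GenericRecord from the export
// and builds this small, serializable value object directly, skipping the
// heavyweight TableRow representation entirely.
class Event implements Serializable {
    final String userId;
    final long eventCount;

    Event(String userId, long eventCount) {
        this.userId = userId;
        this.eventCount = eventCount;
    }

    // Builds an Event from field values extracted from the Avro record;
    // a null count is treated as zero.
    static Event fromAvro(String userId, Long eventCount) {
        return new Event(userId, eventCount == null ? 0L : eventCount);
    }
}
```

Since the conversion runs once per record across billions of rows, keeping the per-record object small and the parse function cheap is where the savings come from.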
