I was benchmarking Dataflow batch loads and found them far slower than the same loads run with the BigQuery command-line tool.
The file was around 20 MB with millions of records. I tried different machine types and got the best load performance on n1-highmem-4, with a load time of roughly 8 minutes into the target BQ table.
When I ran the same table load with the BQ command-line utility, it took barely 2 minutes to process and load the same volume of data. Any insights into why the Dataflow job performs so poorly? How can I improve it to be comparable to the BQ command-line utility?
1 Solution
#1
5
Most likely, a few minutes are being spent on starting and shutting down VMs. If you're doing something that can directly be done using BQ CLI, then using Dataflow for that purpose is likely overkill. However, you can update your question with more details (e.g. your code and the Dataflow job id) - maybe there's something else inefficient going on.
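To make the overhead argument concrete, here is a rough sketch of how the reported wall-clock time might split into fixed VM overhead and actual load work. The 8-minute and 2-minute totals come from the question. The startup and shutdown durations are hypothetical placeholders, not measurements, and `split_wall_time` is just an illustrative helper; the real figures would have to be read from the job's own logs.

```python
def split_wall_time(total_s, startup_s, shutdown_s):
    """Split a job's wall-clock time into fixed VM overhead and actual work."""
    overhead_s = startup_s + shutdown_s
    if overhead_s > total_s:
        raise ValueError("overhead exceeds total wall-clock time")
    return {
        "overhead_s": overhead_s,
        "work_s": total_s - overhead_s,
        "overhead_share": overhead_s / total_s,
    }


# Reported in the question: about 8 minutes via Dataflow, about 2 minutes via the BQ CLI.
DATAFLOW_TOTAL_S = 8 * 60
BQ_CLI_TOTAL_S = 2 * 60

# ASSUMPTION: hypothetical placeholder durations for VM startup and shutdown.
ASSUMED_STARTUP_S = 3 * 60
ASSUMED_SHUTDOWN_S = 1 * 60

breakdown = split_wall_time(DATAFLOW_TOTAL_S, ASSUMED_STARTUP_S, ASSUMED_SHUTDOWN_S)
print(breakdown)
print("work time vs. BQ CLI total (s):", breakdown["work_s"], "vs.", BQ_CLI_TOTAL_S)
```

If the measured overhead turns out to be small instead, the gap is more likely in the pipeline itself, which is where the code and job id would help.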