Profiling a Java application running on Google Dataflow

Time: 2022-07-08 15:39:09

Do you have any idea how to profile a Java application running on a Dataflow worker? Do you know of any tools that would let me find memory leaks in my application?

1 solution

#1


2  

For time profiling, you can try the instructions described in issue 72, but you may have difficulty with workers being torn down or auto-scaled away before you can get the profiles off the worker. Unfortunately it doesn't provide memory profiling, so it won't help with memory leaks.
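
As a generic JVM-level option (not part of the answer above, not specific to Dataflow, and only a sketch), you can trigger a heap dump from your own code -- for example from inside a DoFn when memory use looks suspicious -- so you have an .hprof file to analyze even if the worker is recycled shortly afterwards. This uses the standard HotSpotDiagnostic MXBean; the class name and output path are placeholders:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public final class HeapDumper {

  /** Writes a heap dump of live (reachable) objects to the given .hprof path. */
  public static void dumpHeap(String hprofPath) {
    try {
      HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
          ManagementFactory.getPlatformMBeanServer(),
          "com.sun.management:type=HotSpotDiagnostic",
          HotSpotDiagnosticMXBean.class);
      // live = true runs a GC first and dumps only reachable objects,
      // which is usually what you want when hunting a leak.
      diagnostic.dumpHeap(hprofPath, true);
    } catch (Exception e) {
      throw new RuntimeException("Could not write heap dump to " + hprofPath, e);
    }
  }
}
```

You would still need to copy the dump off the worker (for example to a GCS bucket) before the VM is torn down, and can then open it offline in a heap analyzer such as Eclipse MAT or VisualVM.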

You can also run with the DirectPipelineRunner, which will execute the pipeline locally on your machine. This lets you profile the code in your pipeline without having to deal with Dataflow workers at all. Depending on the scale of the pipeline, you'll likely need to shrink the input to something that can be handled on one machine.
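
A minimal sketch of such a local run, assuming the original (pre-Beam) Dataflow Java SDK 1.x where DirectPipelineRunner lives (in current Apache Beam the equivalent is the DirectRunner); the class name, file paths, and transforms are placeholders:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner;

public class LocalProfilingRun {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    // Force local execution so a profiler (VisualVM, YourKit, async-profiler, ...)
    // can simply attach to this JVM.
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from("/tmp/sample-input.txt"))        // small local sample of the real input
        // ... the transforms you actually want to profile go here ...
        .apply(TextIO.Write.to("/tmp/local-profiling-output"));
    p.run();
  }
}
```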

It may also be helpful to distinguish the code that runs on the worker -- e.g., the code inside a single DoFn -- from the structure of the pipeline and the data. For instance, out-of-memory problems can be caused by a GroupByKey with too many values associated with a single key, where all of those values are then read into a list.
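
Here is an illustrative sketch of that pattern, again in Dataflow SDK 1.x style (the method name sumPerKey and the input PCollection perKeyValues are hypothetical). Materializing all values for a key into a list is the risky variant; iterating over the grouped Iterable and keeping only a small aggregate is the safer one:

```java
// perKeyValues is a PCollection<KV<String, Integer>> produced earlier in the pipeline.
static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Integer>> perKeyValues) {
  PCollection<KV<String, Iterable<Integer>>> grouped =
      perKeyValues.apply(GroupByKey.<String, Integer>create());

  return grouped.apply(ParDo.of(new DoFn<KV<String, Iterable<Integer>>, KV<String, Long>>() {
    @Override
    public void processElement(ProcessContext c) {
      // Risky variant (avoid): copying every value for the key into a List keeps
      // them all on the heap at once and can OOM the worker on a hot key.
      // Safer: stream over the Iterable and keep only a running aggregate.
      long sum = 0;
      for (Integer v : c.element().getValue()) {
        sum += v;
      }
      c.output(KV.of(c.element().getKey(), sum));
    }
  }));
}
```

If a single key really is that hot, restructuring the pipeline (for example with Combine.perKey instead of GroupByKey, or by salting the key) is usually a better fix than trying to give the worker more memory.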
