Flink TaskManager OutOfMemoryError: Metaspace Troubleshooting Notes

Date: 2022-11-24 11:01:19

Notes on troubleshooting an interesting Flink job failure.

1. Environment

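Flink 1.12 in Standalone mode on a single machine. Because it runs at a customer site, we usually have no view of its running state for long stretches of time.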

2. Symptoms

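Colleagues on site reported that after the device had been running at the customer site for a while, all Flink jobs went down. No jobs were visible on the Flink Dashboard and the TaskManager was dead, yet the TaskManager process was still present. Manually restarting the taskmanager service restored normal operation.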

3. Investigation

System monitoring showed normal disk I/O, CPU, and memory usage, and /var/log/messages contained no errors either.

The Flink TaskManager log, however, contained an OutOfMemoryError: Metaspace entry:

2022-11-03 16:20:59,122 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal error occurred while executing the TaskManager. Shutting it down... java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
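Because the process stays alive after this error, checking the log for the marker text is one way to notice the failure. The following is a minimal sketch that only uses the JDK; the class name MetaspaceOomDetector and its methods are hypothetical helpers, and the marker string is copied from the log line above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical helper that flags log lines reporting a Metaspace OutOfMemoryError. */
final class MetaspaceOomDetector {
    // Marker text copied from the TaskManager log line quoted above.
    private static final String MARKER = "java.lang.OutOfMemoryError: Metaspace";

    private MetaspaceOomDetector() {}

    /** Returns true if the given log line reports a Metaspace OOM. */
    static boolean isMetaspaceOom(String logLine) {
        return logLine != null && logLine.contains(MARKER);
    }

    /** Counts how many of the given log lines report a Metaspace OOM. */
    static long countMetaspaceOoms(List<String> logLines) {
        return logLines.stream().filter(MetaspaceOomDetector::isMetaspaceOom).count();
    }

    public static void main(String[] args) throws IOException {
        // Expects the path of a TaskManager log file as the first argument.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Metaspace OOM lines: " + countMetaspaceOoms(lines));
    }
}
```

A check like this could be run periodically by whatever monitoring is available on the machine; it only reads the log and does not touch the TaskManager itself.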

4. Root-cause analysis

The OOM entry in the TaskManager log includes this hint: "If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown..." We therefore suspected that the error was caused by jobs being restarted many times.

5. First internal reproduction

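We wrote a Flink job that throws as soon as it is deployed to the TaskManager (the simplest way is to throw in the open method so that the job fails), then submitted this failing job repeatedly and rapidly in large numbers. On the Flink Dashboard, the TaskManager's JVM Metaspace grew by about 20 MB after each job restart.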

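Metaspace usage kept climbing until it reached 1 GB, at which point the TaskManager hung: the process still existed, but the jobs had completely crashed and the TaskManager page showed no TaskManager instance. DolphinScheduler resubmitted the jobs, but the submissions never succeeded, so it kept restarting them.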
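The growth described above can also be read programmatically through the JDK's memory pool MXBeans. This is a minimal sketch, not Flink's own metric code: the class MetaspaceProbe is hypothetical, it reports on the JVM it runs in (so it would have to run inside the TaskManager JVM to be meaningful), and it assumes the pool is named "Metaspace", as on HotSpot. The second method is a rough estimate that assumes usage starts near zero and grows linearly.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

/** Hypothetical probe for the Metaspace usage of the current JVM. */
final class MetaspaceProbe {
    private MetaspaceProbe() {}

    /** Returns the bytes used by the pool named "Metaspace", if this JVM exposes one. */
    static OptionalLong usedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // HotSpot names the pool "Metaspace"; other JVMs may use a different name.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    return OptionalLong.of(usage.getUsed());
                }
            }
        }
        return OptionalLong.empty();
    }

    /** Rough count of failed submissions that fit under the limit, assuming linear growth. */
    static long restartsUntilFull(long limitMb, long usedMb, long growthPerRestartMb) {
        return (limitMb - usedMb) / growthPerRestartMb;
    }

    public static void main(String[] args) {
        OptionalLong used = usedBytes();
        System.out.println(used.isPresent()
                ? "Metaspace used: " + used.getAsLong() / (1024 * 1024) + " MB"
                : "No memory pool named Metaspace on this JVM");
        // Figures from this reproduction: 1 GB ceiling, about 20 MB per restart.
        System.out.println("Restarts until full: " + restartsUntilFull(1024, 0, 20));
    }
}
```

With the figures observed here, the estimate comes to roughly 50 failed submissions before the 1 GB ceiling is reached.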

[Screenshot: job restart history]

The TaskManager was dead, but its process was still there (a hung state).

At this point the scheduling system kept retrying to bring the jobs up and kept failing, and the TaskManager log showed no OutOfMemoryError.

6. Second reproduction: MetaSpace OutOfMemoryError successfully reproduced

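This time we submitted the immediately failing job slowly, one submission every 60 s. After a long run of failed submissions, the MetaSpace OutOfMemoryError message appeared in the TaskManager log.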

[Screenshot: the reproduced TaskManager out-of-memory error log]

The restart of flink-taskmanager is also visible in systemd.

Yet the Flink Dashboard at this point showed every job as Running with no problems, although the jobs were not actually working.

Opening a single job, the Overview tab also showed a healthy status.

The Exceptions tab, however, reported that no slot was available, so the job could not be submitted.

7. Third internal reproduction: automatic recovery happens with some probability

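The steps were the same as in the second reproduction, but this time the TaskManager died after a while and was brought back automatically by systemd. After the scheduling system retried a few times, the submission succeeded and the jobs returned to normal.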

8. Resolution

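The first priority is of course to fix the bug in the program that makes the job restart. To guard against other problems later, though, we found the following suggestions online:

1. Separate the JAR dependencies: put the third-party JARs the job uses into the flink/lib directory.
2. Switch the deployment to Local or Yarn mode (according to those sources, this kind of memory leak only shows up in Standalone-cluster mode).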

The official documentation covers this problem here:

https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/#unloading-of-dynamically-loaded-classes-in-user-code

The original English text is probably the clearest:

Unloading of Dynamically Loaded Classes in User Code

All scenarios that involve dynamic user code classloading (sessions) rely on classes being unloaded again. Class unloading means that the Garbage Collector finds that no objects from a class exist and more, and thus removes the class (the code, static variable, metadata, etc).

Whenever a TaskManager starts (or restarts) a task, it will load that specific task’s code. Unless classes can be unloaded, this will become a memory leak, as new versions of classes are loaded and the total number of loaded classes accumulates over time. This typically manifests itself though a OutOfMemoryError: Metaspace.

Common causes for class leaks and suggested fixes:

Lingering Threads: Make sure the application functions/sources/sinks shuts down all threads. Lingering threads cost resources themselves and additionally typically hold references to (user code) objects, preventing garbage collection and unloading of the classes.

Interners: Avoid caching objects in special structures that live beyond the lifetime of the functions/sources/sinks. Examples are Guava’s interners, or Avro’s class/object caches in the serializers.

JDBC: JDBC drivers leak references outside the user code classloader. To ensure that these classes are only loaded once you should add the driver jars to Flink’s lib/ folder instead of bundling them in the user-jar. If you can’t guarantee that none of your user-jars bundle the driver, you have to additionally add the driver classes to the list of parent-first loaded classes via classloader.parent-first-patterns-additional.
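The "Lingering Threads" advice above can be sketched with plain JDK classes. WorkerOwningFunction below is a hypothetical stand-in for a function with an open()/close() lifecycle, not a Flink class; the 5-second timeout is an arbitrary value chosen for the sketch.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a function that owns a background thread. */
final class WorkerOwningFunction implements AutoCloseable {
    private ExecutorService executor;

    /** Starts the background worker; in a real function this would happen in open(). */
    void open() {
        executor = Executors.newSingleThreadExecutor();
        executor.execute(() -> { /* background work would go here */ });
    }

    /** Stops the worker so that no thread keeps user-code objects reachable. */
    @Override
    public void close() {
        if (executor == null) {
            return;
        }
        executor.shutdown(); // stop accepting work, let running tasks finish
        try {
            if (!executor.awaitTermination(5, TimeUnit.SECONDS)) {
                executor.shutdownNow(); // interrupt tasks that did not finish in time
                executor.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }

    /** Reports whether the worker thread has fully stopped. */
    boolean isStopped() {
        return executor == null || executor.isTerminated();
    }

    public static void main(String[] args) {
        WorkerOwningFunction fn = new WorkerOwningFunction();
        fn.open();
        fn.close();
        System.out.println("stopped: " + fn.isStopped());
    }
}
```

The point of the pattern is that every thread started during the function's lifetime is stopped in close(), including on the failure path, so a failed job does not leave a thread behind that still references its classes.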

A helpful tool for unloading dynamically loaded classes are the user code class loader release hooks. These are hooks which are executed prior to the unloading of a classloader. It is generally recommended to shutdown and unload resources as part of the regular function lifecycle (typically the close() methods). But in some cases (for example for static fields), it is better to unload once a classloader is certainly not needed anymore.

Class loader release hooks can be registered via the RuntimeContext.registerUserCodeClassLoaderReleaseHookIfAbsent() method.
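As a sketch of what such a hook might release, the class below keeps a thread pool in a static field and exposes a Runnable that shuts it down. SharedStaticPool, RELEASE_HOOK, and the hook name "shared-pool-cleanup" are hypothetical names for this illustration. The registration call is shown only in a comment so that the block compiles without Flink on the classpath; check that your Flink version provides the method before relying on it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical user-code class whose static pool is shared by several function instances. */
final class SharedStaticPool {
    // Threads of a static pool stay alive after any single function instance is closed.
    static final ExecutorService POOL = Executors.newCachedThreadPool();

    // Cleanup intended to run once, when the user-code classloader is released.
    static final Runnable RELEASE_HOOK = POOL::shutdownNow;

    private SharedStaticPool() {}

    // Inside a rich function's open(), the hook would be registered roughly like this:
    //   getRuntimeContext().registerUserCodeClassLoaderReleaseHookIfAbsent(
    //       "shared-pool-cleanup", SharedStaticPool.RELEASE_HOOK);

    public static void main(String[] args) {
        POOL.execute(() -> { /* some shared background work */ });
        RELEASE_HOOK.run();
        System.out.println("pool shut down: " + POOL.isShutdown());
    }
}
```

A static pool is used here because no single close() call can safely shut it down while other instances may still be using it, which is the static-field case the documentation mentions.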

From the official documentation (translated):

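Flink's components (JobManager, TaskManager, Client, ApplicationMaster, and so on) log their classpath settings on startup, as part of the environment information at the beginning of the log. When the JobManager and TaskManager are dedicated to a single job, the user code JAR files can be placed in the /lib directory so that they are part of the classpath and are not loaded dynamically. Placing the job's JAR file in /lib usually works. The JAR is then part of both the classpath (AppClassLoader) and the dynamic class loader (FlinkUserCodeClassLoader). Because AppClassLoader is the parent of FlinkUserCodeClassLoader (and Java loads parent-first by default), the classes are loaded only once. When the job's JAR files cannot all be placed in /lib (for example, in a session shared by multiple jobs), the common libraries can still be placed in /lib to avoid loading those classes dynamically.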
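The "loaded only once" effect of parent-first delegation can be seen with the JDK's own URLClassLoader. This is a minimal sketch of the delegation idea, not of Flink's classloader: ParentFirstDemo is a hypothetical class, the two child loaders stand in for two job submissions, and Flink's user-code classloader may resolve classes in a different order depending on the classloader.resolve-order setting.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical demo: a class known to the parent loader is not loaded again by children. */
final class ParentFirstDemo {
    private ParentFirstDemo() {}

    /** Loads this class through two child loaders and checks both return the parent's copy. */
    static boolean loadedOnceViaParent() {
        ClassLoader parent = ParentFirstDemo.class.getClassLoader();
        String name = ParentFirstDemo.class.getName();
        // Each child loader plays the role of one job submission's dynamic classloader.
        try (URLClassLoader jobA = new URLClassLoader(new URL[0], parent);
             URLClassLoader jobB = new URLClassLoader(new URL[0], parent)) {
            Class<?> viaA = jobA.loadClass(name);
            Class<?> viaB = jobB.loadClass(name);
            // Parent-first delegation hands back the single Class object owned by the parent.
            return viaA == ParentFirstDemo.class && viaB == ParentFirstDemo.class;
        } catch (Exception e) {
            throw new IllegalStateException("class loading failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("loaded once via parent: " + loadedOnceViaParent());
    }
}
```

Since both children resolve to the same Class object, repeated submissions add no new class metadata for classes the parent already holds, which is the reasoning behind moving shared libraries into /lib.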