Official configuration reference: Configuration | Apache Flink
1. The TaskManager process stops after running for a while
Error message:
Task did not exit gracefully within 180 + seconds.
Cause: task cancellation timed out.
Fix: in the TaskManager's ${FLINK_HOME}/conf/flink-conf.yaml:
# disable the task-cancellation watchdog
task.cancellation.timeout: 0
Parameter description (from the official docs): Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
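Disabling the watchdog hides slow cancellations rather than diagnosing them; a hedged alternative is to raise the timeout instead of setting it to 0. A sketch for flink-conf.yaml (the 600000 value is an illustrative assumption, not from the source):

```yaml
# Give task cancellation 10 minutes instead of the 180 s default
# before the watchdog treats it as a fatal TaskManager error
task.cancellation.timeout: 600000
```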
2. Jars uploaded through the web UI are all lost after the standalone cluster restarts
Cause: the files are saved under /tmp by default and get cleaned up.
Fix: in the JobManager's ${FLINK_HOME}/conf/flink-conf.yaml:
web.upload.dir: /usr/local/flink/upload
io.tmp.dirs: /usr/local/flink/tmpdir
3. The JM stop script cannot stop the standalone cluster
Cause: the pid files are saved under /tmp by default; once they are cleaned up, the stop scripts cannot find the pids to kill.
Fix: in the JobManager's ${FLINK_HOME}/conf/flink-conf.yaml:
env.pid.dir: /usr/local/flink/piddir
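Problems 2 and 3 share the same root cause (/tmp cleanup), so the relevant flink-conf.yaml entries can be sketched together; the paths are the ones used above:

```yaml
# Keep web-UI jar uploads and temp files out of /tmp
web.upload.dir: /usr/local/flink/upload
io.tmp.dirs: /usr/local/flink/tmpdir
# Keep the .pid files the start/stop scripts rely on out of /tmp
env.pid.dir: /usr/local/flink/piddir
```

Note that these directories must exist and be writable by the user running the Flink daemons before the cluster is restarted.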
4. A value stored in ZooKeeper was too long; the ZooKeeper cluster went down and took all the TaskManagers down with it. ZooKeeper error:
Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037
Zookeeper server went down in HA cluster. Please reply if there is any workaround.
You can attempt to increase your Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue.
Fix: raise ZooKeeper's jute.maxbuffer limit to a length appropriate for the stored values.
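The "Unreasonable length" check comes from ZooKeeper's jute serialization layer, whose record-size limit is the jute.maxbuffer Java system property (default roughly 1 MB). It must be raised consistently on both the ZooKeeper servers and the clients; a sketch (the 4 MB value is an example, not from the source):

```shell
# conf/java.env on each ZooKeeper server
SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"
# Clients (e.g. the Flink JVMs) need the same -Djute.maxbuffer=4194304
```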
5. java.lang.OutOfMemoryError: Metaspace. Detailed error message:
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
Cause: not identified yet; still under observation. Write-ups found online suggest two possibilities: blocking code and backpressure.
Short-term workaround: in the TaskManager's ${FLINK_HOME}/conf/flink-conf.yaml, raise the metaspace size (default 256m):
taskmanager.memory.jvm-metaspace.size: 512m
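A sketch of the short-term change, with one caveat: JVM metaspace counts toward the TaskManager's total process memory, so on tightly sized hosts or containers the total budget may need adjusting too (the process.size line is an illustrative assumption, not from the source):

```yaml
# Double the JVM metaspace budget (Flink 1.14 default: 256m)
taskmanager.memory.jvm-metaspace.size: 512m
# If total process memory is pinned, leave headroom for the larger metaspace
# taskmanager.memory.process.size: 4g   # example value only
```

If the error keeps returning after repeated job (re-)submissions, raising the limit only delays it; per the error message above, the class-loading leak in user code or its dependencies has to be investigated and fixed.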
6. Querying checkpoints from the Flink web UI fails
ERROR [] - Unhandled exception.
.: input array
Cause: a serialization bug in Flink versions ≤ 1.14.4.
Fix: upgrade to 1.14.5 or 1.15.0; as of 2022-05-24 the release had not yet been published. See [FLINK-25904] NullArgumentException when accessing checkpoint stats on standby JobManager - ASF JIRA.