Official configuration reference: Configuration | Apache Flink
1. The TaskManager process stops after running for a while
Error message:
Task did not exit gracefully within 180 + seconds.
Cause: task cancellation timed out.
Fix: in the TaskManager's ${FLINK_HOME}/conf/flink-conf.yaml:
# disable the task-cancellation watchdog
task.cancellation.timeout: 0
Parameter description (from the official docs): Timeout in milliseconds after which a task cancellation times out and leads to a fatal TaskManager error. A value of 0 deactivates the watch dog. Notice that a task cancellation is different from both a task failure and a clean shutdown. Task cancellation timeout only applies to task cancellation and does not apply to task closing/clean-up caused by a task failure or a clean shutdown.
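Disabling the watchdog hides slow cancellations rather than diagnosing them; a hedged alternative is to raise the timeout instead of setting it to 0. A sketch for flink-conf.yaml (the 600000 value is an illustrative assumption, not from the source):

```yaml
# Give task cancellation 10 minutes instead of the 180 s default
# before the watchdog treats it as a fatal TaskManager error
task.cancellation.timeout: 600000
```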
2. Jars uploaded through the web UI are all lost after the standalone cluster restarts
Cause: the files are saved under /tmp by default and get cleaned up.
Fix: in the JobManager's ${FLINK_HOME}/conf/flink-conf.yaml:
web.upload.dir: /usr/local/flink/upload
io.tmp.dirs: /usr/local/flink/tmpdir
3. The JM stop script cannot stop the standalone cluster
Cause: the pid files are saved under /tmp by default; once they are cleaned up, the stop scripts cannot find the pids to kill.
Fix: in the JobManager's ${FLINK_HOME}/conf/flink-conf.yaml:
env.pid.dir: /usr/local/flink/piddir
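Problems 2 and 3 share the same root cause (/tmp cleanup), so the relevant flink-conf.yaml entries can be sketched together; the paths are the ones used above:

```yaml
# Keep web-UI jar uploads and temp files out of /tmp
web.upload.dir: /usr/local/flink/upload
io.tmp.dirs: /usr/local/flink/tmpdir
# Keep the .pid files the start/stop scripts rely on out of /tmp
env.pid.dir: /usr/local/flink/piddir
```

Note that these directories must exist and be writable by the user running the Flink daemons before the cluster is restarted.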
4. A value stored in ZooKeeper was too long; the ZooKeeper cluster went down and took all the TaskManagers down with it. ZooKeeper error:
Unexpected exception causing shutdown while sock still open
java.io.IOException: Unreasonable length = 1970218037
Zookeeper server went down in HA cluster. Please reply if there is any workaround.
You can attempt to increase your Java System Property on the ZK servers to a value higher than 2-3 GB (in bytes) to overcome this. It appears a very large record was somehow placed into your ZK by an application, which appears to have then caused this issue.
Fix: raise ZooKeeper's jute.maxbuffer limit to a length appropriate for the stored values.
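The "Unreasonable length" check comes from ZooKeeper's jute serialization layer, whose record-size limit is the jute.maxbuffer Java system property (default roughly 1 MB). It must be raised consistently on both the ZooKeeper servers and the clients; a sketch (the 4 MB value is an example, not from the source):

```shell
# conf/java.env on each ZooKeeper server
SERVER_JVMFLAGS="-Djute.maxbuffer=4194304"
# Clients (e.g. the Flink JVMs) need the same -Djute.maxbuffer=4194304
```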
5. java.lang.OutOfMemoryError: Metaspace. Detailed error message:
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak in user code or some of its dependencies which has to be investigated and fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:1.8.0_291]
Cause: not identified yet; still under observation. Write-ups found online suggest two possibilities: blocking code and backpressure.
Short-term workaround: in the TaskManager's ${FLINK_HOME}/conf/flink-conf.yaml, raise the metaspace size (default 256m):
taskmanager.memory.jvm-metaspace.size: 512m
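A sketch of the short-term change, with one caveat: JVM metaspace counts toward the TaskManager's total process memory, so on tightly sized hosts or containers the total budget may need adjusting too (the process.size line is an illustrative assumption, not from the source):

```yaml
# Double the JVM metaspace budget (Flink 1.14 default: 256m)
taskmanager.memory.jvm-metaspace.size: 512m
# If total process memory is pinned, leave headroom for the larger metaspace
# taskmanager.memory.process.size: 4g   # example value only
```

If the error keeps returning after repeated job (re-)submissions, raising the limit only delays it; per the error message above, the class-loading leak in user code or its dependencies has to be investigated and fixed.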
6. Querying checkpoints from the Flink web UI fails
ERROR [] - Unhandled exception.
.: input array
Cause: a serialization bug in Flink versions ≤ 1.14.4.
Fix: upgrade to 1.14.5 or 1.15.0; as of 2022-05-24 the release had not yet been published. See [FLINK-25904] NullArgumentException when accessing checkpoint stats on standby JobManager - ASF JIRA.