ZooKeeper fails to start after an abnormal CDH cluster shutdown

Date: 2021-11-20 23:46:43

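After the cluster was shut down abnormally, one ZooKeeper node kept failing to start, and the logs in CM (Cloudera Manager) showed no obvious error.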

Troubleshooting:

 1. Try starting from the command line with zkServer.sh start, then check zookeeper.out, which shows the following error:

Unexpected exception, exiting abnormally
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.inStreamCreated(FileTxnLog.java:558)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.createInputArchive(FileTxnLog.java:577)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.goToNextLog(FileTxnLog.java:543)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.next(FileTxnLog.java:625)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.init(FileTxnLog.java:529)
        at org.apache.zookeeper.server.persistence.FileTxnLog$FileTxnIterator.<init>(FileTxnLog.java:504)
        at org.apache.zookeeper.server.persistence.FileTxnLog.read(FileTxnLog.java:341)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:175)
        at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
        at org.apache.zookeeper.server.ZooKeeperServer.loadData(ZooKeeperServer.java:250)
        at org.apache.zookeeper.server.ZooKeeperServer.startdata(ZooKeeperServer.java:377)
        at org.apache.zookeeper.server.NIOServerCnxnFactory.startup(NIOServerCnxnFactory.java:122)
        at org.apache.zookeeper.server.ZooKeeperServerMain.runFromConfig(ZooKeeperServerMain.java:118)
        at org.apache.zookeeper.server.ZooKeeperServerMain.initializeAndRun(ZooKeeperServerMain.java:91)
        at org.apache.zookeeper.server.ZooKeeperServerMain.main(ZooKeeperServerMain.java:53)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:121)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79)
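So the failure really was caused by the abnormal shutdown: ZooKeeper hits end-of-file while reading the header of a transaction log during restore.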
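
An EOFException raised inside FileHeader.deserialize suggests a transaction log file that is shorter than its own header, most often an empty file left behind by the crash. A minimal sketch for spotting such candidates follows. The function name zk_find_empty_txnlogs is hypothetical, and the idea that an empty log.* file is the culprit is an assumption, not something the trace above proves.

```shell
# List transaction log files under <dataLogDir>/version-2 that are empty.
# $1 = dataLogDir (for example /var/lib/zookeeper)
# Assumption: the broken log is zero bytes long; a partially written
# header would not be caught by this check.
zk_find_empty_txnlogs() {
    find "$1/version-2" -type f -name 'log.*' -size 0
}
```

Calling zk_find_empty_txnlogs /var/lib/zookeeper prints one path per empty log file, or nothing if there are none.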

2. Looking for a fix online:

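First check the conf/zoo.cfg configuration file and find the directories that dataDir and dataLogDir point to; on CDH 5.12 both are /var/lib/zookeeper. Delete the version-2 folder in that directory, start ZooKeeper from the command line, and zookeeper.out shows that it starts successfully.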
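
The lookup and cleanup described above can be sketched as two small shell functions. Both names (zk_cfg_value, zk_set_aside_version2) are hypothetical helpers. Note also that the second one moves version-2 to a timestamped backup rather than deleting it as the post does, which is a more cautious variation and not the original procedure.

```shell
# Print the last value assigned to a key in a zoo.cfg-style file.
# $1 = config file, $2 = key (for example dataDir or dataLogDir)
zk_cfg_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Move <dir>/version-2 aside instead of deleting it outright.
# $1 = data directory; prints the backup path on success.
zk_set_aside_version2() {
    src="$1/version-2"
    [ -d "$src" ] || { echo "no version-2 under $1" >&2; return 1; }
    dst="$1/version-2.bak.$(date +%Y%m%d%H%M%S)"
    mv "$src" "$dst" && echo "$dst"
}
```

For example, zk_set_aside_version2 "$(zk_cfg_value conf/zoo.cfg dataDir)" would set aside the data under whatever dataDir the config names. If dataLogDir points somewhere else, the same step would be needed there too.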

 Happily I went to start it from the cluster (CM), only to find it still failed, now with the following error:

Unable to access datadir, exiting abnormally
org.apache.zookeeper.server.persistence.FileTxnSnapLog$DatadirException: Missing data directory /var/lib/zookeeper/version-2, automatic data directory creation is disabled (zookeeper.datadir.autocreate is false). Please create this directory manually.
	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.<init>(FileTxnSnapLog.java:102)
	at org.apache.zookeeper.server.quorum.QuorumPeer.<init>(QuorumPeer.java:490)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:138)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:116)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:79) 

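Checking the directory showed that the version-2 directory created by the command-line start belongs to user and group root, whereas CM starts ZooKeeper as the zookeeper user and group.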
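
A quick way to make that ownership check repeatable is sketched below. The helper names zk_dir_owner and zk_owned_by are hypothetical, and the sketch assumes user and group names without spaces, since it parses ls -ld output.

```shell
# Print "owner:group" of a path as reported by ls -ld.
zk_dir_owner() {
    ls -ld "$1" | awk '{ print $3 ":" $4 }'
}

# Succeed only if the path is owned by the expected "owner:group".
# $1 = path, $2 = expected value, e.g. zookeeper:zookeeper
zk_owned_by() {
    [ "$(zk_dir_owner "$1")" = "$2" ]
}
```

For instance, zk_owned_by /var/lib/zookeeper/version-2 zookeeper:zookeeper || echo "wrong owner" would flag a directory left behind by a root-run zkServer.sh.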


So I deleted version-2 and started again through CM, hoping CM would create the directory. It still failed with the same error.

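CM is a bit of a trap here: it does not create this folder automatically (the error above says zookeeper.datadir.autocreate is false). So I started ZooKeeper from the command line to generate version-2, then changed its owner and group to zookeeper.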

After that, the start through CM succeeded. In theory, creating the directory by hand should work just as well.



3. Summary of the fix

  • Delete the /var/lib/zookeeper/version-2 folder
  • Create a new version-2 folder and change its owner and group to zookeeper
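
The second step of the summary can be sketched as one function. The name zk_recreate_version2 is hypothetical. The default owner zookeeper:zookeeper comes from this post, and changing ownership to another account is assumed to be run as root.

```shell
# Recreate an empty version-2 directory and hand it to the service account.
# $1 = dataDir, $2 = owner spec for chown (default zookeeper:zookeeper)
zk_recreate_version2() {
    dir="$1/version-2"
    owner="${2:-zookeeper:zookeeper}"
    mkdir -p "$dir" && chown -R "$owner" "$dir"
}
```

On the affected node this would be called as zk_recreate_version2 /var/lib/zookeeper before starting the role from CM again.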