zookeeper服务器主动断开与客户端的连接问题

时间:2025-03-19 07:25:03

最近在项目中用到了zookeeper的主从任务分发的功能。其中遇到了zookeeper server主动断开与client连接的问题:

2017-09-0109:45:57,390 INFO : Client session timed out,have not heard from server in 1668ms for sessionid 0x0, closing socketconnection and attempting reconnect

2017-09-01 09:45:58,224 : Connection timed out: couldn'tconnect to ZooKeeper in 5000 milliseconds

2017-09-0109:45:58,450 INFO : Session: 0x0 closed

2017-09-0109:45:58,450 INFO : EventThread shut down

2017-09-0109:45:58,451 WARN :$ConnectionLossException: KeeperErrorCode =ConnectionLoss

2017-09-0109:46:03,453 INFO : Initiating client connection,connectString=test-ssps-s-02:2181,test-ssps-s-03:2181,test-ssps-s-04:2181sessionTimeout=5000watcher=$WatcherWithClientRef@7f0d36c

2017-09-0109:46:03,455 INFO : Opening socket connection toserver test-ssps-s-04/10.117.210.216:2181. Will not attempt to authenticateusing SASL (unknown error)

2017-09-0109:46:03,463 INFO : Socket connectionestablished to test-ssps-s-04/10.117.210.216:2181, initiating session

2017-09-0109:46:05,131 INFO : Client session timed out,have not heard from server in 1668ms for sessionid 0x0, closing socketconnection and attempting reconnect

2017-09-0109:46:05,885 INFO : Opening socket connection toserver test-ssps-s-03/10.51.20.155:2181. Will not attempt to authenticate usingSASL (unknown error)

2017-09-0109:46:06,626 INFO : Socket connectionestablished to test-ssps-s-03/10.51.20.155:2181, initiating session

2017-09-0109:46:08,293 INFO : Client session timed out,have not heard from server in 1667ms for sessionid 0x0, closing socketconnection and attempting reconnect

2017-09-01 09:46:08,454 ERROR :Connection timed out: couldn't connect to ZooKeeper in 5000 milliseconds

2017-09-0109:46:09,157 INFO : Session: 0x0 closed

2017-09-01 09:46:09,157 INFO :EventThread shut down

2017-09-0109:46:09,158 WARN :$ConnectionLossException: KeeperErrorCode =ConnectionLoss

日志显示无法与zk建立连接,导致failover终止。

查看leader日志:

2017-09-0109:45:49,754 INFO : Acceptedsocket connection from /10.51.20.155:40713

2017-09-0109:45:49,754 INFO : Clientattempting to establish new session at /10.51.20.155:40713

2017-09-0109:45:51,090 INFO : Gotuser-level KeeperException when processing sessionid:0x15e2a2922a213b5type:setData cxid:0x21 zxid:0x6d00050f33 txntype:-1 reqpath:n/a ErrorPath:/yarn-leader-election/yarnRM/ActiveBreadCrumb Error:KeeperErrorCode =BadVersion for /yarn-leader-election/yarnRM/ActiveBreadCrumb

2017-09-0109:45:52,000 INFO : Expiring session0x35e2a2922671668, timeout of 5000ms exceeded

2017-09-0109:45:52,001 INFO : Processedsession termination for sessionid: 0x35e2a2922671668

2017-09-0109:45:52,482 INFO : Acceptedsocket connection from /10.117.68.10:33589

2017-09-0109:45:52,484 INFO : Clientattempting to renew session 0x15e2a2922a21870 at /10.117.68.10:33589

2017-09-0109:45:52,484 INFO : Establishedsession 0x15e2a2922a21870 with negotiated timeout 5000 for client/10.117.68.10:33589

2017-09-0109:45:52,884 WARN : caught end ofstream exception

2017-09-0109:45:52,884 INFO : Closed socketconnection for client /10.51.20.155:40712 which had sessionid 0x15de73379b60000

2017-09-0109:45:53,244 INFO : Acceptedsocket connection from /10.51.20.155:40730

2017-09-0109:46:36,520 INFO : Acceptedsocket connection from /10.51.20.155:40749

2017-09-01 09:46:29,773 WARN :fsync-ing the write ahead log in SyncThread:3 took53812ms which will adversely effect operation latency. See theZooKeeper troubleshooting guide

2017-09-01 09:46:17,239 : Shutting down

2017-09-0109:45:58,000 INFO : Expiring session0x15e2a2922a21870, timeout of 5000ms exceeded

2017-09-01 09:45:53,439 : Unexpected Exception:

 (:73) (:77) (:418) (:1509)(:171)(:73)

2017-09-0109:46:36,648 INFO : Expiring session0x15e2a2922a21871, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO : Expiring session0x15e2a2922a213b5, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO : Expiring session0x15de73379b60000, timeout of 6000ms exceeded

2017-09-0109:46:36,648 INFO : Expiring session0x25de7337bb00000, timeout of 6000ms exceeded

2017-09-0109:46:36,648 INFO : Expiring session0x35e2a2922671890, timeout of 10000ms exceeded

2017-09-0109:46:36,648 INFO : Expiring session0x15e2a2922a213b6, timeout of 10000ms exceeded

2017-09-01 09:46:36,647 : Shutdown called

从日志信息可以看出,zookeerper 服务在同步日志过程中耗时太长,花了53812ms(正常应该在3秒内,server和client心跳检测时间这边设置为5s),同步日志会导致ZK无法响应外部请求,进而引发session过期,进而引发zk 服务端shut down。

从监控及分析结果来看,均属ZK服务端在fsync-ing the write ahead log日志时超长引起。

关于ZK日志存放,官网给出如下建议:

Having a dedicated log devicehas a large impact on throughput and stable latencies. It is highly recommenedto dedicate a log device and set dataLogDir to point to a directory on thatdevice, and then make sure to point dataDir to a directory not residing on thatdevice.

故为避免此类问题,dataLogDir存放目录应该与dataDir分开,可单独采用一套存储设备来存放ZK日志。

在中增加:forceSync=no

默认是开启的,为避免同步延迟问题,ZK接收到数据后会立刻去讲当前状态信息同步到磁盘日志文件中,同步完成后才会应答。将此项关闭后,客户端连接可以得到快速响应。

关闭forceSync选项后,会存在潜在风险,虽然依旧会刷磁盘(()首先被执行),但因为操作系统为提高写磁盘效率,会先写缓存,当机器异常后,可能导致一些zk状态信息没有同步到磁盘,从而带来ZK前后信息不一样问题。