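I have recently been debugging HBase on a 10-node cluster. Once the services were up and data was being written, RegionServers kept shutting down on their own. The log shows the following: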
2016-05-04 13:29:09,690 WARN [regionserver//192.168.1.46:16020] : Failed to write trailer, non-fatal, continuing...
(): No lease on /apps/hbase/data/oldWALs/%2C16020%.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
at (:3454)
at (:3354)
at (:823)
at
(:515)
at $ClientNamenodeProtocol$
()
at $Server$(:616)
at $(:969)
at $Handler$(:2151)
at $Handler$(:2147)
at (Native Method)
at (:422)
at (:1657)
at $(:2145)
at (:1411)
at (:1364)
at $(:206)
at .$(Unknown Source)
at
(:393)
at .invoke0(Native Method)
at (:62)
at (:43)
at (:497)
at (:187)
at (:102)
at .$(Unknown Source)
at .invoke0(Native Method)
at (:62)
at (:43)
at (:497)
at $(:279)
at .$(Unknown Source)
at $DataStreamer.addDatanode2ExistingPipeline(:1028)
at $(:1184)
at $(:933)
at $(:487)
2016-05-04 13:29:09,692 ERROR [regionserver//192.168.1.46:16020] : Shutdown / close of WAL failed: : No lease on /apps/hbase/data/oldWALs/%2C16020%.1462336775368 (inode 294646): File is not open for writing.
Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
at (:3454)
at (:3354)
at (:823)
at
(:515)
at $ClientNamenodeProtocol$
()
at $Server$(:616)
at $(:969)
at $Handler$(:2151)
at $Handler$(:2147)
at (Native Method)
at (:422)
at (:1657)
at $(:2145)
2016-05-04 13:29:09,702 INFO [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 closing leases
2016-05-04 13:29:09,702 INFO [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 closed leases
2016-05-04 13:29:09,702 INFO [regionserver//192.168.1.46:16020] : Chore service for: ,16020,1461926336242 had [[ScheduledChore: Name: ,16020,1461926336242-MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region ,16020,1461926336242 Period: 120000 Unit: MILLISECONDS]] on shutdown
2016-05-04 13:29:09,702 INFO [regionserver//192.168.1.46:16020] : Waiting for Split Thread to finish...
2016-05-04 13:29:09,703 INFO [regionserver//192.168.1.46:16020] : Waiting for Merge Thread to finish...
2016-05-04 13:29:09,703 INFO [regionserver//192.168.1.46:16020] : Waiting for Large Compaction Thread to finish...
2016-05-04 13:29:09,703 INFO [regionserver//192.168.1.46:16020] : Waiting for Small Compaction Thread to finish...
2016-05-04 13:29:09,703 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181,
exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:10,703 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181,
exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:12,704 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181,
exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:16,704 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181,
exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:18,007 INFO [regionserver//192.168.1.46:] : regionserver//192.168.1.46: closing leases
2016-05-04 13:29:18,008 INFO [regionserver//192.168.1.46:] : regionserver//192.168.1.46: closed leases
2016-05-04 13:29:24,704 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181,
exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:24,704 ERROR [regionserver//192.168.1.46:16020] : ZooKeeper getChildren failed after 4 attempts
2016-05-04 13:29:24,704 WARN [regionserver//192.168.1.46:16020] : regionserver:16020-0x15460f0ceb70046, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, baseZNode=/hbase-unsecure Unable to list children of znode /hbase-unsecure/replication/rs/,16020,1461926336242
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
at (:127)
at (:51)
at (:1472)
at (:295)
at (:454)
at (:482)
at (:1461)
at (:1383)
at (:1265)
at (:187)
at (:292)
at (:180)
at (:172)
at (:2137)
at (:1071)
at (:745)
2016-05-04 13:29:24,705 ERROR [regionserver//192.168.1.46:16020] : regionserver:16020-0x15460f0ceb70046, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
at (:127)
at (:51)
at (:1472)
at (:295)
at (:454)
at (:482)
at (:1461)
at (:1383)
at (:1265)
at (:187)
at (:292)
at (:180)
at (:172)
at (:2137)
at (:1071)
at (:745)
2016-05-04 13:29:24,705 INFO [regionserver//192.168.1.46:16020] : Stopping server on 16020
2016-05-04 13:29:24,705 INFO [,port=16020] : ,port=16020: stopping
2016-05-04 13:29:24,706 INFO [] : : stopped
2016-05-04 13:29:24,706 INFO [] : : stopping
2016-05-04 13:29:24,706 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:25,706 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:27,707 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:31,707 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:39,707 WARN [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:39,707 ERROR [regionserver//192.168.1.46:16020] : ZooKeeper delete failed after 4 attempts
2016-05-04 13:29:39,707 WARN [regionserver//192.168.1.46:16020] : Failed deleting my ephemeral node
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
at (:127)
at (:51)
at (:873)
at (:178)
at (:1221)
at (:1210)
at (:1403)
at (:1079)
at (:745)
2016-05-04 13:29:39,708 INFO [regionserver//192.168.1.46:16020] : stopping server ,16020,1461926336242; zookeeper connection closed.
2016-05-04 13:29:39,708 INFO [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 exiting
2016-05-04 13:29:39,708 ERROR [main] : Region server exiting
: HRegionServer Aborted
at (:68)
at (:87)
at (:70)
at (:126)
at (:2651)
2016-05-04 13:29:39,710 INFO [Thread-7] : Shutdown hook starting; =true; fsShutdownHook=$Cache$ClientFinalizer@7a7471ce
2016-05-04 13:29:39,710 INFO [Thread-7] : Starting fs shutdown hook thread.
2016-05-04 13:29:39,710 INFO [Thread-7] : Shutdown hook finished.
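Analysis:
The failure looks ZooKeeper-related: the log shows the RegionServer's ZooKeeper session expiring, after which the server aborts. The monitoring also showed that memory sometimes dropped to 0 while network traffic was high, so the cluster was most likely in the middle of heavy writes. A plausible explanation is a long GC pause that outlasts the ZooKeeper session timeout. Reports found online say this problem is addressed by tuning the JVM parameters.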
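To check how often the session expiry shows up before and after a change, the log can be scanned for the exception name. This is only a minimal sketch: the function name `count_session_expired` is made up for illustration, and the log file path has to be supplied by the caller because it depends on the installation.

```shell
#!/usr/bin/env bash
# Count ZooKeeper session-expiry lines in a RegionServer log.
# Hypothetical helper; the log path is an argument, no default is assumed.
count_session_expired() {
  local logfile="$1"
  # grep -c prints the number of matching lines. It exits with status 1
  # when the count is 0, so "|| true" keeps this usable under "set -e".
  grep -c 'SessionExpiredException' "$logfile" || true
}
```

Calling `count_session_expired /path/to/regionserver.log` prints a single number. A count that stays at 0 under the same write load would suggest the tuning helped.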
Changes made through Ambari:
Before:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmn{{regionserver_xmn_size}} -XX:CMSInitiatingOccupancyFraction=70 -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"
After:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=8 -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:InitiatingHeapOccupancyPercent=75 -XX:NewRatio=39 -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"
Before:
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid% -={{java_io_tmpdir}}"
After (-XX:+UseConcMarkSweepGC is removed, because G1 is now selected in HBASE_REGIONSERVER_OPTS):
export HBASE_OPTS="$HBASE_OPTS -XX:ErrorFile={{log_dir}}/hs_err_pid% -={{java_io_tmpdir}}"
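The CMS flag has to leave HBASE_OPTS because those options are combined with HBASE_REGIONSERVER_OPTS on the RegionServer command line, and HotSpot refuses to start when more than one collector is selected. The sketch below counts collector-selection flags in an options string so the conflict can be caught before a restart. The function name `count_gc_selectors` is hypothetical, and the list of flags it recognizes is an assumption covering only the common HotSpot collectors.

```shell
#!/usr/bin/env bash
# Count how many collector-selection flags appear in a JVM options string.
# Hypothetical helper; only these four common HotSpot flags are recognized.
count_gc_selectors() {
  local opts="$1" count=0 flag
  # Intentionally unquoted: split the options string on whitespace.
  for flag in $opts; do
    case "$flag" in
      -XX:+UseG1GC|-XX:+UseConcMarkSweepGC|-XX:+UseParallelGC|-XX:+UseSerialGC)
        count=$((count + 1)) ;;
    esac
  done
  echo "$count"
}
```

A result greater than 1 for the combined string, for example `count_gc_selectors "$HBASE_OPTS $HBASE_REGIONSERVER_OPTS"`, means two collectors are requested at once.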
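Explanation: the heap is set to 40 GB with a 1 GB young generation, and the collector is switched to G1.
-XX:NewRatio=39
sets the ratio of the old generation to the young generation, so the young generation gets
1/(39+1) * 40 GB = 1 GB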
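The same arithmetic can be written as a small shell function, which is handy when trying other heap sizes or ratios. It is a sketch only: `young_gen_mb` is a made-up name, sizes are whole megabytes, and integer division rounds down.

```shell
#!/usr/bin/env bash
# Young generation size implied by NewRatio: young = heap / (NewRatio + 1).
# Hypothetical helper; both arguments are integers, the heap is in MB.
young_gen_mb() {
  local heap_mb="$1" new_ratio="$2"
  echo $(( heap_mb / (new_ratio + 1) ))
}

young_gen_mb 40960 39   # 40 GB heap, NewRatio=39: prints 1024
```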
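With the default CMS collector, exceptions kept occurring and the RegionServer ended up killing itself.
-Xmn{{regionserver_xmn_size}} sets the young generation size explicitly.
It may not be appropriate under G1 (the young generation is now sized through -XX:NewRatio instead), so it was removed.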
References:
/chengxin1982/p/
/zhenjing/archive/2012/11/13/hbase_is_OK.html