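Troubleshooting HBase RegionServers that keep going down on their own

Date: 2025-04-12 08:04:52

I have recently been debugging HBase on a 10-node cluster. Once the services are up and data is being written, the RegionServers keep going down on their own. The log shows the following: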

2016-05-04 13:29:09,690 WARN  [regionserver//192.168.1.46:16020] : Failed to write trailer, non-fatal, continuing...
(): No lease on /apps/hbase/data/oldWALs/%2C16020%.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at (:3454)
    at (:3354)
    at (:823)
    at (:515)
    at $ClientNamenodeProtocol$()
    at $Server$(:616)
    at $(:969)
    at $Handler$(:2151)
    at $Handler$(:2147)
    at (Native Method)
    at (:422)
    at (:1657)
    at $(:2145)

    at (:1411)
    at (:1364)
    at $(:206)
    at .$(Unknown Source)
    at (:393)
    at .invoke0(Native Method)
    at (:62)
    at (:43)
    at (:497)
    at (:187)
    at (:102)
    at .$(Unknown Source)
    at .invoke0(Native Method)
    at (:62)
    at (:43)
    at (:497)
    at $(:279)
    at .$(Unknown Source)
    at $DataStreamer.addDatanode2ExistingPipeline(:1028)
    at $(:1184)
    at $(:933)
    at $(:487)
2016-05-04 13:29:09,692 ERROR [regionserver//192.168.1.46:16020] : Shutdown / close of WAL failed: : No lease on /apps/hbase/data/oldWALs/%2C16020%.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at (:3454)
    at (:3354)
    at (:823)
    at (:515)
    at $ClientNamenodeProtocol$()
    at $Server$(:616)
    at $(:969)
    at $Handler$(:2151)
    at $Handler$(:2147)
    at (Native Method)
    at (:422)
    at (:1657)
    at $(:2145)

2016-05-04 13:29:09,702 INFO  [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 closing leases
2016-05-04 13:29:09,702 INFO  [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 closed leases
2016-05-04 13:29:09,702 INFO  [regionserver//192.168.1.46:16020] : Chore service for: ,16020,1461926336242 had [[ScheduledChore: Name: ,16020,1461926336242-MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region ,16020,1461926336242 Period: 120000 Unit: MILLISECONDS]] on shutdown
2016-05-04 13:29:09,702 INFO  [regionserver//192.168.1.46:16020] : Waiting for Split Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver//192.168.1.46:16020] : Waiting for Merge Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver//192.168.1.46:16020] : Waiting for Large Compaction Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver//192.168.1.46:16020] : Waiting for Small Compaction Thread to finish...
2016-05-04 13:29:09,703 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:10,703 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:12,704 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:16,704 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:18,007 INFO  [regionserver//192.168.1.46:] : regionserver//192.168.1.46: closing leases
2016-05-04 13:29:18,008 INFO  [regionserver//192.168.1.46:] : regionserver//192.168.1.46: closed leases
2016-05-04 13:29:24,704 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
2016-05-04 13:29:24,704 ERROR [regionserver//192.168.1.46:16020] : ZooKeeper getChildren failed after 4 attempts
2016-05-04 13:29:24,704 WARN  [regionserver//192.168.1.46:16020] : regionserver:16020-0x15460f0ceb70046, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, baseZNode=/hbase-unsecure Unable to list children of znode /hbase-unsecure/replication/rs/,16020,1461926336242
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
    at (:127)
    at (:51)
    at (:1472)
    at (:295)
    at (:454)
    at (:482)
    at (:1461)
    at (:1383)
    at (:1265)
    at (:187)
    at (:292)
    at (:180)
    at (:172)
    at (:2137)
    at (:1071)
    at (:745)
2016-05-04 13:29:24,705 ERROR [regionserver//192.168.1.46:16020] : regionserver:16020-0x15460f0ceb70046, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/,16020,1461926336242
    at (:127)
    at (:51)
    at (:1472)
    at (:295)
    at (:454)
    at (:482)
    at (:1461)
    at (:1383)
    at (:1265)
    at (:187)
    at (:292)
    at (:180)
    at (:172)
    at (:2137)
    at (:1071)
    at (:745)
2016-05-04 13:29:24,705 INFO  [regionserver//192.168.1.46:16020] : Stopping server on 16020
2016-05-04 13:29:24,705 INFO  [,port=16020] : ,port=16020: stopping
2016-05-04 13:29:24,706 INFO  [] : : stopped
2016-05-04 13:29:24,706 INFO  [] : : stopping
2016-05-04 13:29:24,706 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:25,706 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:27,707 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:31,707 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:39,707 WARN  [regionserver//192.168.1.46:16020] : Possibly transient ZooKeeper, quorum=mapping2.:2181,mapping1.:2181,mapping3.:2181, exception=$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
2016-05-04 13:29:39,707 ERROR [regionserver//192.168.1.46:16020] : ZooKeeper delete failed after 4 attempts
2016-05-04 13:29:39,707 WARN  [regionserver//192.168.1.46:16020] : Failed deleting my ephemeral node
$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/,16020,1461926336242
    at (:127)
    at (:51)
    at (:873)
    at (:178)
    at (:1221)
    at (:1210)
    at (:1403)
    at (:1079)
    at (:745)
2016-05-04 13:29:39,708 INFO  [regionserver//192.168.1.46:16020] : stopping server ,16020,1461926336242; zookeeper connection closed.
2016-05-04 13:29:39,708 INFO  [regionserver//192.168.1.46:16020] : regionserver//192.168.1.46:16020 exiting
2016-05-04 13:29:39,708 ERROR [main] : Region server exiting
: HRegionServer Aborted
    at (:68)
    at (:87)
    at (:70)
    at (:126)
    at (:2651)
2016-05-04 13:29:39,710 INFO  [Thread-7] : Shutdown hook starting; =true; fsShutdownHook=$Cache$ClientFinalizer@7a7471ce
2016-05-04 13:29:39,710 INFO  [Thread-7] : Starting fs shutdown hook thread.
2016-05-04 13:29:39,710 INFO  [Thread-7] : Shutdown hook finished.


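Analysis:

This looks related to ZooKeeper: the RegionServer's session expired and it then shut itself down. I also checked the monitoring and found that free memory sometimes drops to 0 while network traffic is fairly heavy, which should be the data being written. A long garbage-collection pause in the RegionServer can outlast the ZooKeeper session timeout, which would explain the SessionExpiredException entries, and what I found online says this problem calls for tuning the JVM parameters.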
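Before changing any JVM settings, it can help to see how often the session actually expired and when it first happened. The sketch below is only an illustration: the function names `count_session_expired` and `first_session_expired` are made up for this note, and it assumes the log uses the same 23-character timestamp prefix as the excerpt above.

```shell
# Count the log lines that mention a ZooKeeper session expiration.
# Usage: count_session_expired /path/to/regionserver.log
count_session_expired() {
  local log_file="$1"
  # grep -c prints the number of matching lines; it exits non-zero when
  # nothing matches, so "|| true" keeps this usable under "set -e".
  grep -c 'SessionExpiredException' "$log_file" || true
}

# Print the timestamp of the first session expiration in the log.
# Assumes each log line starts with "YYYY-MM-DD HH:MM:SS,mmm" (23 characters).
first_session_expired() {
  local log_file="$1"
  grep -m 1 'SessionExpiredException' "$log_file" | cut -c1-23
}
```

Comparing the first timestamp against the memory and network graphs from the monitoring is one way to check whether the expiration lines up with a write burst.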


Changes made in Ambari:

Before:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmn{{regionserver_xmn_size}} -XX:CMSInitiatingOccupancyFraction=70  -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"
After:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=8 -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:InitiatingHeapOccupancyPercent=75 -XX:NewRatio=39 -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"

Before:
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid% -={{java_io_tmpdir}}"
After:
export HBASE_OPTS="$HBASE_OPTS -XX:ErrorFile={{log_dir}}/hs_err_pid% -={{java_io_tmpdir}}"
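The second change matters because HBASE_OPTS and HBASE_REGIONSERVER_OPTS both end up on the RegionServer's command line. If -XX:+UseConcMarkSweepGC stayed in HBASE_OPTS while -XX:+UseG1GC was added to HBASE_REGIONSERVER_OPTS, the JVM would be asked for two collectors at once, which HotSpot rejects as a conflicting collector combination. The helper below is a rough sketch for checking an option string by hand; `check_gc_flags` is a hypothetical name, and the list of flags it looks for is an assumption covering only the common HotSpot collectors.

```shell
# Report whether a JVM option string selects more than one garbage collector.
# Returns 0 when at most one collector flag is present, 1 otherwise.
check_gc_flags() {
  local opts="$1"
  local count=0
  local flag
  for flag in -XX:+UseG1GC -XX:+UseConcMarkSweepGC -XX:+UseParallelGC -XX:+UseSerialGC; do
    # Pad with spaces so only whole, space-separated options match.
    case " $opts " in
      *" $flag "*) count=$((count + 1)) ;;
    esac
  done
  if [ "$count" -gt 1 ]; then
    echo "conflict: $count collector flags selected"
    return 1
  fi
  echo "ok: $count collector flag(s) selected"
}
```

Running it on the combined value, for example `check_gc_flags "$HBASE_OPTS $HBASE_REGIONSERVER_OPTS"`, gives a quick sanity check before restarting the service.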


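Explanation: the heap is set to 40G with a 1G young generation, and the collector is switched to G1.

-XX:NewRatio=39
sets the ratio of the old generation to the young generation to 39:1, so the young generation gets

1/(39+1) * 40G = 1G

With the default CMS collector, exceptions kept occurring and the RegionServer ended up killing itself.

-Xmn{{regionserver_xmn_size}} sets the young generation size.
It may no longer be appropriate under G1, so it was removed.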
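The young generation arithmetic can be reproduced with a couple of lines of shell. This is just a worked example; `young_gen_gb` is a name invented for this note, and it uses integer division, so it only gives exact results when the heap divides evenly.

```shell
# Young generation size implied by -XX:NewRatio, in whole gigabytes.
# NewRatio is old:young, so young = heap / (NewRatio + 1).
young_gen_gb() {
  local heap_gb="$1"
  local new_ratio="$2"
  echo $(( heap_gb / (new_ratio + 1) ))
}

young_gen_gb 40 39   # prints 1
```

One caveat: Oracle's G1 tuning guidance advises against fixing the young generation size with -Xmn or -XX:NewRatio, because doing so overrides the pause-time goal set by -XX:MaxGCPauseMillis. The 1G figure is what was used here, not a general recommendation.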


References:

/chengxin1982/p/

/zhenjing/archive/2012/11/13/hbase_is_OK.html