Our big-data stack is deployed on Kingsoft Cloud. When a RegionServer (RS) process in the HBase cluster dies, it does not come back automatically and has to be restarted by hand, so I wrote a scheduled monitoring script that starts an RS as soon as it is found dead, keeping the failed node available for HBase reads and writes. Restarting promptly also spares the surviving RegionServers from extra load: HMaster watches the ephemeral znodes that each RS registers in ZooKeeper, and as soon as it sees an RS go offline it reassigns that server's regions to the remaining healthy RegionServers.
The script runs a check every 30 minutes: it walks through nodes kmr-core-machine-001-kingsoft to kmr-core-machine-008-kingsoft, tests whether the RS process is running on each, and starts it if it is not.
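A minimal sketch of that watchdog loop follows. The ssh-based process check, the host-name pattern, and the `restart_regionserver` call are assumptions modeled on the log output below; the real script (which apparently checks "by Ambari") may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 30-minute RegionServer watchdog.

HOSTS=$(printf 'kmr-core-machine-%03d-kingsoft ' $(seq 1 8))

# Echo 1 if a RegionServer JVM appears in the given `ps` output, else 0
# (this mirrors the "result is____ 1/0" lines in the log).
check_rs() {
  if grep -q 'proc_regionserver' <<<"$1"; then echo 1; else echo 0; fi
}

check_then_start_RS() {
  echo "IN FUNCTION check_then_start_RS()"
  for host in $HOSTS; do
    result=$(check_rs "$(ssh "$host" ps aux)")
    echo "$(date '+%F %T') Checked regionServer in $host, result is____ $result"
    # If the RS is down, start it right away (restart_regionserver is the
    # Ambari-backed start routine the log refers to, not defined here).
    [ "$result" -eq 0 ] && restart_regionserver "$host"
  done
}
```

Run from cron (or a `sleep 1800` loop) so the check fires every half hour.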
Below is a summary of the script's log output.
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:17:01 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 0, now Starting......
IN FUNCTION restart_regionserver()
{
"href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/669",
"Requests" : {
"id" : 669,
"status" : "Accepted"
}
}Now checking IF kmr-core-machine-005-kingsoft IS RUNNING RegionServer process......
IN FUNCTION after_start_RS()
RegionServer process on kmr-core-machine-005-kingsoft is =======> hbase 17021 156 3.3 14738868 1022152 ? Sl 03:47 0:09 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 0, now Starting......
IN FUNCTION restart_regionserver()
{
"href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/670",
"Requests" : {
"id" : 670,
"status" : "Accepted"
}
}Now checking IF kmr-core-machine-007-kingsoft IS RUNNING RegionServer process......
IN FUNCTION after_start_RS()
RegionServer process on kmr-core-machine-007-kingsoft is =======> hbase 17851 96.0 0.9 14569660 293848 ? Sl 03:47 0:03 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-007-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
2017-12-15 03:47:26 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1
The lines above for the failed nodes tell the story: the RegionServers on nodes 005 and 007 had died, the scheduled check detected this and automatically restarted them, then verified that the RS on each node had come back up.
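The `restart_regionserver()` step talks to the Ambari REST API; the `"status" : "Accepted"` JSON captured above is a typical Ambari response to a queued operation. A sketch of what that call might look like; the `admin:admin` credentials and the request context string are assumptions, while the cluster name `ks-ksai_kmr` comes from the URLs in the log:

```shell
AMBARI="http://localhost:8080/api/v1/clusters/ks-ksai_kmr"

# URL of the HBASE_REGIONSERVER host component on a given node.
rs_component_url() {
  echo "$AMBARI/hosts/$1/host_components/HBASE_REGIONSERVER"
}

restart_regionserver() {
  echo "IN FUNCTION restart_regionserver()"
  # PUT the desired state; Ambari queues the operation and replies with a
  # request id and "status" : "Accepted", as seen in the log above.
  curl -s -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
    -d '{"RequestInfo":{"context":"Start RegionServer"},"Body":{"HostRoles":{"state":"STARTED"}}}' \
    "$(rs_component_url "$1")"
}
```

Because the call is asynchronous (it only queues a request), the script then polls the node to confirm the RS process is actually running, as the `after_start_RS()` output shows.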
Taking node 005 as the example, let's analyze why its RegionServer died.
The monitoring log above shows that the script found node 005's RS dead at 2017-12-15 03:47:04 and started it immediately. So what happened before 03:47? To answer that we need to log in to the node and inspect its RegionServer log. Since my script restarted the RS on node 005 around 2017-12-15 03:47:15, we only need the entries from before 03:47; an excerpt follows.
[root@kmr-core-machine-005-kingsoft hbase]# vim hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log
2017-12-15 03:13:39,340 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7650ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=8002ms
2017-12-15 03:13:51,243 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10402ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10826ms
2017-12-15 03:13:51,244 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:04,386 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13141ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:14:04,388 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:04,388 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12143ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12330ms
2017-12-15 03:14:17,545 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:17,546 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13158ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:14:17,552 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12162ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12534ms
2017-12-15 03:14:17,682 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:28,499 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10445ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10493ms
2017-12-15 03:14:28,501 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:40,068 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:40,068 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11068ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11123ms
2017-12-15 03:14:40,069 WARN [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.HeapMemoryManager: heapOccupancyPercent 0.9677683 is above heap occupancy alarm watermark (0.95)
2017-12-15 03:14:48,046 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:48,048 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7479ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=7739ms
2017-12-15 03:14:59,566 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:59,566 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10515ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10863ms
2017-12-15 03:15:08,837 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:08,838 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8771ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9043ms
2017-12-15 03:15:20,008 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10669ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10914ms
2017-12-15 03:15:20,008 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:29,353 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:29,354 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8845ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9168ms
2017-12-15 03:15:35,351 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:35,351 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5497ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=5831ms
2017-12-15 03:15:47,079 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:47,081 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11228ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11603ms
2017-12-15 03:16:00,418 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13331ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:16:00,419 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:16:00,429 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12839ms
Note the warnings before 03:47: the log is full of "long garbage collecting pause" and JvmPauseMonitor entries. If you add up the millisecond pause durations at the end of each JvmPauseMonitor line, the accumulated pause time exceeds the ZooKeeper session timeout we had configured. In other words, the RS failed to renew its ephemeral znode in ZooKeeper before the session expired, so ZooKeeper concluded the RS was dead and deleted the znode, taking that RS offline; when the GC finally finished and the RS went to refresh its znode, the node no longer existed (ZooKeeper had just removed it). A GC of this severity is a stop-the-world pause: every thread in the process blocks, including the one that keeps the ZooKeeper ephemeral znode alive.
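A quick way to sanity-check that claim is to sum the pause durations reported by JvmPauseMonitor. A small helper, assuming the log format shown above:

```shell
# Sum the "pause of approximately NNNNms" durations read on stdin
# and print the total in milliseconds.
sum_pauses() {
  grep -o 'pause of approximately [0-9]*ms' |
    awk '{ total += $4 } END { print total + 0 }'
}

# Usage:
#   sum_pauses < hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log
```

Feeding just the 03:13-03:16 excerpt above through it yields well over 100,000 ms of accumulated pauses, comfortably past a 90,000 ms session timeout (90 s is HBase's default `zookeeper.session.timeout`; this cluster's configured value may differ).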
The root cause is that cluster resources are currently limited: the memory allocated to each HBase RegionServer is insufficient, while clients send a heavy stream of write requests. The writes create large numbers of objects that fill up the JVM heap; when the heap runs short, the JVM triggers garbage collection, and the other threads can only resume once that GC completes. When this happens it is literally stop-the-world: the whole process freezes and the RegionServer can no longer serve requests.
There are two remedies:
First, lengthen the ZooKeeper session timeout, for example to 5 minutes as I did here; this mitigates the RS-offline problem caused by GC pauses overrunning the timeout.
Second, raise the number of HDFS xceiver threads. Under heavy concurrent writes this speeds up flushing the HLog and memstores to HDFS, which in turn eases heap pressure and GC on the RS. I raised the value to 4096, but it should not go much higher, because:
1. Each thread needs its own stack, and stacks take memory, 1 MB per thread by default; 4096 threads therefore need 4 GB of memory just for stacks, which squeezes the memstores, block cache, and the rest of the JVM and can end in an OutOfMemoryError. So the value must not be set too high.
2. Too many threads also drive up CPU load, causing excessive context switching among the parallel tasks, which wastes the resources the real work needs.
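The two settings above could look like this; the values are the ones discussed in the text, but treat this as a sketch to adapt. The session timeout goes in hbase-site.xml, and the xceiver ceiling is `dfs.datanode.max.transfer.threads` in hdfs-site.xml (the modern name of the old `dfs.datanode.max.xcievers`). Note that the ZooKeeper server's own `maxSessionTimeout` must also permit a 5-minute session, or the server will silently cap the negotiated value.

```xml
<!-- hbase-site.xml: let the RS survive GC pauses of up to 5 minutes -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>

<!-- hdfs-site.xml: xceiver thread ceiling on each DataNode -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```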
[Digression] JVM GC in brief
1. Full GC (a.k.a. Major GC) means a stop-the-world collection of the old generation / permanent generation. All application threads are paused and only the GC threads run. For a Minor GC the STW time is spent mainly in the copying phase; for CMS the STW time is spent mainly in the phases that mark garbage objects.
2. Full GC count = the number of stop-the-world pauses during old-generation GC.
3. Full GC time = the total stop-the-world time during old-generation GC.
4. CMS is not the same thing as Full GC. CMS runs in several phases, and only the stop-the-world phases are counted toward Full GC count and time; the phases that run concurrently with application threads are not counted as Full GC.
5. A Full GC does not by itself run a Minor GC first, but this can be configured: since many old-generation objects hold references into the young generation, running a Minor GC first speeds up old-generation collection. For example, when the old generation uses CMS, set CMSScavengeBeforeRemark so that a Minor GC runs before the CMS remark phase.
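For example, a hypothetical hbase-env.sh fragment; the CMS flags mirror the RS command line captured earlier in the log, with `CMSScavengeBeforeRemark` being the addition point 5 describes:

```shell
# Keep the existing CMS settings and add CMSScavengeBeforeRemark so a
# Minor GC runs before the (stop-the-world) CMS remark pause.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSScavengeBeforeRemark"
```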