Postmortem of a RegionServer Node Outage [HBase 1.1.2]

Date: 2021-08-26 13:47:49

Our big-data platform runs on Kingsoft Cloud. In the HBase cluster, a RegionServer (RS) process that dies will not restart on its own and has to be pulled back up manually, so I wrote a scheduled monitoring script that starts an RS the moment it is detected dead, keeping the failed node available for HBase reads and writes. Restarting immediately also avoids putting excessive pressure on the other RegionServers: HMaster watches the ephemeral znodes that each RS registers in ZooKeeper, and once it sees an RS go offline it reassigns that server's regions to the remaining healthy RegionServers.

The script runs a check every 30 minutes: each half hour it walks through nodes kmr-core-machine-001-kingsoft to kmr-core-machine-008-kingsoft, checks whether the RS process is running on each, and starts it if it is not.
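The post does not include the script itself, so the following is only a minimal sketch of the check-and-start loop. The cluster name and API port come from the request JSON in the log below; the Ambari credentials and the function bodies are my assumptions, not the real script:

```shell
#!/usr/bin/env bash
# Hedged sketch of check_then_start_RS(); the real script was not shown.
# AUTH is an assumption; the cluster name and port come from the log below.
AMBARI_URL="http://localhost:8080/api/v1"
CLUSTER="ks-ksai_kmr"
AUTH="admin:admin"   # assumption

# Build a node hostname from its number (5 -> kmr-core-machine-005-kingsoft).
host_of() { printf 'kmr-core-machine-%03d-kingsoft' "$1"; }

# Echo 1 if Ambari reports the RegionServer STARTED on the host, else 0 —
# matching the "result is____ 1/0" lines in the log.
rs_running() {
  curl -s -u "$AUTH" \
    "$AMBARI_URL/clusters/$CLUSTER/hosts/$1/host_components/HBASE_REGIONSERVER" \
    | grep -c '"state" : "STARTED"'
}

# Ask Ambari to start the RegionServer component on the host.
restart_regionserver() {
  curl -s -u "$AUTH" -H 'X-Requested-By: ambari' -X PUT \
    -d '{"HostRoles": {"state": "STARTED"}}' \
    "$AMBARI_URL/clusters/$CLUSTER/hosts/$1/host_components/HBASE_REGIONSERVER"
}

check_then_start_RS() {
  for n in 1 2 3 4 5 6 7 8; do
    host=$(host_of "$n")
    state=$(rs_running "$host")
    echo "$(date '+%F %T') Checked regionServer in $host by Ambari, result is____ $state"
    if [ "$state" -eq 0 ]; then
      restart_regionserver "$host"
    fi
  done
}
# Run from cron every 30 minutes, e.g.: */30 * * * * /path/to/check_then_start_RS.sh
```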

Below is a summary of the monitoring script's log output:

IN FUNCTION keep_inspecting()

Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft

IN FUNCTION check_then_start_RS()

2017-12-15 03:17:01 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 1

2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1

IN FUNCTION keep_inspecting()

Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft

IN FUNCTION check_then_start_RS()

2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1

2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1

2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1

2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1

2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 0, now Starting......

IN FUNCTION restart_regionserver()

{

  "href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/669",

  "Requests" : {

    "id" : 669,

    "status" : "Accepted"

  }

}
Now checking IF kmr-core-machine-005-kingsoft IS RUNNING RegionServer process......

IN FUNCTION after_start_RS()

RegionServer process on kmr-core-machine-005-kingsoft is =======> hbase    17021  156  3.3 14738868 1022152 ?    Sl   03:47   0:09 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start

2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1

2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 0, now Starting......

IN FUNCTION restart_regionserver()

{

  "href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/670",

  "Requests" : {

    "id" : 670,

    "status" : "Accepted"

  }

}
Now checking IF kmr-core-machine-007-kingsoft IS RUNNING RegionServer process......

IN FUNCTION after_start_RS()

RegionServer process on kmr-core-machine-007-kingsoft is =======> hbase    17851 96.0  0.9 14569660 293848 ?     Sl   03:47   0:03 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-007-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start

2017-12-15 03:47:26 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1

The lines for the failed nodes (shown in red in the original post) tell the story: the RegionServers on nodes 005 and 007 had died; when the scheduled check came around, each was automatically pulled back up, and the script then verified that the RS on that node had started successfully.

Taking node 005 as the example, let's analyze why its RegionServer died.

The monitoring log above shows that the script found node 005's RS dead at 2017-12-15 03:47:04 and started it immediately. So what happened before 03:47? To answer that we have to log in to the node and look at its own logs.

Since my monitoring script started the RS on node 005 at 2017-12-15 03:47:15, we only need the log records from before 03:47. A summary follows:

[root@kmr-core-machine-005-kingsoft hbase]# vim hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log

2017-12-15 03:13:39,340 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7650ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=8002ms

2017-12-15 03:13:51,243 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10402ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10826ms

2017-12-15 03:13:51,244 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:04,386 WARN  [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020] util.Sleeper: We slept 13141ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2017-12-15 03:14:04,388 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:04,388 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12143ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12330ms

2017-12-15 03:14:17,545 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:17,546 WARN  [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020] util.Sleeper: We slept 13158ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2017-12-15 03:14:17,552 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12162ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12534ms

2017-12-15 03:14:17,682 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:28,499 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10445ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10493ms

2017-12-15 03:14:28,501 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:40,068 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:40,068 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11068ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11123ms

2017-12-15 03:14:40,069 WARN  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.HeapMemoryManager: heapOccupancyPercent 0.9677683 is above heap occupancy alarm watermark (0.95)

2017-12-15 03:14:48,046 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:48,048 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7479ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=7739ms

2017-12-15 03:14:59,566 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:14:59,566 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10515ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10863ms

2017-12-15 03:15:08,837 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:15:08,838 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8771ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9043ms

2017-12-15 03:15:20,008 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10669ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10914ms

2017-12-15 03:15:20,008 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:15:29,353 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:15:29,354 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8845ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9168ms

2017-12-15 03:15:35,351 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:15:35,351 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5497ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=5831ms

2017-12-15 03:15:47,079 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:15:47,081 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11228ms

GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11603ms

2017-12-15 03:16:00,418 WARN  [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020] util.Sleeper: We slept 13331ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2017-12-15 03:16:00,419 INFO  [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time

2017-12-15 03:16:00,429 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12839ms

Note the highlighted lines: before 03:47 the log is full of "long garbage collecting pause" warnings from JvmPauseMonitor. Add up the XXXXms pause durations at the end of each JvmPauseMonitor line, and the total exceeds the ZooKeeper session timeout we had configured. In other words, this RS failed to refresh its ephemeral znode in ZooKeeper before the session expired, so ZooKeeper concluded the RS was dead and deleted the znode, taking the RS offline; when the GC on the node finally finished and the RS went to refresh its znode, the znode was already gone (ZooKeeper had just deleted it). A GC at this level is stop-the-world: every thread on the node blocks, including the thread that sends heartbeats to keep the ZooKeeper session, and with it the ephemeral znode, alive.
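That "add up the pauses" check can be done mechanically. Here is a small helper (the function name is mine, not from the original script) that totals the JvmPauseMonitor pause durations in a RegionServer log:

```shell
# Sum the "pause of approximately NNNNms" durations reported by
# JvmPauseMonitor in a RegionServer log, so the total can be compared
# against the ZooKeeper session timeout. Function name is hypothetical.
sum_gc_pause_ms() {
  grep -oE 'pause of approximately [0-9]+ms' "$1" \
    | grep -oE '[0-9]+' \
    | awk '{ total += $1 } END { print total + 0 }'
}
# e.g.: sum_gc_pause_ms hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log
```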

The root cause is that our cluster resources are currently limited: the memory allocated to the HBase RegionServers is not enough, while clients issue a heavy stream of HBase writes. The writes create a flood of objects, the objects fill up the JVM heap space, and the resulting memory pressure triggers garbage collection. Only when that GC finishes can the other threads be woken up and run again, so when this happens it really is stop-the-world: everything halts, and the RegionServer cannot keep serving.

There are two fixes:

First, lengthen the ZooKeeper session timeout; I set it to 5 minutes here. This mitigates the RS-offlining problem caused by GC pauses outlasting the session.
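In hbase-site.xml that looks something like the sketch below, using the 5-minute figure from the text. One caveat worth flagging: the ZooKeeper server caps client sessions at maxSessionTimeout (by default 20 × tickTime, i.e. 40 s), so the server-side cap has to be raised as well or the 300000 ms request is silently shortened. A sketch, not our exact production config:

```xml
<!-- hbase-site.xml: request a 5-minute ZooKeeper session -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>
```

```
# zoo.cfg on the ZooKeeper servers: lift the server-side session cap
maxSessionTimeout=300000
```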

Second, increase the HDFS xceiver thread count. Under highly concurrent HBase writes this speeds up flushing the HLog and MemStores to HDFS, which in turn eases GC pressure on the RS nodes. I raised this parameter to 4096, but it should not go any higher, because:
    1. Each thread needs its own stack, which costs memory; by default that is 1 MB per thread, so 4096 threads would need about 4 GB of memory just for stacks. That squeezes the MemStores and block cache inside the JVM and can end in an OutOfMemoryError, so the value must not be set too large.
    2. Too many threads overload the CPU: the context switching needed to juggle the parallel tasks eats into the resources available for real work.
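The xceiver count is a DataNode-side setting in hdfs-site.xml (older Hadoop releases spelled it dfs.datanode.max.xcievers); a sketch with the 4096 value mentioned above:

```xml
<!-- hdfs-site.xml: raise the DataNode transfer (xceiver) thread ceiling -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```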

 

[Digression] A brief overview of JVM GC

1. Full GC == Major GC: a stop-the-world collection of the old generation / permanent generation. All user threads are paused and only the GC threads run to reclaim garbage. For a Minor GC the STW time is spent mostly in the copying phase; for a CMS GC the STW time is spent mostly in its marking phases.

2. Full GC count = the number of stop-the-world old-generation collections.

3. Full GC time = the total stop-the-world time of old-generation collections.

4. CMS is not the same thing as Full GC. CMS runs in multiple phases, and only its stop-the-world phases are counted toward the Full GC count and time; the phases that run concurrently with application threads are not counted as Full GC.

5. A Full GC does not run a Minor GC first on its own, but it can be configured to: many old-generation objects reference young-generation objects, so a preceding Minor GC speeds up the old-generation collection. For example, with CMS in the old generation, the CMSScavengeBeforeRemark option makes a Minor GC run before the CMS remark phase.
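Point 5 maps to a concrete JVM flag. A hedged hbase-env.sh sketch that keeps the CMS settings visible in the ps output earlier and adds the remark-time optimization (the exact variable layout is my assumption; only CMSScavengeBeforeRemark is new relative to the flags we already run):

```shell
# hbase-env.sh (sketch): CMS flags as seen in the RegionServer's ps output,
# plus CMSScavengeBeforeRemark so a Minor GC runs just before the CMS
# remark pause, which shortens that stop-the-world phase.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms12288m -Xmx12288m -Xmn2456m \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSScavengeBeforeRemark"
```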