Our big-data stack is deployed on Kingsoft Cloud. When a RegionServer (RS) process in the HBase cluster dies, it does not come back automatically and has to be restarted by hand, so I wrote a scheduled monitoring script that starts an RS as soon as it is found dead, keeping the failed node available for HBase reads and writes. Restarting promptly also spares the surviving RegionServers from extra load: HMaster watches the ephemeral znodes that each RS registers in ZooKeeper, and as soon as it sees an RS go offline it reassigns that server's regions to the remaining healthy RegionServers.
The script runs a check every 30 minutes: it walks through nodes kmr-core-machine-001-kingsoft to kmr-core-machine-008-kingsoft, tests whether the RS process is running on each, and starts it if it is not.
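A minimal sketch of that watchdog loop follows. The ssh-based process check, the host-name pattern, and the `restart_regionserver` call are assumptions modeled on the log output below; the real script (which apparently checks "by Ambari") may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the 30-minute RegionServer watchdog.

HOSTS=$(printf 'kmr-core-machine-%03d-kingsoft ' $(seq 1 8))

# Echo 1 if a RegionServer JVM appears in the given `ps` output, else 0
# (this mirrors the "result is____ 1/0" lines in the log).
check_rs() {
  if grep -q 'proc_regionserver' <<<"$1"; then echo 1; else echo 0; fi
}

check_then_start_RS() {
  echo "IN FUNCTION check_then_start_RS()"
  for host in $HOSTS; do
    result=$(check_rs "$(ssh "$host" ps aux)")
    echo "$(date '+%F %T') Checked regionServer in $host, result is____ $result"
    # If the RS is down, start it right away (restart_regionserver is the
    # Ambari-backed start routine the log refers to, not defined here).
    [ "$result" -eq 0 ] && restart_regionserver "$host"
  done
}
```

Run from cron (or a `sleep 1800` loop) so the check fires every half hour.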
Below is a summary of the script's log output.
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:17:01 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 0, now Starting......
IN FUNCTION restart_regionserver()
{
"href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/669",
"Requests" : {
"id" : 669,
"status" : "Accepted"
}
}Now checking IF kmr-core-machine-005-kingsoft IS RUNNING RegionServer process......
IN FUNCTION after_start_RS()
RegionServer process on kmr-core-machine-005-kingsoft is =======> hbase 17021 156 3.3 14738868 1022152 ? Sl 03:47 0:09 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:15 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 0, now Starting......
IN FUNCTION restart_regionserver()
{
"href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/670",
"Requests" : {
"id" : 670,
"status" : "Accepted"
}
}Now checking IF kmr-core-machine-007-kingsoft IS RUNNING RegionServer process......
IN FUNCTION after_start_RS()
RegionServer process on kmr-core-machine-007-kingsoft is =======> hbase 17851 96.0 0.9 14569660 293848 ? Sl 03:47 0:03 /usr/java/oracle-jdk/bin/java -Dproc_regionserver -XX:OnOutOfMemoryError=kill -9 %p -Dhdp.version=2.5.0.0-1245 -XX:+UseConcMarkSweepGC -XX:ErrorFile=/mnt/log/hbase/hs_err_pid%p.log -Djava.io.tmpdir=/tmp -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/log/hbase/gc.log-201712150347 -Xmn2456m -XX:CMSInitiatingOccupancyFraction=70 -Xms12288m -Xmx12288m -Dhbase.log.dir=/mnt/log/hbase -Dhbase.log.file=hbase-hbase-regionserver-kmr-core-machine-007-kingsoft.log -Dhbase.home.dir=/usr/hdp/current/hbase-regionserver/bin/.. -Dhbase.id.str=hbase -Dhbase.root.logger=INFO,RFA -Djava.library.path=:/usr/hdp/2.5.0.0-1245/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.5.0.0-1245/hadoop/lib/native -Dhbase.security.logger=INFO,RFAS org.apache.hadoop.hbase.regionserver.HRegionServer start
2017-12-15 03:47:26 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1
The lines above for the failed nodes tell the story: the RegionServers on nodes 005 and 007 had died, the scheduled check detected this and automatically restarted them, then verified that the RS on each node had come back up.
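The `restart_regionserver()` step talks to the Ambari REST API; the `"status" : "Accepted"` JSON captured above is a typical Ambari response to a queued operation. A sketch of what that call might look like; the `admin:admin` credentials and the request context string are assumptions, while the cluster name `ks-ksai_kmr` comes from the URLs in the log:

```shell
AMBARI="http://localhost:8080/api/v1/clusters/ks-ksai_kmr"

# URL of the HBASE_REGIONSERVER host component on a given node.
rs_component_url() {
  echo "$AMBARI/hosts/$1/host_components/HBASE_REGIONSERVER"
}

restart_regionserver() {
  echo "IN FUNCTION restart_regionserver()"
  # PUT the desired state; Ambari queues the operation and replies with a
  # request id and "status" : "Accepted", as seen in the log above.
  curl -s -u admin:admin -H 'X-Requested-By: ambari' -X PUT \
    -d '{"RequestInfo":{"context":"Start RegionServer"},"Body":{"HostRoles":{"state":"STARTED"}}}' \
    "$(rs_component_url "$1")"
}
```

Because the call is asynchronous (it only queues a request), the script then polls the node to confirm the RS process is actually running, as the `after_start_RS()` output shows.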
Taking node 005 as the example, let's analyze why its RegionServer died.
The monitoring log above shows that the script found node 005's RS dead at 2017-12-15 03:47:04 and started it immediately. So what happened before 03:47? To answer that we need to log in to the node and inspect its RegionServer log. Since my script restarted the RS on node 005 around 2017-12-15 03:47:15, we only need the entries from before 03:47; an excerpt follows.
[root@kmr-core-machine-005-kingsoft hbase]# vim hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log
2017-12-15 03:13:39,340 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7650ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=8002ms
2017-12-15 03:13:51,243 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10402ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10826ms
2017-12-15 03:13:51,244 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:04,386 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13141ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:14:04,388 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:04,388 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12143ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12330ms
2017-12-15 03:14:17,545 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:17,546 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13158ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:14:17,552 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12162ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=12534ms
2017-12-15 03:14:17,682 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:28,499 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10445ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10493ms
2017-12-15 03:14:28,501 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:40,068 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_4] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:40,068 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11068ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11123ms
2017-12-15 03:14:40,069 WARN [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.HeapMemoryManager: heapOccupancyPercent 0.9677683 is above heap occupancy alarm watermark (0.95)
2017-12-15 03:14:48,046 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:48,048 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 7479ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=7739ms
2017-12-15 03:14:59,566 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:14:59,566 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10515ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10863ms
2017-12-15 03:15:08,837 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_1] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:08,838 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8771ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9043ms
2017-12-15 03:15:20,008 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 10669ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=10914ms
2017-12-15 03:15:20,008 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:29,353 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:29,354 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 8845ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=9168ms
2017-12-15 03:15:35,351 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:35,351 INFO [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 5497ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=5831ms
2017-12-15 03:15:47,079 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_3] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:15:47,081 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 11228ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=11603ms
2017-12-15 03:16:00,418 WARN [regionserver/kmr-core-machine-005-kingsoft/172.31.0.7:16020]util.Sleeper: We slept 13331ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2017-12-15 03:16:00,419 INFO [kmr-core-machine-005-kingsoft,16020,1513233984700_ChoreService_2] regionserver.StorefileRefresherChore: Chore: StorefileRefresherChore missed its start time
2017-12-15 03:16:00,429 WARN [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 12839ms
Note the warnings before 03:47: the log is full of "long garbage collecting pause" and JvmPauseMonitor entries. If you add up the millisecond pause durations at the end of each JvmPauseMonitor line, the accumulated pause time exceeds the ZooKeeper session timeout we had configured. In other words, the RS failed to renew its ephemeral znode in ZooKeeper before the session expired, so ZooKeeper concluded the RS was dead and deleted the znode, taking that RS offline; when the GC finally finished and the RS went to refresh its znode, the node no longer existed (ZooKeeper had just removed it). A GC of this severity is a stop-the-world pause: every thread in the process blocks, including the one that keeps the ZooKeeper ephemeral znode alive.
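A quick way to sanity-check that claim is to sum the pause durations reported by JvmPauseMonitor. A small helper, assuming the log format shown above:

```shell
# Sum the "pause of approximately NNNNms" durations read on stdin
# and print the total in milliseconds.
sum_pauses() {
  grep -o 'pause of approximately [0-9]*ms' |
    awk '{ total += $4 } END { print total + 0 }'
}

# Usage:
#   sum_pauses < hbase-hbase-regionserver-kmr-core-machine-005-kingsoft.log
```

Feeding just the 03:13-03:16 excerpt above through it yields well over 100,000 ms of accumulated pauses, comfortably past a 90,000 ms session timeout (90 s is HBase's default `zookeeper.session.timeout`; this cluster's configured value may differ).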
The root cause is that cluster resources are currently limited: the memory allocated to each HBase RegionServer is insufficient, while clients send a heavy stream of write requests. The writes create large numbers of objects that fill up the JVM heap; when the heap runs short, the JVM triggers garbage collection, and the other threads can only resume once that GC completes. When this happens it is literally stop-the-world: the whole process freezes and the RegionServer can no longer serve requests.
There are two remedies:
First, lengthen the ZooKeeper session timeout, for example to 5 minutes as I did here; this mitigates the RS-offline problem caused by GC pauses overrunning the timeout.
Second, raise the number of HDFS xceiver threads. Under heavy concurrent writes this speeds up flushing the HLog and memstores to HDFS, which in turn eases heap pressure and GC on the RS. I raised the value to 4096, but it should not go much higher, because:
1. Each thread needs its own stack, and stacks take memory, 1 MB per thread by default; 4096 threads therefore need 4 GB of memory just for stacks, which squeezes the memstores, block cache, and the rest of the JVM and can end in an OutOfMemoryError. So the value must not be set too high.
2. Too many threads also drive up CPU load, causing excessive context switching among the parallel tasks, which wastes the resources the real work needs.
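The two settings above could look like this; the values are the ones discussed in the text, but treat this as a sketch to adapt. The session timeout goes in hbase-site.xml, and the xceiver ceiling is `dfs.datanode.max.transfer.threads` in hdfs-site.xml (the modern name of the old `dfs.datanode.max.xcievers`). Note that the ZooKeeper server's own `maxSessionTimeout` must also permit a 5-minute session, or the server will silently cap the negotiated value.

```xml
<!-- hbase-site.xml: let the RS survive GC pauses of up to 5 minutes -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>

<!-- hdfs-site.xml: xceiver thread ceiling on each DataNode -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>4096</value>
</property>
```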
[Digression] JVM GC in brief
1. Full GC (a.k.a. Major GC) means a stop-the-world collection of the old generation / permanent generation. All application threads are paused and only the GC threads run. For a Minor GC the STW time is spent mainly in the copying phase; for CMS the STW time is spent mainly in the phases that mark garbage objects.
2. Full GC count = the number of stop-the-world pauses during old-generation GC.
3. Full GC time = the total stop-the-world time during old-generation GC.
4. CMS is not the same thing as Full GC. CMS runs in several phases, and only the stop-the-world phases are counted toward Full GC count and time; the phases that run concurrently with application threads are not counted as Full GC.
5. A Full GC does not by itself run a Minor GC first, but this can be configured: since many old-generation objects hold references into the young generation, running a Minor GC first speeds up old-generation collection. For example, when the old generation uses CMS, set CMSScavengeBeforeRemark so that a Minor GC runs before the CMS remark phase.
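For example, a hypothetical hbase-env.sh fragment; the CMS flags mirror the RS command line captured earlier in the log, with `CMSScavengeBeforeRemark` being the addition point 5 describes:

```shell
# Keep the existing CMS settings and add CMSScavengeBeforeRemark so a
# Minor GC runs before the (stop-the-world) CMS remark pause.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSScavengeBeforeRemark"
```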