hbase-1.1.2: troubleshooting a table region stuck in Regions In Transition (RIT)

Date: 2022-08-03 08:23:51

【Discovering the RIT problem】

One region of an HBase table is stuck in the Regions In Transition (RIT) state.

The region in RIT is 6f6f4f8204ccaa0b71df08591de155ba. The master web UI shows the following RIT details:
Region State RIT time (ms)
6f6f4f8204ccaa0b71df08591de155ba ks_namespace:active_users,9c1be2,1512918712031.6f6f4f8204ccaa0b71df08591de155ba. state=OFFLINE, ts=Thu Jan 25 09:15:31 CST 2018 (1657s ago), server=kmr-5b9c18fc-gn-7b3518df-core-1-006.ksc.com,16020,1513713156134 1657561
# The region has been in RIT for 1657561 ms, which is more than 27 minutes!
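
As a quick sanity check of the 27-minute figure, the millisecond value from the "RIT time (ms)" column can be converted with shell integer arithmetic. This is only a minimal sketch; the function name rit_minutes is made up for illustration and is not part of HBase.

```shell
# Hypothetical helper: convert an RIT duration in milliseconds
# to whole minutes (integer division, remainder discarded).
rit_minutes() {
  echo $(( $1 / 60000 ))
}

rit_minutes 1657561   # prints 27
```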

【Investigation】
(1) Check on HDFS whether this region has data. It does (if it had none, further handling would be needed).
[hdfs@hbase-master1 yarn]$ hdfs dfs -ls -R /apps/hbase/data/data/ks_namespace/ | grep --color 6f6f4f8204ccaa0b71df08591de155ba
drwxr-xr-x   - hbase hdfs          0 2017-12-20 03:23 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba
-rw-r--r--   3 hbase hdfs        105 2017-12-10 23:11 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba/.regioninfo
drwxr-xr-x   - hbase hdfs          0 2017-12-19 05:26 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba/cf
-rw-r--r--   3 hbase hdfs  386450467 2017-12-19 05:26 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba/cf/cfaa112cc5f449e0890411a35966825d
drwxr-xr-x   - hbase hdfs          0 2017-12-22 21:11 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba/recovered.edits
-rw-r--r--   3 hbase hdfs          0 2017-12-22 21:11 /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba/recovered.edits/206932.seqid
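
The "does the region have data" check above was done by eye. One way to sketch it as a script is to count the regular files in the listing that are neither .regioninfo nor under recovered.edits. The function name count_store_files is hypothetical, and the filter assumes the `hdfs dfs -ls -R` output format shown above (permissions in the first column, path in the last).

```shell
# Hypothetical helper: reads `hdfs dfs -ls -R` output on stdin and prints
# the number of regular files that look like store files, i.e. excluding
# .regioninfo and anything under recovered.edits/.
count_store_files() {
  awk '$1 ~ /^-/ && $NF !~ /\/\.regioninfo$/ && $NF !~ /\/recovered\.edits\// { n++ }
       END { print n + 0 }'
}

# Usage on a node with an HDFS client (not executed here):
#   hdfs dfs -ls -R /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba | count_store_files
```

For the listing above this would report 1, the single file under cf/.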


(2) Run hadoop fsck to check for corrupt file blocks. The result is healthy (if it were not, further handling would be needed).
[hdfs@hbase-master1 yarn]$ hadoop fsck /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Connecting to namenode via http://hbase-master1.ksc.com:50070/fsck?ugi=hdfs&path=%2Fapps%2Fhbase%2Fdata%2Fdata%2Fks_namespace%2Factive_users%2F6f6f4f8204ccaa0b71df08591de155ba
FSCK started by hdfs (auth:SIMPLE) from /172.31.0.10 for path /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba at Thu Jan 25 09:33:56 CST 2018
...Status: HEALTHY
 Total size: 386450572 B
 Total dirs: 3
 Total files: 3
 Total symlinks: 0
 Total blocks (validated): 4 (avg. block size 96612643 B)
 Minimally replicated blocks: 4 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 0 (0.0 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 3
 Average block replication: 3.0
 Corrupt blocks: 0
 Missing replicas: 0 (0.0 %)
 Number of data-nodes: 8
 Number of racks: 1
FSCK ended at Thu Jan 25 09:33:56 CST 2018 in 2 milliseconds

The filesystem under path '/apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba' is HEALTHY
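
The deprecation notice in the output suggests using the hdfs command instead, i.e. `hdfs fsck <path>`. The health verdict can also be tested in a script rather than read by eye. The sketch below is an assumption-laden example: fsck_is_healthy is a made-up name, and it simply matches the final summary line in the format shown above.

```shell
# Hypothetical helper: reads fsck output on stdin and succeeds (exit 0)
# only if the final summary line reports the path as HEALTHY.
fsck_is_healthy() {
  grep -q "^The filesystem under path .* is HEALTHY"
}

# Usage on a node with an HDFS client (not executed here):
#   hdfs fsck /apps/hbase/data/data/ks_namespace/active_users/6f6f4f8204ccaa0b71df08591de155ba \
#     | fsck_is_healthy && echo "blocks OK"
```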


【Fix】Try deleting this region's znode in ZooKeeper
# Enter zkCli and delete the region name under the RIT znode region-in-transition
[zk: localhost:2181(CONNECTED) 3] ls /hbase-unsecure/region-in-transition
[6f6f4f8204ccaa0b71df08591de155ba]
[zk: localhost:2181(CONNECTED) 4] rmr /hbase-unsecure/region-in-transition/6f6f4f8204ccaa0b71df08591de155ba
[zk: localhost:2181(CONNECTED) 5] ls /hbase-unsecure/region-in-transition                                  
[]
[zk: localhost:2181(CONNECTED) 6] ls /hbase-unsecure/region-in-transition
[]
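
To confirm from a script that no region is left under region-in-transition, the bracketed list that zkCli prints for `ls` can be split into one child per line. This is a sketch only: zk_children is a hypothetical name, and the usage line assumes that the last line printed by a non-interactive zkCli.sh call is the `ls` result, which may not hold for every ZooKeeper version.

```shell
# Hypothetical helper: reads a zkCli `ls` result such as "[a, b]" on stdin
# and prints one child znode name per line (nothing for "[]").
zk_children() {
  sed -e 's/^\[//' -e 's/\]$//' | tr ',' '\n' | sed -e 's/^ *//' -e '/^$/d'
}

# Usage (assumption: the result is the last output line; not executed here):
#   zkCli.sh -server localhost:2181 ls /hbase-unsecure/region-in-transition 2>/dev/null \
#     | tail -n 1 | zk_children
```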

# After restarting the master (it rebuilds the RIT and region information), verify in hbase shell. The scan returns rows normally.
hbase(main):007:0> scan 'ks_namespace:active_users', LIMIT => 2
ROW                              COLUMN+CELL                                                                                   
 (null)                          column=cf:confidence, timestamp=10000, value=1.0                                              
 (null)                          column=cf:label, timestamp=10000, value=\xE4\xB8\xAD\xE5\x9B\xBD,\xE4\xB8\x8A\xE6\xB5\xB7,\xE4
                                 \xB8\x8A\xE6\xB5\xB7,,\xE7\xA7\xBB\xE5\x8A\xA8                                                
 ----                            column=cf:confidence, timestamp=10000, value=1.0                                              
 ----                            column=cf:label, timestamp=10000, value=\xE4\xB8\xAD\xE5\x9B\xBD,\xE8\xBE\xBD\xE5\xAE\x81,\xE8
                                 \x90\xA5\xE5\x8F\xA3,,\xE8\x81\x94\xE9\x80\x9A                                                

2 row(s) in 0.2300 seconds


【Further reading】

For the details of how RIT arises, see this blog post: http://blog.csdn.net/LW_GHY/article/details/60780065