Handling a DataNode UUID Change Caused by a Disk Failure in a Big Data Cluster

Date: 2022-01-27 14:55:49

Yesterday a host in the big data cluster reported disk I/O errors. After taking the host down for maintenance and verifying that disk reads and writes were back to normal, we rejoined it to the cluster, and the Cloudera Manager page then showed an error.

Checking the host's logs, we found one disk reporting errors:

2017-04-22 10:50:11,976 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /dn9/dfs/dn/in_use.lock acquired by nodename 11789@datanode26.wumart.com
2017-04-22 10:50:11,984 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /dn9/dfs/dn is in an inconsistent state: Root /dn9/dfs/dn: DatanodeUuid=031a3a79-8d18-4ba0-9dcf-6f2850e2b65e, does not match 280a0cac-e5d4-497c-baf2-86c3802f3db1 from other StorageDirectory.
2017-04-22 11:12:32,716 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /dn9/dfs/dn/in_use.lock acquired by nodename 15891@datanode26.wumart.com
2017-04-22 11:12:32,716 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /dn9/dfs/dn is in an inconsistent state: Root /dn9/dfs/dn: DatanodeUuid=031a3a79-8d18-4ba0-9dcf-6f2850e2b65e, does not match 280a0cac-e5d4-497c-baf2-86c3802f3db1 from other StorageDirectory.
2017-04-22 12:32:32,086 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /dn9/dfs/dn/in_use.lock acquired by nodename 26611@datanode26.wumart.com
2017-04-22 12:32:32,087 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /dn9/dfs/dn is in an inconsistent state: Root /dn9/dfs/dn: DatanodeUuid=031a3a79-8d18-4ba0-9dcf-6f2850e2b65e, does not match 280a0cac-e5d4-497c-baf2-86c3802f3db1 from other StorageDirectory.


We then searched Baidu for the error message, changed the UUID following the method we found, and the node returned to normal. The detailed write-up of that method follows.


Resolving a DataNode Volume Failures Alert



1. Overview

A DataNode in the Hadoop cluster suffered a hardware failure. Because the follow-up repair would take a long time, the node was removed from the Cloudera cluster; when it was later re-added, the DataNode raised a "DataNode Volume Failures" threshold warning.

2. Resolution process

2.1 Troubleshooting

Inspecting the DataNode log revealed the following error:

2016-06-02 10:19:55,214 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to add volume: [DISK]file:/disk0/dfs/dn/
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /disk0/dfs/dn is in an inconsistent state: Root /disk0/dfs/dn: DatanodeUuid=0b9e33dc-984e-4679-a03b-4271362d3e53, does not match a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9 from other StorageDirectory.
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:609)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:564)
        at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:667)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:288)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.prepareVolume(DataStorage.java:323)
        at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.addVolume(FsDatasetImpl.java:383)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$2.call(DataNode.java:577)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$2.call(DataNode.java:573)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

So when the DataNode scanned its storage directories, the UUID recorded on this volume did not match the UUID used by the other storage directories, which triggered the volume failure alert.
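For reference, each storage directory's identity lives in a small VERSION file under current/, and the field that must agree across all volumes of one DataNode is datanodeUuid. A sketch of what the file looks like (the datanodeUuid is the correct one from this incident; the other field values are illustrative placeholders):

[root@slave191 current]# cat /disk1/dfs/dn/current/VERSION
#Thu Jun 02 10:19:55 CST 2016
storageID=DS-...
clusterID=CID-...
cTime=0
datanodeUuid=a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9
storageType=DATA_NODE
layoutVersion=-56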

2.2 Fixing the issue
  • Find the VERSION file under /disk0/ and edit it.
Note: in my case this DataNode host has 12 disks mounted, so comparing against the VERSION files under the other data directories shows what the UUID ought to be, as in the sketch just below.
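A minimal sketch of that comparison, assuming all twelve volumes follow the /disk*/dfs/dn layout used on this host: grep the datanodeUuid field from every volume's VERSION file and look for the odd one out. The output would look roughly like:

[root@slave191 ~]# grep datanodeUuid /disk*/dfs/dn/current/VERSION
/disk0/dfs/dn/current/VERSION:datanodeUuid=0b9e33dc-984e-4679-a03b-4271362d3e53
/disk1/dfs/dn/current/VERSION:datanodeUuid=a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9
...
/disk11/dfs/dn/current/VERSION:datanodeUuid=a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9

Here /disk0 stands out, so that is the VERSION file to edit: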

[root@slave191 current]# pwd
/disk0/dfs/dn/current
[root@slave191 current]# vim VERSION
# change the datanodeUuid value to a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9
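If you prefer a non-interactive edit over vim, a rough equivalent (same file, same target UUID; take a backup first, since a broken VERSION file will keep the volume from loading at all):

[root@slave191 current]# cp VERSION VERSION.bak
[root@slave191 current]# sed -i 's/^datanodeUuid=.*/datanodeUuid=a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9/' VERSION
[root@slave191 current]# grep datanodeUuid VERSION
datanodeUuid=a5cd8b10-e7d4-40a9-bc6d-f5c0526d16e9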

Then restart the DataNode on this node.
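On a Cloudera-managed cluster the restart is normally done from the Cloudera Manager UI (the DataNode role's Instances page). On a plain CDH package install, something like the following would apply; the service name and log path here are the usual CDH package defaults, not taken from this incident:

# Restart the DataNode service (plain package install; on CM-managed
# nodes use the Cloudera Manager UI instead):
service hadoop-hdfs-datanode restart

# Sanity check: the in_use.lock message for the repaired volume should
# now appear without an InconsistentFSStateException after it.
tail -n 100 /var/log/hadoop-hdfs/hadoop-hdfs-datanode-*.log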


Root cause analysis

Before I took the machine down, this node used Cloudera host template 1, but when remounting I applied template 2, so the data directory layout changed (/disk0 was replaced by /opt). I then made manual changes on top of that, and the end result was the datanodeUuid mismatch alert.
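A hedged way to catch this kind of template mix-up before it bites: compare the data directories in the node's effective configuration against what is actually mounted. hdfs getconf is a standard HDFS command; the mount-point pattern in the grep is an assumption based on this host's layout:

# Data directories the DataNode is configured to use:
hdfs getconf -confKey dfs.datanode.data.dir

# What is actually mounted on the host:
df -h | grep -E '/disk|/dn|/opt'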