While setting up a Hadoop pseudo-distributed cluster I ran into a strange problem: every other process started fine, but the DataNode did not. I finally tracked down the cause and am recording it here for future reference.
The contents of the configuration files are as follows:
/etc/profile:
########## jdk ################
export JAVA_HOME=/opt/package/jdk1.7.0_76
########### hadoop ############
export HADOOP_HOME=/opt/package/hadoop-2.7.2
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib:$HADOOP_HOME/lib:$CLASSPATH
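To confirm the variables actually took effect, a quick sanity check after editing /etc/profile (assuming the install paths above):
source /etc/profile
echo $JAVA_HOME
echo $HADOOP_HOME
hadoop version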
hadoop-env.sh:
export JAVA_HOME=/opt/package/jdk1.7.0_76
hdfs-site.xml:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
core-site.xml:
<property>
<name>fs.defaultFS</name>
<value>hdfs://node1:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/data/hadoop_tmp_dir</value>
</property>
mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
Formatting HDFS did not raise any exceptions either.
The log directory looked like this:
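For reference, the format command I used (run from $HADOOP_HOME/bin, the same command that comes up again later in this post):
./hdfs namenode -format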
[root@node1 logs]# ll
total 132
-rw-r--r-- 1 root root 49693 Jun 29 12:25 hadoop-root-namenode-node1.log
-rw-r--r-- 1 root root 717 Jun 29 12:04 hadoop-root-namenode-node1.out
-rw-r--r-- 1 root root 43240 Jun 29 12:25 hadoop-root-secondarynamenode-node1.log
-rw-r--r-- 1 root root 19360 Jun 29 12:24 hadoop-root-secondarynamenode-node1.out
-rw-r--r-- 1 root root 0 Jun 29 12:04 SecurityAuth-root.audit
Starting HDFS printed the following:
[root@node1 sbin]# ./start-dfs.sh
17/06/29 12:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [node1]
node1: starting namenode, logging to /opt/package/hadoop-2.7.2/logs/hadoop-root-namenode-node1.out
localhost: ssh: Could not resolve hostname localhost: Name or service not known
Starting secondary namenodes [0.0.0.0]
0.0.0.0: reverse mapping checking getaddrinfo for localhost [127.0.0.1] failed - POSSIBLE BREAK-IN ATTEMPT!
0.0.0.0: starting secondarynamenode, logging to /opt/package/hadoop-2.7.2/logs/hadoop-root-secondarynamenode-node1.out
17/06/29 12:05:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[root@node1 sbin]#
These messages looked fairly normal. Listing the Java processes with jps, however, showed that the DataNode was missing:
[root@node1 sbin]# jps
8193 SecondaryNameNode
8012 NameNode
8318 Jps
After a lot of fiddling around, I realized that an important hint in the HDFS startup output had been overlooked, namely this line:
localhost: ssh: Could not resolve hostname localhost: Name or service not known
Once that sank in, I immediately checked /etc/hosts and found the problem:
surprisingly, it contained no mapping at all between the local host IP and localhost.
So I added this line:
127.0.0.1 node1 localhost.localdomain localhost
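Before restarting, a quick check that the new mapping works (the ssh test assumes passwordless SSH is already set up, since start-dfs.sh relies on it):
ping -c 1 localhost
ssh localhost hostname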
After adding the local address mapping, I ran ./start-dfs.sh again. This time the startup no longer printed the error
localhost: ssh: Could not resolve hostname localhost: Name or service not known
but jps still showed no DataNode process.
However, the hadoop-root-datanode-node1.log file contained this error:
2017-06-29 12:32:24,881 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to node1/127.0.0.1:9000. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:745)
2017-06-29 12:32:24,883 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to node1/127.0.0.1:9000
2017-06-29 12:32:24,886 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
Searching the web for the error message Initialization failed for Block pool <registering> led me to another article, Initialization failed for Block pool (thanks to its author); the hints there pointed me to the cause.
The hadoop.tmp.dir parameter configured in my core-site.xml is /opt/data/hadoop_tmp_dir. After running ./hdfs namenode -format, the dfs/data and dfs/name subdirectories were created under that path, each with its own VERSION file.
Opening /opt/data/hadoop_tmp_dir/dfs/data/current/VERSION:
#Thu Jun 29 11:47:52 CST 2017
storageID=DS-c237765a-a14c-4433-9b39-e3da97c04ee5
clusterID=CID-21d3ce28-824f-4863-a5fc-bc6b331c2c74
cTime=0
datanodeUuid=172d0371-62e5-469d-a9b7-b9966eced736
storageType=DATA_NODE
layoutVersion=-56
And /opt/data/hadoop_tmp_dir/dfs/name/current/VERSION:
#Thu Jun 29 12:31:48 CST 2017
namespaceID=123143181
clusterID=CID-90058437-02e5-4619-9e95-715ab2ec880b
cTime=0
storageType=NAME_NODE
blockpoolID=BP-336256458-127.0.0.1-1498710708132
layoutVersion=-63
Comparing the clusterID values in the two files showed that they were indeed inconsistent, so I copied the clusterID from the namenode's VERSION file into the datanode's VERSION file and ran ./start-dfs.sh again. This time the DataNode process was there.
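A quick way to spot and fix the mismatch from the shell (a sketch; the paths follow the hadoop.tmp.dir above, and sed rewrites the datanode file in place, so keep a backup):
grep clusterID /opt/data/hadoop_tmp_dir/dfs/name/current/VERSION /opt/data/hadoop_tmp_dir/dfs/data/current/VERSION
sed -i 's/^clusterID=.*/clusterID=CID-90058437-02e5-4619-9e95-715ab2ec880b/' /opt/data/hadoop_tmp_dir/dfs/data/current/VERSION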
To be honest, at this point I was a bit dizzy: what exactly caused all this?
My analysis points to a combination of causes:
1. The local address mapping was missing from /etc/hosts
2. HDFS was formatted more than once
Together, these produced the problem.
A common explanation is this: after the first format of dfs, Hadoop was started and used; when the format command (hdfs namenode -format) was later run again, the namenode's clusterID was regenerated while the datanode's clusterID stayed unchanged, so the two no longer matched.
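If the cluster holds no data worth keeping, as is typical for a fresh pseudo-cluster, a more drastic alternative is to stop HDFS, wipe the storage directories, and reformat, which regenerates matching IDs everywhere. This is destructive, so it only makes sense for throwaway data:
$HADOOP_HOME/sbin/stop-dfs.sh
rm -rf /opt/data/hadoop_tmp_dir/dfs
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/sbin/start-dfs.sh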
But I suspect there may be other contributing causes as well; more testing is needed to really understand it.