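Time to watch how this elephant dances. Below is the complete process I followed to install the JDK and Hadoop on Ubuntu 12.10.
Note: at first I searched all over the web for a reliable way to install Hadoop, which was a frustrating process. Only later did I find that the official documentation already contains a dependable procedure: Hadoop Single Node Setup.
I. Install the Java development environment (Ubuntu ships with OpenJDK: run java -version to check the version, or run sudo apt-get install java, which reports that OpenJDK is already installed)
1. Download jdk-6u37-linux-i586.bin with Firefox. The download lands in /home/baron/Downloads/
2. Create a java directory under /usr/: sudo mkdir /usr/java
3. Copy the file into the new directory: sudo cp /home/baron/Downloads/jdk-6u37-linux-i586.bin /usr/java
4. Make the file executable: sudo chmod u+x /usr/java/jdk-6u37-linux-i586.bin
5. Run the file from inside that directory: cd /usr/java, then sudo ./jdk-6u37-linux-i586.bin . Afterwards /usr/java/ contains the installer jdk-6u37-linux-i586.bin and the extracted directory jdk1.6.0_37.
6. Configure the JDK environment variables in the profile: sudo vi /etc/profile, and append the following lines at the end. (Take great care not to mistype them, or you will not be able to get into the desktop. If that happens, press ctrl+alt+F1 to switch to a text console, log in with your username and password, and correct the file with sudo vi /etc/profile.)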
export JAVA_HOME=/usr/java/jdk1.6.0_37
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
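Because a typo in /etc/profile can lock you out of the desktop, it may help to check the new lines for syntax errors before appending them. The sketch below is one way to do that, assuming that `sh -n` (parse without executing) is a good enough check. The scratch file and the `profile_check` variable are illustrative names, not part of the original steps.

```shell
#!/bin/sh
# Write the three export lines to a scratch file first; /etc/profile is not touched yet.
snippet=$(mktemp)
cat > "$snippet" <<'EOF'
export JAVA_HOME=/usr/java/jdk1.6.0_37
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
EOF

# sh -n parses the file without running it, so a stray quote or bracket is caught here.
if sh -n "$snippet"; then
    profile_check=ok
    echo "Syntax OK. To install: sudo sh -c 'cat $snippet >> /etc/profile'"
else
    profile_check=broken
    echo "Syntax error; /etc/profile was not modified." >&2
fi
```

A syntax check does not catch a wrong path, so running java -version after logging in again is still worthwhile to see which JDK is picked up.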
II. Install ssh (Hadoop uses ssh for login authentication between the nodes of a cluster; passwordless ssh setup is covered later)
sudo apt-get install ssh
III. Install rsync (this Ubuntu release already ships with rsync)
sudo apt-get install rsync
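Before moving on, a quick check that both tools are actually on the PATH can save some confusion later. This is a minimal sketch; `have_cmd` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# have_cmd NAME: succeed if the named command can be found on PATH (hypothetical helper).
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report on each tool that the later steps rely on.
for tool in ssh rsync; do
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing (try: sudo apt-get install $tool)"
    fi
done
```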
IV. Install Hadoop
1. Create the hadoop group and user:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
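A new hadoop directory now exists under /home/. From here on it is best to log in to Ubuntu as the newly created hadoop user.
2. Copy the downloaded Hadoop archive into that new directory: sudo cp /home/baron/Downloads/hadoop-1.0.4-bin.tar.gz /home/hadoop/
3. Change into the directory (cd /home/hadoop/) and extract the archive: sudo tar xzf hadoop-1.0.4-bin.tar.gz
4. Go to the directory containing hadoop-env.sh (/home/hadoop/hadoop-1.0.4/conf/) and set the following in that file: export JAVA_HOME=/usr/java/jdk1.6.0_37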
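If you would rather not edit hadoop-env.sh by hand, the change can be scripted. The sketch below appends the export line only when no active one exists yet. `set_java_home` is a hypothetical helper, and the demonstration runs on a scratch file with a made-up commented default rather than on the real conf/hadoop-env.sh.

```shell
#!/bin/sh
# set_java_home FILE DIR: append an export line unless FILE already has an active one.
# (hypothetical helper; FILE stands in for conf/hadoop-env.sh)
set_java_home() {
    env_file=$1
    java_home=$2
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        echo "JAVA_HOME already set in $env_file"
    else
        echo "export JAVA_HOME=$java_home" >> "$env_file"
    fi
}

# Demonstration on a scratch file: the commented placeholder stays, one active line is added.
demo_env=$(mktemp)
echo '# export JAVA_HOME=/some/default' > "$demo_env"
set_java_home "$demo_env" /usr/java/jdk1.6.0_37
set_java_home "$demo_env" /usr/java/jdk1.6.0_37   # second call changes nothing
```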
5. By default Hadoop runs in Standalone Operation mode. You can test it by following the official documentation:
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp conf/*.xml input
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
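To get a feel for what the example job computes, the same counting can be approximated with ordinary command-line tools. As far as I understand the example, it counts each distinct match of the regular expression and lists the most frequent first; the sketch below does that locally. `local_grep_counts` and the sample file are made up for illustration and are not part of Hadoop.

```shell
#!/bin/sh
# local_grep_counts DIR REGEX: count every regex match in DIR/*.xml, most frequent first.
local_grep_counts() {
    LC_ALL=C grep -ohE -- "$2" "$1"/*.xml \
        | LC_ALL=C sort \
        | uniq -c \
        | LC_ALL=C sort -rn \
        | awk '{ print $1 "\t" $2 }'
}

# Tiny made-up input so the sketch runs without a Hadoop install.
demo_in=$(mktemp -d)
cat > "$demo_in/sample.xml" <<'EOF'
<name>dfs.replication</name>
<description>dfs.replication and dfs.name.dir</description>
EOF
local_grep_counts "$demo_in" 'dfs[a-z.]+'
```

Pointing the function at the input directory created above (local_grep_counts input 'dfs[a-z.]+') gives something to compare against the contents of output/.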
6. Alternatively, use Pseudo-Distributed Operation mode, again following the official documentation:
Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration. Use the following:
conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
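The three files share the same shape, so they can also be generated instead of typed. This is only a sketch: `write_site_conf` is a hypothetical helper, and it writes into a scratch directory so that nothing in conf/ is overwritten until you have looked at the result.

```shell
#!/bin/sh
# write_site_conf FILE NAME VALUE: emit a one-property Hadoop configuration file.
write_site_conf() {
    cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>$2</name>
    <value>$3</value>
  </property>
</configuration>
EOF
}

# Write into a scratch directory; copy the files into conf/ once they look right.
conf_dir=$(mktemp -d)
write_site_conf "$conf_dir/core-site.xml"   fs.default.name    hdfs://localhost:9000
write_site_conf "$conf_dir/hdfs-site.xml"   dfs.replication    1
write_site_conf "$conf_dir/mapred-site.xml" mapred.job.tracker localhost:9001
echo "Generated files in $conf_dir"
```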
7. Test whether you can ssh into localhost (I forgot to copy the on-screen prompt; if you are asked to confirm, type yes):
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot log in, generate a key yourself:
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
Screen output (some data replaced with *):
Generating public/private dsa key pair.
Your identification has been saved in /home/hadoop/.ssh/id_dsa.
Your public key has been saved in /home/hadoop/.ssh/id_dsa.pub.
The key fingerprint is:
b3:5d:c4:*** hadoop@Baron-SR25E
The key's randomart image is:
+--[ DSA 1024]----+
| ...o E... |
| . ...= .. |
| o .. + |
| . * |
| S + o |
| = = . |
| . o o o |
| . o . |
| ... . |
+-----------------+
Log in over ssh without entering a password:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Appending the public key to authorized_keys makes passwordless login possible.
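If ssh localhost still asks for a password after the key has been appended, loose file permissions are one common cause: with its StrictModes setting enabled, sshd ignores keys when the files involved are writable by others. The sketch below tightens them. `fix_ssh_perms` is a hypothetical helper, and the demonstration uses a scratch directory in place of ~/.ssh.

```shell
#!/bin/sh
# fix_ssh_perms DIR: tighten permissions on an .ssh directory and its authorized_keys.
fix_ssh_perms() {
    chmod 700 "$1"
    if [ -f "$1/authorized_keys" ]; then
        chmod 600 "$1/authorized_keys"
    fi
}

# Demonstration on a scratch directory; on the real machine the argument would be ~/.ssh.
demo_ssh=$(mktemp -d)
touch "$demo_ssh/authorized_keys"
fix_ssh_perms "$demo_ssh"
ls -ld "$demo_ssh" "$demo_ssh/authorized_keys"
```

Afterwards, ssh -o BatchMode=yes localhost true is a handy non-interactive test: it fails instead of prompting when key-based login does not work.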
8. Format the namenode:
Format a new distributed-filesystem:
$ bin/hadoop namenode -format
12/11/10 16:25:48 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = Baron-SR25E/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.0.4
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1393290; compiled by 'hortonfo' on Wed Oct 3 05:13:58 UTC 2012
************************************************************/
12/11/10 16:25:49 INFO util.GSet: VM type = 32-bit
12/11/10 16:25:49 INFO util.GSet: 2% max memory = 17.77875 MB
12/11/10 16:25:49 INFO util.GSet: capacity = 2^22 = 4194304 entries
12/11/10 16:25:49 INFO util.GSet: recommended=4194304, actual=4194304
12/11/10 16:25:49 INFO namenode.FSNamesystem: fsOwner=root
12/11/10 16:25:49 INFO namenode.FSNamesystem: supergroup=supergroup
12/11/10 16:25:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/11/10 16:25:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/11/10 16:25:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/11/10 16:25:49 INFO namenode.NameNode: Caching file names occuring more than 10 times
12/11/10 16:25:50 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/11/10 16:25:50 INFO common.Storage: Storage directory /tmp/hadoop-root/dfs/name has been successfully formatted.
12/11/10 16:25:50 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Baron-SR25E/127.0.1.1
************************************************************/
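The log above shows fsOwner=root and a storage directory under /tmp/hadoop-root, which suggests the format was run as root (for example through sudo) rather than as the hadoop user. Assuming the default layout, where the storage directory is derived from the name of the user running the command, daemons started later under a different user would look in a different place. The sketch below checks for that; `default_name_dir` is a hypothetical helper that simply mirrors the path pattern seen in the log.

```shell
#!/bin/sh
# default_name_dir USER: path pattern taken from the log line above (/tmp/hadoop-<user>/dfs/name).
default_name_dir() {
    echo "/tmp/hadoop-$1/dfs/name"
}

current_user=$(id -un)
name_dir=$(default_name_dir "$current_user")
if [ -d "$name_dir" ]; then
    echo "Found a formatted storage directory for $current_user: $name_dir"
else
    echo "No storage directory at $name_dir; was the format run as another user?"
fi
```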
9. Run Hadoop following the example in the official documentation (remember: first log in to localhost with ssh):
Start the hadoop daemons:
$ bin/start-all.sh
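If an error reports that directories cannot be created or similar, you can prefix the command with sudo. At that point, however, you will be told that this user is not allowed to use sudo, so make the following changes:
1) Enter superuser mode by typing "su -":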
su -
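The system asks for the superuser password; once you enter it you are in superuser mode, that is, working as root. Note the "-": this differs from plain su. Plain "su" only switches to root without bringing in root's environment variables, so the current user's environment stays in effect; "su -" brings in root's environment as well, just as if root had logged in.
2) Add write permission to the file: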
chmod u+w /etc/sudoers
3) Edit the /etc/sudoers file:
vi /etc/sudoers
In the editor, find this line:
root ALL=(ALL:ALL) ALL
Add the following below it:
hadoop ALL=(ALL:ALL) ALL
Here hadoop is your username. Then save and exit.
4) Remove the write permission again:
chmod u-w /etc/sudoers
Then run the command above again to start Hadoop; it should now work.
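Once start-all.sh returns, the jps tool that ships with the JDK lists the running Java processes, which is a quick way to see whether every daemon came up. For Hadoop 1.x the expected names are NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The sketch below reports any that are absent; `missing_daemons` is a hypothetical helper, and the sample input is made up.

```shell
#!/bin/sh
# missing_daemons: read `jps` output on stdin, print each expected daemon that is absent.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <name>"; compare the whole second field so that
        # NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" \
            | awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "$daemon"
        fi
    done
}

# Real use: jps | missing_daemons
# Demonstration with made-up output in which the DataNode did not come up:
printf '%s\n' '2101 NameNode' '2410 SecondaryNameNode' \
    '2502 JobTracker' '2733 TaskTracker' '2810 Jps' | missing_daemons
```

An empty result means all five daemons are present; anything printed is worth looking up in the files under the logs directory.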
10. Continue with the commands from the example in the official documentation:
Browse the web interface for the NameNode and the JobTracker; by default they are available at:
NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/
Copy the input files into the distributed filesystem:
$ bin/hadoop fs -put conf input
Run some of the examples provided:
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hadoop fs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hadoop fs -cat output/*
When you're done, stop the daemons with:
$ bin/stop-all.sh
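Seeing the result shown in the figure, I was reasonably satisfied, although I do not yet fully understand what each value means; that is left for later study.
Note: analysis showed that the various errors raised while running hadoop commands were mainly caused by the logged-in user's restricted read/write permissions on files. Once the user has read/write access to /home/hadoop/hadoop-1.0.4 these problems no longer appear; that is the method in step IV-9, which I found online and have verified myself.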
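Since the archive in step 3 was extracted with sudo, the extracted files probably do not belong to the hadoop user, which would explain the permission errors. An alternative to running everything through sudo, not part of the original steps, is to hand the tree over to the hadoop user. The sketch below only detects the situation and prints a suggestion; `needs_chown` is a hypothetical helper.

```shell
#!/bin/sh
# needs_chown DIR: succeed when DIR exists but the current user cannot write to it.
needs_chown() {
    [ -d "$1" ] && [ ! -w "$1" ]
}

hadoop_home=/home/hadoop/hadoop-1.0.4   # path from step 3
if needs_chown "$hadoop_home"; then
    ls -ld "$hadoop_home"
    echo "Not writable by $(id -un). One possible fix: sudo chown -R hadoop:hadoop $hadoop_home"
else
    echo "$hadoop_home is writable by $(id -un), or does not exist on this machine."
fi
```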
Additional notes:
To run hadoop commands directly in a terminal, you also need to change the PATH environment variable in /etc/profile, for example:
export HADOOP=/home/hadoop/hadoop-1.0.4
export PATH=$HADOOP/bin:$PATH
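Because a profile can end up being read more than once in nested login shells, the same directory may be added to PATH repeatedly. A guarded variant avoids that. This is a minimal sketch; `path_prepend` is a hypothetical helper name.

```shell
#!/bin/sh
# path_prepend DIR: put DIR at the front of PATH unless it is already listed.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH alone
        *) PATH="$1:$PATH" ;;
    esac
}

HADOOP=/home/hadoop/hadoop-1.0.4
path_prepend "$HADOOP/bin"
path_prepend "$HADOOP/bin"           # a repeated call is a no-op
export HADOOP PATH
```

After logging in again, which hadoop should print a path under $HADOOP/bin if the change took effect.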