Configuring a Hadoop Development Environment in Eclipse and Running the WordCount Example

Posted: 2021-08-31 08:30:44

My development environment:

Operating system: Ubuntu 12.04, with one namenode and three datanodes

Hadoop version: hadoop-1.0.1

Eclipse version: Eclipse SDK 3.8.2

Step 1: Start the Hadoop daemons

For details, see: http://www.cnblogs.com/flyoung2008/archive/2011/11/29/2268302.html

Step 2: Install the Hadoop plugin in Eclipse

1. Copy the Hadoop Eclipse plugin hadoop-eclipse-plugin-1.2.1 into <eclipse install dir>/plugins/.
Download: http://download.csdn.net/detail/wtxwd/7803427
2. Restart Eclipse and configure the Hadoop installation directory.
If the plugin installed successfully, open Window --> Preferences and you will see a Hadoop Map/Reduce entry; set the Hadoop installation directory there, then close the dialog.


3. Configure Map/Reduce Locations.
Open Map/Reduce Locations via Window --> Show View.
In the Map/Reduce Locations view, create a new Hadoop Location: right-click --> New Hadoop Location. In the dialog, fill in a Location name (e.g. Hadoop) as well as the Map/Reduce Master and DFS Master. The Host and Port fields must match the address and port you configured in mapred-site.xml and core-site.xml respectively. For example:

Map/Reduce Master

Host: master (192.168.1.6)
Port: 9001

DFS Master

Host: master (192.168.1.6)
Port: 9000
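
For reference, values like these normally correspond to entries of the following shape in the two configuration files (a Hadoop 1.x sketch; master is this cluster's namenode host, so substitute your own):

<!-- core-site.xml: the DFS Master address -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>

<!-- mapred-site.xml: the Map/Reduce Master (JobTracker) address -->
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>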
When the configuration is done, close the dialog. Click Window --> Perspective and switch to the Map/Reduce perspective, then expand DFS Locations --> Hadoop. If the folders are listed (with a count such as "(2)"), the configuration is correct; if you see "Connection refused", recheck your configuration.

Step 3: Create a project

File --> New --> Other --> Map/Reduce Project
The project name can be anything, e.g. WordCount.
Copy WordCount.java into the newly created project (a sketch of the class follows below for reference).
Source file download: http://download.csdn.net/detail/wtxwd/7803385
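
If the download link is unavailable: the class is essentially the classic WordCount from the Hadoop MapReduce tutorial. A minimal sketch against the Hadoop 1.x org.apache.hadoop.mapreduce API is shown here (the downloadable source may differ in details):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every whitespace-separated token
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer (also used as combiner): sum the counts for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count"); // this constructor is the 1.x style
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // first program argument: input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // second program argument: output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}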
Step 4: Leave Eclipse for a moment and upload the input files

Open a terminal and create two files, file01 and file02, with the contents shown below.
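Any editor works; one quick way from the shell is:

echo "Hello World Bye World" > file01
echo "Hello Hadoop Goodbye Hadoop" > file02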


[hadoop@localhost~]$ ls
classes  Desktop  file01  file02  hadoop-0.20.203.0  wordcount.jar  WordCount.java


[hadoop@localhost~]$ cat file01
Hello World Bye World


[hadoop@localhost~]$ cat file02
Hello Hadoop Goodbye Hadoop

With Hadoop running, create an input folder in HDFS:


[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -ls
[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -mkdir input
[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2011-11-23 05:20 /user/hadoop/input

Then upload file01 and file02 into input:


[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -put file01 input
[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -put file02 input
[hadoop@localhost~]$ hadoop-0.20.203.0/bin/hadoop fs -ls input
Found 2 items
-rw-r--r--   1 hadoop supergroup         22 2011-11-23 05:22 /user/hadoop/input/file01
-rw-r--r--   1 hadoop supergroup         28 2011-11-23 05:22 /user/hadoop/input/file02


Step 5: Run the project

1. In the new project, select WordCount.java, then right-click --> Run As --> Run Configurations.
2. In the Run Configurations dialog, select Java Application, right-click --> New; this creates a new launch configuration named WordCount.
3. Configure the run arguments: on the Arguments tab, enter in Program arguments the input folder you want to pass to the program and the folder where its results should be saved, e.g.:

hdfs://master:9000/user/hadoop/input  hdfs://master:9000/user/hadoop/output

4. If the run fails with java.lang.OutOfMemoryError: Java heap space, set VM arguments (the field below Program arguments) to:

-Xms512m -Xmx1024m -XX:MaxPermSize=256m


Once the input and output paths are filled in, click Apply. Do not click Run here, though: that Run launches the job locally, whereas we want to run it on the Hadoop cluster. Click Close instead, then go to WordCount.java --> right-click --> Run As --> Run on Hadoop.


Click Finish to run it.

After writing a Hadoop program in Eclipse and choosing Run on Hadoop, you may see an error such as:

org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=mango, access=WRITE, inode="hadoop"/inode="tmp":hadoop:supergroup:rwxr-xr-x

The cause: when the Eclipse Hadoop plugin submits a job, it writes to HDFS as the local OS user by default (here, mango), under the matching HDFS path /user/xxx (in my case /user/zcf). Since user mango has no write permission on the hadoop directory, the exception is thrown.

The suggested fix is to open up the permissions on the hadoop directory:

$ cd /usr/local/hadoop
$ bin/hadoop fs -chmod 777 /user/hadoop
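
An alternative workaround (not from the original walkthrough, and only sensible on a development or test cluster) is to disable HDFS permission checking altogether via the Hadoop 1.x property dfs.permissions in hdfs-site.xml:

<!-- hdfs-site.xml: turn off HDFS permission checks (test clusters only) -->
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>

Restart the namenode after changing this setting.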

After a successful run, the console output looks like this:

13/06/30 14:17:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/06/30 14:17:01 INFO input.FileInputFormat: Total input paths to process : 2
13/06/30 14:17:01 WARN snappy.LoadSnappy: Snappy native library not loaded
13/06/30 14:17:01 INFO mapred.JobClient: Running job: job_local_0001
13/06/30 14:17:02 INFO util.ProcessTree: setsid exited with exit code 0
13/06/30 14:17:02 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1e903d5
13/06/30 14:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/30 14:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/30 14:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/30 14:17:02 INFO mapred.MapTask: Starting flush of map output
13/06/30 14:17:02 INFO mapred.MapTask: Finished spill 0
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner: 
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/06/30 14:17:02 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@c01e99
13/06/30 14:17:02 INFO mapred.MapTask: io.sort.mb = 100
13/06/30 14:17:02 INFO mapred.MapTask: data buffer = 79691776/99614720
13/06/30 14:17:02 INFO mapred.MapTask: record buffer = 262144/327680
13/06/30 14:17:02 INFO mapred.MapTask: Starting flush of map output
13/06/30 14:17:02 INFO mapred.MapTask: Finished spill 0
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner: 
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
13/06/30 14:17:02 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@31f2a7
13/06/30 14:17:02 INFO mapred.LocalJobRunner: 
13/06/30 14:17:02 INFO mapred.Merger: Merging 2 sorted segments
13/06/30 14:17:02 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 73 bytes
13/06/30 14:17:02 INFO mapred.LocalJobRunner: 
13/06/30 14:17:02 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/06/30 14:17:02 INFO mapred.LocalJobRunner: 
13/06/30 14:17:02 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/06/30 14:17:02 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/user/hadoop/output
13/06/30 14:17:02 INFO mapred.LocalJobRunner: reduce > reduce
13/06/30 14:17:02 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/06/30 14:17:02 INFO mapred.JobClient:  map 100% reduce 100%
13/06/30 14:17:02 INFO mapred.JobClient: Job complete: job_local_0001
13/06/30 14:17:02 INFO mapred.JobClient: Counters: 22
13/06/30 14:17:02 INFO mapred.JobClient:   File Output Format Counters 
13/06/30 14:17:02 INFO mapred.JobClient:     Bytes Written=31
13/06/30 14:17:02 INFO mapred.JobClient:   FileSystemCounters
13/06/30 14:17:02 INFO mapred.JobClient:     FILE_BYTES_READ=18047
13/06/30 14:17:02 INFO mapred.JobClient:     HDFS_BYTES_READ=116
13/06/30 14:17:02 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=214050
13/06/30 14:17:02 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=31
13/06/30 14:17:02 INFO mapred.JobClient:   File Input Format Counters 
13/06/30 14:17:02 INFO mapred.JobClient:     Bytes Read=46
13/06/30 14:17:02 INFO mapred.JobClient:   Map-Reduce Framework
13/06/30 14:17:02 INFO mapred.JobClient:     Map output materialized bytes=81
13/06/30 14:17:02 INFO mapred.JobClient:     Map input records=2
13/06/30 14:17:02 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/06/30 14:17:02 INFO mapred.JobClient:     Spilled Records=12
13/06/30 14:17:02 INFO mapred.JobClient:     Map output bytes=78
13/06/30 14:17:02 INFO mapred.JobClient:     Total committed heap usage (bytes)=681639936
13/06/30 14:17:02 INFO mapred.JobClient:     CPU time spent (ms)=0
13/06/30 14:17:02 INFO mapred.JobClient:     SPLIT_RAW_BYTES=222
13/06/30 14:17:02 INFO mapred.JobClient:     Combine input records=8
13/06/30 14:17:02 INFO mapred.JobClient:     Reduce input records=6
13/06/30 14:17:02 INFO mapred.JobClient:     Reduce input groups=4
13/06/30 14:17:02 INFO mapred.JobClient:     Combine output records=6
13/06/30 14:17:02 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/06/30 14:17:02 INFO mapred.JobClient:     Reduce output records=4
13/06/30 14:17:02 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/06/30 14:17:02 INFO mapred.JobClient:     Map output records=8



After the run completes, check the results in the output folder:

hadoop fs -ls /user/hadoop/output
Output:

Found 2 items
-rw-r--r-- 3 hadoop supergroup 0 2014-08-22 17:13 /user/hadoop/output/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 41 2014-08-22 17:13 /user/hadoop/output/part-r-00000
View the file contents:

hadoop fs -cat /user/hadoop/output/part-r-00000

Output:

Bye	1
Goodbye	1
Hadoop	2
Hello	2
World	2

That completes the WordCount example under Eclipse. If you want to run it again, you must first delete the output folder (or change the output path in the Run Configuration arguments): to protect existing results, Hadoop throws an exception when the output directory already exists, such as:

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/hadoop/output already exists
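
On Hadoop 1.x the stale output directory can be removed with a recursive delete (path taken from this walkthrough):

$ bin/hadoop fs -rmr /user/hadoop/output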