Pseudo-Distributed Deployment of Hadoop + Spark on CentOS

Date: 2021-08-19 06:11:34

I. Software

  • CentOS 6.5
  • JDK 1.7
  • hadoop-2.6.1.tar.gz (a build recompiled for 64-bit platforms)
  • scala-2.11.7.tgz
  • spark-1.5.0-bin-hadoop2.6.tgz

II. Preparation before installation

1. Install the JDK system-wide

a. Unpack the JDK archive

b. Configure environment variables (these can go in a script under /etc/profile.d/):

export JAVA_HOME=/usr/java/jdk1.7.0_21
export CLASSPATH=.:$JAVA_HOME/lib:$CLASSPATH
export PATH=$JAVA_HOME/bin:$PATH

source /etc/profile

c. Verify the Java installation

java -version
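If the version string does not appear, confirm that the variables from step b are in effect:

echo $JAVA_HOME
which java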


2. Create the hadoop user and group, and grant them root privileges in /etc/sudoers

# groupadd hadoop

# useradd -g hadoop hadoop

# passwd hadoop

# visudo

Add the following:

## Allow root to run any commands anywhere

root    ALL=(ALL)       ALL
hadoop  ALL=(ALL)       ALL

3. Set the hostname

Edit /etc/hosts to map the machine's IP address to its hostname, set HOSTNAME in /etc/sysconfig/network, then confirm with the hostname command:

vim /etc/hosts

vim /etc/sysconfig/network

hostname
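A minimal sketch of the two files, assuming the machine is named nameNode (the hostname used in the Hadoop configs below) with IP 192.168.1.104 (the address used for the Spark master later):

# /etc/hosts
192.168.1.104   nameNode

# /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=nameNode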

4. Install the SSH service and set up passwordless (key-based) SSH access

a. Install the openssh service

rpm -qa | grep ssh

yum install openssh

b. Generate a public/private key pair

Log in as the hadoop user, then run:

ssh-keygen -t rsa 

When the key's randomart image is printed, the key pair was generated successfully, and two new files appear under ~/.ssh:

Private key: id_rsa
Public key: id_rsa.pub

c. Append the contents of the public key file id_rsa.pub to the authorized_keys file:

cat id_rsa.pub >> authorized_keys 
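sshd refuses keys whose permissions are too open, so restrict the file to its owner:

chmod 600 authorized_keys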

d. Distribute the authorized_keys file to each dataNode, for example as sketched below:
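A minimal sketch, assuming a worker host named dataNode1 (substitute your actual node names); ssh-copy-id does the same in one step:

scp ~/.ssh/authorized_keys hadoop@dataNode1:~/.ssh/
# or equivalently:
ssh-copy-id hadoop@dataNode1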

e. Verify passwordless SSH login
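For a pseudo-distributed setup it is enough that the local machine accepts the key; no password prompt should appear:

ssh localhost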


5. Stop the firewall

# service iptables stop
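On CentOS 6 iptables starts again at boot, so you may also want to disable it permanently:

# chkconfig iptables off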


III. Hadoop configuration and deployment

1. Download Hadoop

http://mirrors.hust.edu.cn/apache/hadoop/common/

2. Configuration files

Unpack the archive: tar zxvf hadoop-2.6.1.tar.gz

Enter the configuration directory: cd hadoop-2.6.1/etc/hadoop

a. core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://nameNode:9000</value>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>4096</value>
  </property>
</configuration>

b. hdfs-site.xml

<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>hadoop-cluster1</value>
  </property>
  <property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>nameNode:50090</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>

c. mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.http.address</name>
    <value>nameNode:50030</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>nameNode:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>nameNode:19888</value>
  </property>
</configuration>

d. yarn-site.xml

<configuration>
  <!-- Site-specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>nameNode:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>nameNode:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>nameNode:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>nameNode:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>nameNode:8088</value>
  </property>
</configuration>

e. slaves

Write the hostnames of the dataNode machines into this file, one per line, as sketched below.
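For a pseudo-distributed deployment the only dataNode is the local machine, so a minimal sketch (assuming the hostname nameNode from step II.3) is:

echo nameNode > slaves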

f. Set JAVA_HOME

Add the JAVA_HOME setting to both hadoop-env.sh and yarn-env.sh:

vim hadoop-env.sh

vim yarn-env.sh
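The line to add in each file, matching the JDK path from step II.1:

export JAVA_HOME=/usr/java/jdk1.7.0_21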

g. Configure system-wide environment variables (e.g. in /etc/profile.d/, like the JDK settings; the path below assumes Hadoop was unpacked to /usr/local/hadoop26, the directory referenced by HADOOP_CONF_DIR in the Spark section):

export HADOOP_HOME=/usr/local/hadoop26
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

3. Format the HDFS filesystem

bin/hdfs namenode -format

4. Start and stop the services

Start (run from $HADOOP_HOME/sbin):

./start-dfs.sh

./start-yarn.sh

Stop:

./stop-yarn.sh

./stop-dfs.sh

5. Verification
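A quick sanity check: jps should list the NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager daemons, and the web UIs should respond (Hadoop 2.6 default ports):

jps

# HDFS NameNode UI:        http://nameNode:50070
# YARN ResourceManager UI: http://nameNode:8088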


IV. Install Scala

1. Download Scala 2.11.7 from http://www.scala-lang.org/

2. Move the downloaded scala-2.11.7.tgz to /usr/local/ and unpack it: tar zxvf scala-2.11.7.tgz
3. Configure environment variables:

vim /etc/profile
export SCALA_HOME=/usr/local/scala-2.11.7
export PATH=$PATH:$SCALA_HOME/bin

source /etc/profile

4. Verify Scala

scala -version


V. Spark deployment and installation

1. Download Spark 1.5.0 from http://mirrors.cnnic.cn/apache/
2. Unpack spark-1.5.0-bin-hadoop2.6.tgz
3. Configure environment variables:

vim /etc/profile

export SPARK_HOME=/app/spark-1.5.0-bin-hadoop2.6

export PATH=$PATH:$SCALA_HOME/bin:$SPARK_HOME/bin

source /etc/profile

4. In Spark's conf directory, create spark-env.sh from the template:

cp spark-env.sh.template spark-env.sh

and append the following to spark-env.sh:

### JDK installation directory (matching step II.1)
export JAVA_HOME=/usr/java/jdk1.7.0_21

### Scala installation directory
export SCALA_HOME=/usr/local/scala-2.11.7

### IP of the Spark master node
export SPARK_MASTER_IP=192.168.1.104

### optional tuning; the defaults are fine for a pseudo-distributed setup
#export SPARK_MASTER_PORT=30111
#export SPARK_MASTER_WEBUI_PORT=30118
#export SPARK_WORKER_CORES=2
#export SPARK_WORKER_PORT=30333
#export SPARK_WORKER_WEBUI_PORT=30119
#export SPARK_WORKER_INSTANCES=1

### maximum memory the worker node may allocate to executors
export SPARK_WORKER_MEMORY=1g

### configuration directory of the Hadoop cluster
export HADOOP_CONF_DIR=/usr/local/hadoop26/etc/hadoop

### configuration directory of the Spark cluster
export SPARK_CONF_DIR=/app/spark-1.5.0-bin-hadoop2.6/conf

# Spark daemon JVM tuning
export SPARK_DAEMON_JAVA_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"

5. Edit the slaves file under conf and add all worker nodes (see the sketch after this list).
6. Start a Spark shell:

        bin/spark-shell
7. While the shell is running, view the Spark application web UI at http://ip:4040
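A minimal sketch for step 5, plus the standard way to bring up the standalone cluster (ports 7077 and 8080 are the Spark standalone defaults; the hostname nameNode is the one assumed in step II.3):

# conf/slaves: one worker hostname per line
echo nameNode > conf/slaves

# start the standalone master and workers
sbin/start-all.sh

# master web UI: http://192.168.1.104:8080
# connect a shell to the cluster:
bin/spark-shell --master spark://192.168.1.104:7077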


For more material, see: https://spark.apache.org/docs