Hadoop 2.7.3 + Spark 2.0.2 Cluster Deployment (Notes)
Package Preparation
Oracle JDK
A system that already runs elasticSearch should have a JDK configured; JDK 7 is recommended.
Scala package
spark runs on scala, and scala is the official language for writing spark analytics programs; scala-2.11 is recommended.
Hadoop package
hadoop-yarn provides resource management and HDFS storage for spark jobs; apache hadoop-2.7.3 is recommended.
Spark package
Used for the distributed computation itself; apache spark 2.0.2 is recommended.
ES-Hadoop plugin (assuming interaction with elasticSearch is needed)
es-hadoop is the plugin that integrates hadoop/spark with elasticSearch; es-hadoop 5.1.1 is recommended.
Installation Steps
1. Install JDK 1.7
mkdir /usr/local/java && cd /usr/local/java
wget "http://download.oracle.com/otn/java/jdk/7u76-b13/jdk-7u76-linux-x64.tar.gz"
tar -zxf jdk-7u76-linux-x64.tar.gz && rm -f jdk-7u76-linux-x64.tar.gz
Add the following variables to /etc/profile:
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
export JRE_HOME=/usr/local/java/jdk1.7.0_76/jre
export PATH=$PATH:/usr/local/java/jdk1.7.0_76/bin
export CLASSPATH=./:/usr/local/java/jdk1.7.0_76/lib:/usr/local/java/jdk1.7.0_76/jre/lib
source /etc/profile
2. Install Scala 2.11
mkdir /usr/local/scala && cd /usr/local/scala/
wget "http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz"
tar -zxf scala-2.11.8.tgz && rm -f scala-2.11.8.tgz
echo "export SCALA_HOME=/usr/local/scala/scala-2.11.8" >> /etc/profile
echo "export PATH=$SCALA_HOME/bin:$PATH" >> /etc/profile
source /etc/profile
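With steps 1 and 2 done, a quick sanity check that both toolchains are visible in a fresh login shell (the version strings are roughly what the recommended packages should report):
java -version     # expect something like: java version "1.7.0_76"
scala -version    # expect something like: Scala code runner version 2.11.8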
3. Deploy Hadoop
- Create a dedicated user for hadoop
groupadd hadoop # add the hadoop group
useradd hadoop -g hadoop # create the hadoop user and put it in the hadoop group
vim /etc/sudoers # edit the sudoers file to grant the hadoop user sudo privileges
hadoop ALL=(ALL) ALL # append this line at the end of sudoers
- Rename each machine's hostname so the nodes are easy to tell apart
Assume three machines, one used as the master node and two as slave nodes:
192.168.1.100 master
192.168.1.101 slave01
192.168.1.102 slave02
After changing the hostnames to master, slave01 and slave02 respectively, configure /etc/hosts on every node:
echo "192.168.1.100 master" >> /etc/hosts
echo "192.168.1.101 slave01" >> /etc/hosts
echo "192.168.1.102 slave02" >> /etc/hosts
- Configure passwordless login
In the hadoop cluster, the namenode (master node) must be able to log in as the hadoop user, without a password, to itself and to every datanode (slave node). In practice this means generating an RSA key pair on the master and distributing the public key into the ssh configuration directory (~/.ssh/authorized_keys) of the hadoop user on each slave; the detailed steps are omitted here, but a minimal sketch follows.
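A minimal sketch of the key distribution, assuming the hadoop user already exists on all three nodes (and has a password set, e.g. via passwd, so ssh-copy-id can authenticate):
# on the master node, as the hadoop user
su - hadoop
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa    # key pair without a passphrase
ssh-copy-id hadoop@master                   # authorize login to the master itself
ssh-copy-id hadoop@slave01
ssh-copy-id hadoop@slave02
ssh slave01 hostname                        # should print "slave01" without a password prompt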
- Download hadoop-2.7.3
mkdir /usr/local/hadoop && cd /usr/local/hadoop
wget "https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz"
tar -zxf hadoop-2.7.3.tar.gz && rm -f hadoop-2.7.3.tar.gz
mkdir -p /usr/local/hadoop/hdfs/data
mkdir -p /usr/local/hadoop/hdfs/name
mkdir -p /usr/local/hadoop/tmp
chown -R hadoop:hadoop /usr/local/hadoop
cd hadoop-2.7.3 && su hadoop
- Configure environment variables (identical on all nodes)
echo "export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3" >> /etc/profile
echo "export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin" >> /etc/profile
source /etc/profile
- Edit the configuration files under ${HADOOP_HOME}/etc/hadoop/ (they can be prepared on the master node first and then copied to the worker nodes)
vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
vim etc/hadoop/yarn-env.sh
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
vim etc/hadoop/slaves # list the hostnames of the datanodes in the slaves file; adjust to your setup
slave01
slave02
vim etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
<description>HDFS URI, in the form filesystem://namenode-host:port</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>local hadoop temporary directory on the namenode</description>
</property>
</configuration>
vim etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.name.dir</name>
<value>/usr/local/hadoop/hdfs/name</value>
<description>where the namenode stores the HDFS namespace metadata</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/usr/local/hadoop/hdfs/data</value>
<description>physical storage location of the data blocks on each datanode</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
<description>replication factor; the default is 3, and it should not exceed the number of datanodes</description>
</property>
</configuration>
vim etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
</configuration>
vim etc/hadoop/mapred-site.xml # copy it from mapred-site.xml.template first if the file does not exist
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
After all the configuration files have been modified, copy the /usr/local/hadoop/ directory to the corresponding location on each datanode (a small rsync sketch follows).
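A minimal distribution sketch, assuming the passwordless ssh set up earlier and rsync available on every node:
# on each slave, prepare a writable target first:
#   sudo mkdir -p /usr/local/hadoop && sudo chown -R hadoop:hadoop /usr/local/hadoop
for node in slave01 slave02; do
    rsync -a /usr/local/hadoop/ hadoop@${node}:/usr/local/hadoop/
done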
- Initialize and start the hadoop cluster; run the following on the master node
cd /usr/local/hadoop/hadoop-2.7.3 && su hadoop
bin/hdfs namenode -format
sbin/start-dfs.sh
sbin/start-yarn.sh
Once hadoop is running, the HDFS status page is available at http://master:50070/ and the YARN resource/application page at http://master:8088/.
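A quick process-level check of the daemons this configuration should start:
# on the master node
jps                  # expect NameNode, SecondaryNameNode and ResourceManager
# on each slave node
jps                  # expect DataNode and NodeManager
# HDFS health and the number of live datanodes
hdfs dfsadmin -report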
4. Deploy Spark
- Download spark-2.0.2-bin-hadoop2.7.tgz
mkdir /usr/local/spark/ && cd /usr/local/spark
wget "http://d3kbcqa49mib13.cloudfront.net/spark-2.0.2-bin-hadoop2.7.tgz"
tar -zxf spark-2.0.2-bin-hadoop2.7.tgz && rm -f spark-2.0.2-bin-hadoop2.7.tgz
chown -R hadoop:hadoop /usr/local/spark
cd spark-2.0.2-bin-hadoop2.7
- Configure environment variables
echo "export SPARK_HOME=/usr/local/spark/spark-2.0.2-bin-hadoop2.7" >> /etc/profile
echo "export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin" >> /etc/profile
source /etc/profile
- Edit the configuration files
vim conf/spark-env.sh # copy it from conf/spark-env.sh.template if the file does not exist
export JAVA_HOME=/usr/local/java/jdk1.7.0_76
export SCALA_HOME=/usr/local/scala/scala-2.11.8
export SPARK_HOME=/usr/local/spark/spark-2.0.2-bin-hadoop2.7
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
export SPARK_MASTER_HOST=master
export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.3/etc/hadoop/
export SPARK_HISTORY_OPTS="-Dspark.history.retainedApplications=3 -Dspark.history.fs.logDirectory=hdfs://master:9000/sparklogs"
export LD_LIBRARY_PATH=${HADOOP_HOME}/lib/native/:$LD_LIBRARY_PATH
vim conf/spark-defaults.conf
spark.eventLog.enabled true
# jars that spark-on-yarn depends on; they are uploaded to HDFS below so they need not be shipped on every submit
spark.yarn.jars hdfs:///sparkjars/*
spark.eventLog.dir hdfs://master:9000/sparklogs
vim conf/slaves
slave01
slave02
- Create the directories spark needs on HDFS (run from $SPARK_HOME as the hadoop user)
# log directory for the spark cluster: the history-server log directory defined in the configuration above
$ hdfs dfs -mkdir /sparklogs
# copy the spark jars onto HDFS so they do not have to be uploaded again for every job
$ hdfs dfs -mkdir /sparkjars
$ cd /usr/local/spark/spark-2.0.2-bin-hadoop2.7/ && hdfs dfs -put jars/* /sparkjars/
After the configuration is finished, copy the /usr/local/spark directory to the corresponding location on the other nodes and set up the environment variables there as well.
Start the spark cluster; on the master node run:
cd /usr/local/spark/spark-2.0.2-bin-hadoop2.7/ && ./sbin/start-all.sh
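A quick check that the standalone cluster came up:
# on the master node
jps    # a Master process should now be listed alongside the hadoop daemons
# on each slave node
jps    # a Worker process should be listed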
Verify with the bundled example
hadoop@master:/usr/local/spark/spark-2.0.2-bin-hadoop2.7# bin/spark-submit --class org.apache.spark.examples.JavaSparkPi \
    --master spark://master:7077 examples/jars/spark-examples_2.11-2.0.2.jar
16/12/26 15:41:13 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
[Stage 0:> (0 + 0) / 2]16/12/26 15:41:20 WARN TaskSetManager: Stage 0 contains a task of very large size (981 KB). The maximum recommended task size is 100 KB.
Pi is roughly 3.13608
After spark has started, the current state of the cluster can be viewed at http://master:8080/.
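Since HADOOP_CONF_DIR and spark.yarn.jars are already configured above, the same example can also be submitted to YARN instead of the standalone master (a sketch with default deploy mode and resources):
bin/spark-submit --class org.apache.spark.examples.JavaSparkPi \
    --master yarn examples/jars/spark-examples_2.11-2.0.2.jar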
5. Integrating with elasticSearch
Download ES-Hadoop
mkdir /usr/local/es-hadoop && cd /usr/local/es-hadoop
wget "http://download.elastic.co/hadoop/elasticsearch-hadoop-5.1.1.zip"
unzip elasticsearch-hadoop-5.1.1.zip && rm -f elasticsearch-hadoop-5.1.1.zip
cp elasticsearch-hadoop-5.1.1/dist/elasticsearch-hadoop-5.1.1.jar /usr/local/spark/spark-2.0.2-bin-hadoop2.7/jars/
Access and operate on elasticSearch from spark; a sketch of what such a script might contain follows the command below.
hadoop@master:/usr/local/spark/spark-2.0.2-bin-hadoop2.7# ./bin/spark-submit your_spark_es_script.py
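your_spark_es_script.py above is only a placeholder; as an illustration, a minimal script of that shape might look like the following, using the MR-style EsInputFormat/EsOutputFormat shipped in the es-hadoop jar (the ES host, index name and documents are assumptions):
# your_spark_es_script.py -- minimal sketch; host, index and data below are made up
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("es-hadoop-demo"))

es_conf = {
    "es.nodes": "192.168.1.100",   # assumed elasticsearch host
    "es.port": "9200",
    "es.resource": "demo/docs",    # hypothetical index/type
}

# write two documents into elasticsearch
docs = sc.parallelize([("1", {"msg": "hello"}), ("2", {"msg": "world"})])
docs.saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)

# read them back as an RDD of (document id, document) pairs
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf)
print(rdd.take(2))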