Learning Spark: Standalone Mode Deployment in Practice

Standalone Mode Deployment in Practice

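Spark has several run modes. This time we will walk through standalone mode, in which Spark itself handles everything except file storage, including cluster management, task scheduling, and computation. This mode suits smaller applications that do not need YARN or similar systems complicating the deployment. Put more formally, the Spark cluster runs on Spark's built-in resource manager and scheduler in a Master/Slave architecture, and ZooKeeper can be used to remove the single point of failure and provide high availability (HA). Let's get started.
First, prepare the following: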

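The application to deploy, already packaged as a jar; if you do not have one, use the bundled examples
Four Linux machines, physical or virtual, that can all ping each other and already have a JDK installed (mine is jdk-1.8.0_101)
The Spark package (my version is spark-2.3-hadoop-2.7)
The ZooKeeper package
A Scala environment on every machine (my version is scala-2.11.12)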

I. Operating System Preparation

The four machines have the following IPs:

10.1.161.91
10.1.161.92
10.1.161.94
10.1.161.95

1. Changing the hostnames

To make later steps easier, rename the hosts to a uniform format. My machines map as follows:

10.1.161.91  --->   
10.1.161.92  --->   
10.1.161.94  --->   
10.1.161.95  --->

2. Mapping hostnames to IPs in the hosts file

Run on every machine:

vi /etc/hosts

Append the following entries to the end of the file:

10.1.161.95 
10.1.161.94 
10.1.161.91 
10.1.161.92
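The append step can be sketched as a small helper that only adds a mapping when it is not already present, so running it twice does not duplicate entries. This is a minimal sketch: the function name add_host_entry and the hostnames node1 and node2 are hypothetical and stand in for whatever names you chose.

```shell
#!/usr/bin/env bash
# Append "<ip> <hostname>" to a hosts file unless that exact line already exists.
add_host_entry() {
  local file="$1" ip="$2" name="$3"
  # -x matches the whole line, -F treats the pattern as a fixed string
  if ! grep -qxF "$ip $name" "$file" 2>/dev/null; then
    echo "$ip $name" >> "$file"
  fi
}

# Example usage (as root; node1 and node2 are made-up names):
#   add_host_entry /etc/hosts 10.1.161.91 node1
#   add_host_entry /etc/hosts 10.1.161.92 node2
```

Taking the file as a parameter makes it easy to try the helper on a scratch copy before touching /etc/hosts.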

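After saving, test connectivity with ping, for example from one machine to another. The result looks like this:
[screenshot]
Try this on every machine to make sure the mappings work and to avoid errors later.
Also test connectivity to the external network.

If that fails, check whether DNS resolution is the problem: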

vi /etc/

Try changing the nameserver entry. If something else is the cause, troubleshoot it yourself.

3. Passwordless access

If OpenSSH is not installed on a machine, install it with:

yum install openssh-server

Install it on every machine. Then generate a key pair:

ssh-keygen -t rsa

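Just press Enter at each prompt. I had already generated a key, so it asks whether to overwrite:
[screenshot]
Once the key is generated, copy the public key to the other machines with: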

ssh-copy-id -i 
ssh-copy-id -i 
ssh-copy-id -i

On test01 I ran:

ssh-copy-id -i

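Result:
[screenshot]
You will be asked for the login password during this step. Run it on every machine so that any two machines can log in to each other without a password; only then can they work together as a cluster.
Next, test passwordless access by running: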

ssh

Result:
[screenshot]
Passwordless access now works, and the other machines are reachable as well.

The same goes for the other machines:

ssh 
ssh

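As long as any two machines can reach each other without a password, you are set. You can also use scp to check that files copy between machines without a password prompt.
That completes this step.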
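To avoid testing each host by hand, the check can be looped. The sketch below relies on the ssh option BatchMode=yes, which makes ssh fail instead of prompting for a password. The function name check_passwordless, the SSH_CMD override variable, and the hostnames in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Check that this machine can log in to each listed host without a password prompt.
# SSH_CMD can be overridden (for example in a dry run); it defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

check_passwordless() {
  local failed=0 host
  for host in "$@"; do
    # BatchMode=yes disables password prompts, so a missing key shows up as a failure
    if "$SSH_CMD" -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example usage (made-up hostnames):
#   check_passwordless node2 node3 node4
```

Run it on every machine in turn, since the goal is that each machine can reach all of the others.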

II. Environment Installation

1. Installation packages

The JDK, Scala, Spark, Hadoop, and ZooKeeper packages are shared here:

/s/1wq77i-EB5kh5j3ZncBEa6g   password: 1v2p

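I suggest downloading all packages into one directory, such as /usr/local/sparksoft. Whether you upload them over FTP or use a shared folder in a VM does not matter; use whatever is convenient.
Install everything on one machine first, then copy the whole installation to the other machines so that every machine has the same installation and configuration.
[screenshot]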

2. Base environment

Installing the JDK

Copy the package into the current directory and extract it:

    tar -zxvf

After extracting, configure the environment variables:

    vi /etc/profile

Append the following to the end of the file:

    export JAVA_HOME=/usr/local/java/jdk1.8.0_131
    export JRE_HOME=${JAVA_HOME}/jre
    export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib

Save, then apply the changes immediately:

    source /etc/profile

Check that the installation succeeded:
[screenshot]
If a version is printed, it worked.
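A quick way to catch a mistyped JAVA_HOME is to check that it actually contains an executable bin/java. This is only a sketch, and the function name check_java_home is hypothetical.

```shell
#!/usr/bin/env bash
# Verify that a JDK directory (default: $JAVA_HOME) contains an executable bin/java.
check_java_home() {
  local home="${1:-${JAVA_HOME:-}}"
  if [ -x "$home/bin/java" ]; then
    echo "java found at $home/bin/java"
  else
    echo "no executable java under '$home'" >&2
    return 1
  fi
}

# Example usage after `source /etc/profile`:
#   check_java_home && "$JAVA_HOME/bin/java" -version
```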

Installing Scala

This works the same way as the JDK installation, so I won't repeat it. The result:
[screenshot]

3. Installing Spark

mkdir /usr/local/spark-2.3-hadoop-2.7
cd /usr/local/spark-2.3-hadoop-2.7
cp /usr/local/sparksoft/spark-2.3-bin-hadoop2. .
tar -zxvf spark-2.3-bin-hadoop2.

Once extracted, set SPARK_HOME in the same way:

export SPARK_HOME=/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7
export PATH=$PATH:${JAVA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin

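With this, Spark's shell scripts can be run from any directory without going into the installation directory.
In this mode Spark manages resources itself, so nothing like YARN needs to be installed. The Master process plays the role of the ResourceManager and the Worker processes do the actual work. We also assume our machines are sturdy enough not to fail, so we ignore the single-point-of-failure problem for now and simply get things running.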
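If master failover is needed later, Spark's standalone mode supports ZooKeeper-based recovery through JVM properties passed to the daemons. The fragment below is a sketch for conf/spark-env.sh on each master candidate. The ZooKeeper hosts zk1, zk2, and zk3 are hypothetical; the three spark.deploy.* property names come from the Spark standalone documentation.

```shell
# Fragment for conf/spark-env.sh (hypothetical ZooKeeper ensemble zk1..zk3).
# recoveryMode=ZOOKEEPER lets standby masters take over when the active one fails;
# zookeeper.dir is the znode under which Spark stores its recovery state.
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
-Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
-Dspark.deploy.zookeeper.dir=/spark"
```

This walkthrough does not use it; a single master is enough for a first run.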
Next comes the startup phase.

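III. Configuration and Startup

The environment on this machine is now mostly configured, so copy its setup to the other machines, including the environment file, Java, Scala, and Spark. If you installed Hadoop, copy that too. In short, make sure all machines have the same environment: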

scp /etc/profile :/etc/profile
scp -r /usr/local/java/jdk1.8.0_131 :/usr/local/java/jdk1.8.0_131/

The remaining commands are similar, so I won't repeat them.
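Since the same copy is repeated for every machine, a loop saves typing. The sketch below prints the scp commands by default so they can be reviewed first, and runs them only when RUN is set to an empty string. The function name sync_to_hosts, the RUN variable, and the hostnames are hypothetical.

```shell
#!/usr/bin/env bash
# Copy one path to the same location on several hosts.
# By default this is a dry run that only prints the commands;
# invoke with RUN= (empty) in the environment to actually execute them.
RUN="${RUN-echo}"

sync_to_hosts() {
  local src="$1" dst="$2" host
  shift 2
  for host in "$@"; do
    $RUN scp -r "$src" "$host:$dst"
  done
}

# Example usage (made-up hostnames):
#   sync_to_hosts /etc/profile /etc/profile node2 node3 node4
```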
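Standalone mode is Spark's own resource scheduling framework. Its main nodes are the Client, the Master, and the Workers. The Driver can run either inside the cluster or on the local Client. With the interactive spark-shell, or when a job is started from an IDE such as Eclipse or IDEA using ”new (“spark://master:7077”)”, the Driver runs in the process that launched it (client mode). spark-submit also defaults to client mode; when a job is submitted with --deploy-mode cluster, the Driver is instead launched on one of the Worker nodes (cluster mode). Here is a diagram of the execution flow:
[diagram]
How many Executors a Worker spawns and how many cores each Executor uses can be configured in the environment file under /usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf, or left at the defaults. If the file does not exist, create it by copying the template.

[screenshot]
The following settings can be added: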

[root@test01 conf]# vi 

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    /licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is sourced when running various Spark programs.
# Copy it as  and edit that to configure Spark for your site.

# Options read when launching programs locally with
# ./bin/run-example or ./bin/spark-submit
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public dns name of the driver program

# Options read by executors and drivers running inside the cluster
# - SPARK_LOCAL_IP, to set the IP address Spark binds to on this node
# - SPARK_PUBLIC_DNS, to set the public DNS name of the driver program
# - SPARK_LOCAL_DIRS, storage directories to use on this node for shuffle and RDD data
# - MESOS_NATIVE_JAVA_LIBRARY, to point to your  if you use Mesos

# Options read in YARN client/cluster mode
# - SPARK_CONF_DIR, Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
# - YARN_CONF_DIR, to point Spark towards YARN configuration files when you use YARN
# - SPARK_EXECUTOR_CORES, Number of cores for the executors (Default: 1).
# - SPARK_EXECUTOR_MEMORY, Memory per Executor (e.g. 1000M, 2G) (Default: 1G)
# - SPARK_DRIVER_MEMORY, Memory for Driver (e.g. 1000M, 2G) (Default: 1G)

# Options for the daemons used in the standalone deploy mode
# - SPARK_MASTER_HOST, to bind the master to a different IP address or hostname
# - SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports for the master
# - SPARK_MASTER_OPTS, to set config properties only for the master (e.g. "-Dx=y")
# - SPARK_WORKER_CORES, to set the number of cores to use on this machine
# - SPARK_WORKER_MEMORY, to set how much total memory workers have to give executors (e.g. 1000m, 2g)
# - SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports for the worker
# - SPARK_WORKER_DIR, to set the working directory of worker processes
# - SPARK_WORKER_OPTS, to set config properties only for the worker (e.g. "-Dx=y")
# - SPARK_DAEMON_MEMORY, to allocate to the master, worker and history server themselves (default: 1g).
# - SPARK_HISTORY_OPTS, to set config properties only for the history server (e.g. "-Dx=y")
# - SPARK_SHUFFLE_OPTS, to set config properties only for the external shuffle service (e.g. "-Dx=y")
# - SPARK_DAEMON_JAVA_OPTS, to set config properties for all daemons (e.g. "-Dx=y")
# - SPARK_DAEMON_CLASSPATH, to set the classpath for all daemons
# - SPARK_PUBLIC_DNS, to set the public dns name of the master or workers

# Generic options for the daemons used in the standalone deploy mode
# - SPARK_CONF_DIR      Alternate conf dir. (Default: ${SPARK_HOME}/conf)
# - SPARK_LOG_DIR       Where log files are stored.  (Default: ${SPARK_HOME}/logs)
# - SPARK_PID_DIR       Where the pid file is stored. (Default: /tmp)
# - SPARK_IDENT_STRING  A string representing this instance of spark. (Default: $USER)
# - SPARK_NICENESS      The scheduling priority for daemons. (Default: 0)
# - SPARK_NO_DAEMONIZE  Run the proposed command in the foreground. It will not output a PID file.
# Options for native BLAS, like Intel MKL, OpenBLAS, and so on.
# You might get better performance to enable these options if using native BLAS (see SPARK-21305).
# - MKL_NUM_THREADS=1        Disable multi-threading of Intel MKL
# - OPENBLAS_NUM_THREADS=1   Disable multi-threading of OpenBLAS
export JAVA_HOME=/usr/local/java/jdk1.8.0_131
export SPARK_HOME=/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7
export SPARK_EXECUTOR_MEMORY=5G
export SPARK_EXECUTOR_CORES=2
export SPARK_WORKER_CORES=2

Note that JAVA_HOME and SPARK_HOME must be set. The other settings depend on your own situation and can be left out.
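The core and memory settings interact. For a single application, a worker can host roughly as many executors as fit within both its cores and its memory. The arithmetic can be sketched as below; the function name executors_per_worker is hypothetical, and the 8 GB of worker memory is an assumed figure because the configuration above does not set SPARK_WORKER_MEMORY. Application-level limits such as spark.cores.max can lower the number further.

```shell
#!/usr/bin/env bash
# Rough upper bound on executors one worker can host for a single application:
# the smaller of (worker cores / executor cores) and (worker memory / executor memory).
executors_per_worker() {
  local worker_cores="$1" executor_cores="$2" worker_mem_gb="$3" executor_mem_gb="$4"
  local by_cores=$(( worker_cores / executor_cores ))
  local by_mem=$(( worker_mem_gb / executor_mem_gb ))
  if [ "$by_cores" -lt "$by_mem" ]; then
    echo "$by_cores"
  else
    echo "$by_mem"
  fi
}

# SPARK_WORKER_CORES=2, SPARK_EXECUTOR_CORES=2, assumed 8 GB worker memory, 5G executors
executors_per_worker 2 2 8 5   # prints 1
```

With the values used here, each worker ends up with a single two-core executor.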
Copy the configuration to the other machines:

scp  :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/
scp  :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/
scp  :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/

Next, assign roles to the machines (as you see fit):

Machine	Role
	master
	worker
	worker
	worker

One machine takes the master role and all the others act as workers. Record this decision in Spark's slaves file under the conf directory:

cp  ./slaves
vi slaves

Edit it to look like this:

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    /licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# A Spark Worker will be started on each of the machines listed below.
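The slaves file simply lists one worker hostname per line below that comment. As a sketch, it could also be generated from a host list; the function name write_slaves and the hostnames node2 to node4 are hypothetical placeholders for your own worker names.

```shell
#!/usr/bin/env bash
# Write a slaves file containing one worker hostname per line.
write_slaves() {
  local file="$1"
  shift
  {
    echo "# A Spark Worker will be started on each of the machines listed below."
    printf '%s\n' "$@"
  } > "$file"
}

# Example usage (made-up hostnames):
#   write_slaves "$SPARK_HOME/conf/slaves" node2 node3 node4
```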

Save, then copy this file to the other machines as well:

scp slaves :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/
scp slaves :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/
scp slaves :/usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/conf/

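With the preparation mostly done, we can start things up.

How standalone cluster mode differs from client mode

The client-side SparkSubmit process exits once the application has been submitted to the cluster
The Master picks a Worker in the cluster, which spawns a DriverWrapper child process to start the driver program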
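On the command line, the two modes differ only in the --deploy-mode flag of spark-submit. The sketch below just prints the two command lines side by side; the function name submit_cmd, the host master-host, the class com.example.MyApp, and the jar path are all hypothetical.

```shell
#!/usr/bin/env bash
# Print a spark-submit command line for the given deploy mode (client or cluster).
submit_cmd() {
  local mode="$1" master="$2" class="$3" jar="$4"
  echo spark-submit --class "$class" --master "$master" --deploy-mode "$mode" "$jar"
}

# Driver runs in the submitting process:
submit_cmd client  spark://master-host:7077 com.example.MyApp /path/to/my-app.jar
# Driver is launched on one of the workers:
submit_cmd cluster spark://master-host:7077 com.example.MyApp /path/to/my-app.jar
```

In cluster mode the driver starts on a worker, so the jar path has to be readable from the worker machines, not just from the machine where you type the command.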

Let's look at the master startup script under sbin:

#!/usr/bin/env bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    /licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# Starts the master on the machine this script is executed on.

if [ -z "${SPARK_HOME}" ]; then
  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
fi

# NOTE: This exact class name is matched downstream by SparkSubmit.
# Any changes need to be reflected there.
CLASS=""

if [[ "$@" = *--help ]] || [[ "$@" = *-h ]]; then
  echo "Usage: ./sbin/ [options]"
  pattern="Usage:"
  pattern+="\|Using Spark's default log4j profile:"
  pattern+="\|Registered signal handlers for"

  "${SPARK_HOME}"/bin/spark-class $CLASS --help 2>&1 | grep -v "$pattern" 1>&2
  exit 1
fi

ORIGINAL_ARGS="$@"

. "${SPARK_HOME}/sbin/"

. "${SPARK_HOME}/bin/"

if [ "$SPARK_MASTER_PORT" = "" ]; then
  SPARK_MASTER_PORT=7077
fi

if [ "$SPARK_MASTER_HOST" = "" ]; then
  case `uname` in
      (SunOS)
          SPARK_MASTER_HOST="`/usr/sbin/check-hostname | awk '{print $NF}'`"
          ;;
      (*)
          SPARK_MASTER_HOST="`hostname -f`"
          ;;
  esac
fi

if [ "$SPARK_MASTER_WEBUI_PORT" = "" ]; then
  SPARK_MASTER_WEBUI_PORT=8080
fi

"${SPARK_HOME}/sbin"/ start $CLASS 1 \
  --host $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  $ORIGINAL_ARGS

Three things stand out here:

Starts the master on the machine this script is executed on.
SPARK_MASTER_PORT=7077
SPARK_MASTER_WEBUI_PORT=8080

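The machine that runs this script becomes the master node, so I run it on the machine chosen as master. There are also two ports, which we will try in a moment.
First, run the script:
[screenshot]
Then check the processes.

Sure enough, there is a Master process. The jar process belongs to another program and can be ignored. The master node is up, so let's visit those two ports.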

This shows that port 8080 serves the web management UI and port 7077 is the master URL, which we will use shortly to submit the job.
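Without a browser, the two ports can also be probed from a shell. This sketch uses bash's /dev/tcp pseudo-device together with the coreutils timeout command; the function name port_open and the host master-host are hypothetical.

```shell
#!/usr/bin/env bash
# Return success if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
  local host="$1" port="$2"
  # /dev/tcp/<host>/<port> is a bash feature, so the probe runs in a bash subprocess
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example usage (made-up hostname):
#   port_open master-host 8080 && echo "web UI is listening"
#   port_open master-host 7077 && echo "master port is listening"
```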

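Now start the worker nodes.
Run the startup script:
[screenshot]
Then check the processes on each machine.

Each machine has taken up its role and is waiting for jobs to be submitted.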
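The process check can be scripted as well. jps prints one "<pid> <MainClass>" pair per line, so a small filter can tell whether a given daemon is present. This is a sketch, and the function name has_daemon is hypothetical.

```shell
#!/usr/bin/env bash
# Read `jps` output from stdin and succeed if a process with the given class name is listed.
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Example usage:
#   jps | has_daemon Master && echo "Master is running"
#   jps | has_daemon Worker && echo "Worker is running"
```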
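Visit the management UI again.

The difference from before is that the worker nodes now appear, but there is no application yet, so next we submit a job.
We will use one of the official examples. The submit command is: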

spark-submit --class  --master spark://:7077 --num-executors  2 /usr/local/spark-2.3-hadoop-2.7/spark-2.3.2-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.3.

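(spark-submit documents --num-executors as a YARN option; on a standalone cluster, --total-executor-cores and --executor-cores control how many cores the application gets.)
After submitting, watch the printed output:
[screenshot]
This address shows how the Spark application is running, but once the application finishes, this page goes away as well.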

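The computed result appears in the output. Now look at the management UI again.

One application has run and is now finished, having taken only two seconds. You can replace the official example with your own application in the same way.
Overall, the initial goal has been reached. Some details and other issues were left out, but you can work through them yourself; after going through the process a few times it will become familiar. That is all for this article.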

秒客网
