1. Spark 2.2 Quick Start (Local Mode)
1.1 Spark Local Mode
When learning Spark, start with the easy part before the hard part, and the easiest place to start is local mode.
Local mode (local) is commonly used for local development and testing: Spark works straight from the decompressed package, so it is truly "ready to use out of the box".
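In local mode the driver and the executor run inside a single JVM, and the degree of parallelism is controlled through the master URL: local uses one worker thread, local[N] uses N threads, and local[*] uses as many threads as there are CPU cores. As a small example (which you can try once Spark is installed in section 1.3), the master URL can be passed explicitly when starting the shell:
bin/spark-shell --master local[2]
bin/spark-shell --master local[*]
The first command runs Spark with 2 worker threads; the second uses one thread per core, which is also the default for spark-shell (it appears later as master = local[*] in section 1.6).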
1.2 Installing JDK 8
(1) Download
Go to the Oracle website at http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html, accept the license agreement, and choose the 64-bit Linux tar package. You can download it directly from the link, or use a multi-threaded download tool (such as Xunlei/Thunder) to speed up the download.
(2) Upload to the server
Upload the JDK 8 package downloaded on Windows to the server 192.168.1.180 via XShell.
(3) Extract
Extract the package to the /opt directory. For easier management, I install all third-party software under /opt.
[root@master ~]# tar -zxvf jdk-8u144-linux-x64.tar.gz -C /opt
(4) Configure the JDK environment variables
The environment variables could be set in /etc/profile, but for easier management we create a custom.sh file under /etc/profile.d/ to hold the user-defined environment variables.
[root@master ~]# vi /etc/profile.d/custom.sh
[root@master ~]# cat /etc/profile.d/custom.sh
#java path
export JAVA_HOME=/opt/jdk1.8.0_144
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
[root@master ~]#
(5) Apply the environment variables
[root@master ~]# source /etc/profile.d/custom.sh
(6) Run java -version to verify the JDK installation
[root@master ~]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
[root@master ~]#
1.3 Downloading the Spark 2.x Package
(1) Open the Spark download page
http://spark.apache.org/downloads.html
(2) For the first option, choose the Spark release (2.2.0); for the second, choose the package type (the Hadoop 2.7 pre-built package); for the third, choose the download type (a direct download can be slow, so click "Select Apache Mirror").
(3) Click the spark-2.2.0-bin-hadoop2.7.tgz link and choose a nearby mirror (here, a mirror in China).
(4) Download the package. A multi-threaded download tool can speed this up; alternatively, pick the nearest mirror (here, the Tsinghua University mirror) and download it directly with wget:
[root@master ~]# wget http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
--2017-08-29 22:43:51-- http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
Resolving mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)... 101.6.6.177, 2402:f000:1:416:101:6:6:177
Connecting to mirrors.tuna.tsinghua.edu.cn (mirrors.tuna.tsinghua.edu.cn)|101.6.6.177|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203728858 (194M) [application/octet-stream]
Saving to: ‘spark-2.2.0-bin-hadoop2.7.tgz’
100%[============================================================================================================>] 203,728,858 9.79MB/s in 23s
2017-08-29 22:44:15 (8.32 MB/s) - ‘spark-2.2.0-bin-hadoop2.7.tgz’ saved [203728858/203728858]
[root@master ~]#
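Optionally, you can verify the integrity of the downloaded archive by comparing its checksum against the value published alongside the release on the Apache download page (the checksum format offered can differ between releases):
md5sum spark-2.2.0-bin-hadoop2.7.tgz
If the printed value does not match the published one, re-download the file.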
(5) Extract the package to the /opt directory. By convention, all third-party software on our Linux machines goes under /opt.
[root@master ~]# tar -zxvf spark-2.2.0-bin-hadoop2.7.tgz -C /opt
(6) Because the extracted Spark root directory name is long, rename it. This step is optional.
[root@master ~]# mv /opt/spark-2.2.0-bin-hadoop2.7/ /opt/spark-2.2.0
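Optionally, and purely for convenience (the rest of this tutorial keeps using paths relative to the Spark directory, so this step can be skipped), you could append Spark to the custom.sh file created in section 1.2 so that the Spark commands work from any directory:
#spark path
export SPARK_HOME=/opt/spark-2.2.0
export PATH=$PATH:$SPARK_HOME/bin
After editing, run source /etc/profile.d/custom.sh again to apply the change.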
1.4 Spark Directory Structure
[root@master ~]# cd /opt/spark-2.2.0/
[root@master spark-2.2.0]# ll
total 84
drwxr-xr-x. 2 500 500 4096 Jun 30 19:09 bin
drwxr-xr-x. 2 500 500 230 Jun 30 19:09 conf
drwxr-xr-x. 5 500 500 50 Jun 30 19:09 data
drwxr-xr-x. 4 500 500 29 Jun 30 19:09 examples
drwxr-xr-x. 2 500 500 12288 Jun 30 19:09 jars
-rw-r--r--. 1 500 500 17881 Jun 30 19:09 LICENSE
drwxr-xr-x. 2 500 500 4096 Jun 30 19:09 licenses
-rw-r--r--. 1 500 500 24645 Jun 30 19:09 NOTICE
drwxr-xr-x. 8 500 500 240 Jun 30 19:09 python
drwxr-xr-x. 3 500 500 17 Jun 30 19:09 R
-rw-r--r--. 1 500 500 3809 Jun 30 19:09 README.md
-rw-r--r--. 1 500 500 128 Jun 30 19:09 RELEASE
drwxr-xr-x. 2 500 500 4096 Jun 30 19:09 sbin
drwxr-xr-x. 2 500 500 42 Jun 30 19:09 yarn
[root@master spark-2.2.0]#
Directory | Description |
---|---|
bin | Executable scripts: the Spark command-line tools |
conf | Spark configuration files |
data | Data used by the bundled examples |
examples | Bundled example programs |
jars | Spark dependency JAR packages (in Spark 2.x these live in jars rather than lib) |
sbin | Scripts for starting and stopping a cluster, since Spark ships its own (standalone) cluster environment |
Notes on the scripts in the Spark bin directory:
- spark-shell: starts the interactive Spark shell (script)
- spark-submit: submits a Spark application to run (script; see the example below)
- run-example: runs one of the example programs bundled with Spark
- spark-sql: starts the Spark SQL command-line interface (script)
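For reference, run-example is essentially a thin wrapper around spark-submit that points it at the bundled examples JAR. A roughly equivalent way to launch by hand the SparkPi example used in the next section, using the examples JAR path that appears in its log output, would be:
bin/spark-submit --class org.apache.spark.examples.SparkPi --master local[4] examples/jars/spark-examples_2.11-2.2.0.jar 4
Here --class names the main class inside the JAR, --master local[4] runs it locally with 4 threads, and the trailing 4 is the number of slices SparkPi splits the computation into.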
1.5 Running a Sample Program
[root@master1 spark-2.2.0]# bin/run-example SparkPi 4 4
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/29 01:27:26 INFO SparkContext: Running Spark version 2.2.0
17/08/29 01:27:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/29 01:27:26 INFO SparkContext: Submitted application: Spark Pi
17/08/29 01:27:27 INFO SecurityManager: Changing view acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls to: root
17/08/29 01:27:27 INFO SecurityManager: Changing view acls groups to:
17/08/29 01:27:27 INFO SecurityManager: Changing modify acls groups to:
17/08/29 01:27:27 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/08/29 01:27:27 INFO Utils: Successfully started service 'sparkDriver' on port 40549.
17/08/29 01:27:27 INFO SparkEnv: Registering MapOutputTracker
17/08/29 01:27:27 INFO SparkEnv: Registering BlockManagerMaster
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/29 01:27:27 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-719136e3-dc4e-4061-a07a-e5f04d679ad1
17/08/29 01:27:27 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
17/08/29 01:27:27 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/29 01:27:27 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/29 01:27:27 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.180:4040
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/scopt_2.11-3.3.0.jar at spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO SparkContext: Added JAR file:/opt/spark-2.2.0/examples/jars/spark-examples_2.11-2.2.0.jar at spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:27 INFO Executor: Starting executor ID driver on host localhost
17/08/29 01:27:27 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 43952.
17/08/29 01:27:27 INFO NettyBlockTransferService: Server created on 192.168.1.180:43952
17/08/29 01:27:27 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/29 01:27:27 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.180:43952 with 366.3 MB RAM, BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:27 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.180, 43952, None)
17/08/29 01:27:28 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark-2.2.0/spark-warehouse').
17/08/29 01:27:28 INFO SharedState: Warehouse path is 'file:/opt/spark-2.2.0/spark-warehouse'.
17/08/29 01:27:29 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/08/29 01:27:29 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
17/08/29 01:27:29 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 4 output partitions
17/08/29 01:27:29 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
17/08/29 01:27:29 INFO DAGScheduler: Parents of final stage: List()
17/08/29 01:27:29 INFO DAGScheduler: Missing parents: List()
17/08/29 01:27:29 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
17/08/29 01:27:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.1.180:43952 (size: 1172.0 B, free: 366.3 MB)
17/08/29 01:27:29 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1006
17/08/29 01:27:29 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/08/29 01:27:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/08/29 01:27:29 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 4825 bytes)
17/08/29 01:27:29 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/29 01:27:29 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/08/29 01:27:29 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/08/29 01:27:29 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO TransportClientFactory: Successfully created connection to /192.168.1.180:40549 after 34 ms (0 ms spent in bootstraps)
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/scopt_2.11-3.3.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp1808807623002630899.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/scopt_2.11-3.3.0.jar to class loader
17/08/29 01:27:29 INFO Executor: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar with timestamp 1503984447798
17/08/29 01:27:29 INFO Utils: Fetching spark://192.168.1.180:40549/jars/spark-examples_2.11-2.2.0.jar to /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/fetchFileTemp3327801226116360399.tmp
17/08/29 01:27:29 INFO Executor: Adding file:/tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8/userFiles-28264a42-00c6-42cb-8d3f-e4fe670fb272/spark-examples_2.11-2.2.0.jar to class loader
17/08/29 01:27:30 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 436 ms on localhost (executor driver) (1/4)
17/08/29 01:27:30 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 867 bytes result sent to driver
17/08/29 01:27:30 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 423 ms on localhost (executor driver) (2/4)
17/08/29 01:27:30 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 424 ms on localhost (executor driver) (3/4)
17/08/29 01:27:30 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 867 bytes result sent to driver
17/08/29 01:27:30 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 428 ms on localhost (executor driver) (4/4)
17/08/29 01:27:30 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/08/29 01:27:30 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.482 s
17/08/29 01:27:30 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.766385 s
Pi is roughly 3.1493878734696836
17/08/29 01:27:30 INFO SparkUI: Stopped Spark web UI at http://192.168.1.180:4040
17/08/29 01:27:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/29 01:27:30 INFO MemoryStore: MemoryStore cleared
17/08/29 01:27:30 INFO BlockManager: BlockManager stopped
17/08/29 01:27:30 INFO BlockManagerMaster: BlockManagerMaster stopped
17/08/29 01:27:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/08/29 01:27:30 INFO SparkContext: Successfully stopped SparkContext
17/08/29 01:27:30 INFO ShutdownHookManager: Shutdown hook called
17/08/29 01:27:30 INFO ShutdownHookManager: Deleting directory /tmp/spark-058642cb-042f-4960-b7e9-172fc02caff8
[root@master1 spark-2.2.0]#
As the output shows, the result is: Pi is roughly 3.1493878734696836
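If the INFO log lines make the output hard to read, the console log level can be lowered by creating a log4j.properties file from the template that ships in the conf directory and changing the root category from INFO to WARN (this only affects logging, not the computation):
cp conf/log4j.properties.template conf/log4j.properties
Then edit conf/log4j.properties and change the line log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console.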
1.6 First Look at spark-shell
Start spark-shell:
[root@master spark-2.2.0]# bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/08/28 23:32:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/28 23:32:50 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://192.168.1.180:4040
Spark context available as 'sc' (master = local[*], app id = local-1503977564935).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.2.0
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_144)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
The spark-shell log above contains the line Spark context Web UI available at http://192.168.1.180:4040, which means spark-shell has started a Web UI. Enter http://192.168.1.180:4040 in your browser's address bar to open it.
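Because the shell already exposes a SparkContext as sc and a SparkSession as spark, a one-line job is enough to confirm that everything works; for example, summing the numbers 1 to 100 should return 5050.0:
scala> sc.parallelize(1 to 100).sum()
res0: Double = 5050.0
While the shell is open, this small job also shows up on the Web UI at http://192.168.1.180:4040.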