spark之scala程序开发(本地运行模式):单词出现次数统计

时间:2021-11-12 16:13:20

准备工作:

将运行Scala-Eclipse的机器节点(CloudDeskTop)内存调整至4G,因为需要在该节点上跑本地(local)Spark程序,本地Spark程序会启动Worker进程耗用大量内存资源

本地运行模式(主要用于调试)


1、首先将Spark的所有jar包拷贝到hadoop用户家目录下

[hadoop@CloudDeskTop spark-2.1.1]$ pwd
/software/spark-2.1.1
[hadoop@CloudDeskTop spark-2.1.1]$ cp -a jars /home/hadoop/bigdata/lib/Spark2.1.1-All
[hadoop@CloudDeskTop spark-2.1.1]$ ls ~/bigdata/lib
Hadoop2.7.3-All HBase1.2.6-All RPC Spark2.1.1-All

2、在Scala4.4的Eclipse版本中,新建一个Scala的工程

然后在Eclipse中创建一个Spark2.1.1-All的用户库,将~/Spark2.1.1-All目录中的所有jar包导入Spark2.1.1-All用户库中,并将此用户库添加到当前的Scala工程中:Window==>Preferences==>Java==>Build Path==>User Libraries==>New

spark之scala程序开发(本地运行模式):单词出现次数统计

3、在本地创建测试数据

[hadoop@CloudDeskTop scala]$ pwd
/home/hadoop/test/scala
[hadoop@CloudDeskTop scala]$ ls
information information02
[hadoop@CloudDeskTop scala]$ cat information
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎
[hadoop@CloudDeskTop scala]$ cat information02
zhnag san shi yi ge hao ren
jin tian shi yi ge hao tian qi
wo zai zhe li zuo le yi ge ce shi
yi ge guan yu scala de ce shi
welcome to mmzs
欢迎 欢迎

4、创建一个带main方法的入口类

File—>New—>Scala Object—>在弹出对话框的Name栏输入:com.mmzs.bigdata.spark.rdd.local.SparkRDD,其中SparkRDD是类名、com.mmzs.bigdata.spark.rdd.local是包名

object WordCount {
  //建立一个main方法
  def main(args: Array[String]): Unit = {
    ......
  }
}

完整的代码如下:

package com.mmzs.bigdata.spark.rdd.local

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD.rddToOrderedRDDFunctions
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions object SparkRDD {
def main(args: Array[String]): Unit = {
//读取spark配置文件
val conf:SparkConf = new SparkConf();
//本地测试模式
conf.setMaster("local");//Local Scala Spark RDD
//设定应用名字
conf.setAppName("sparkTestLocal"); //根据配置获取spark上下文对象
val sc:SparkContext = new SparkContext(conf); //使用sparkContext读取文件到内存并生成RDD对象,
//指定一个输入目录即可,目录中的所有文件都将作为输入文件
val lineRdd:RDD[String] = sc.textFile("/home/hadoop/test/scala");
//使用空格切分每一行的数据为单词数组,并将单词数组中的单词子串释放到外层的RDD集合中
val flatRdd:RDD[String] = lineRdd.flatMap { line => line.split(" "); }
//将RDD中的每一个单词字串转化为元组,以完成单词计数
val mapRDD:RDD[Tuple2[String, Integer]] = flatRdd.map(word=>(word,1));
//按RDD集合中每一个元组的第一个元素(即单词字串)进行分组并完成单词计数
val reduceRDD:RDD[(String, Integer)] = mapRDD.reduceByKey((pre:Integer, next:Integer)=>pre+next);
//交换元素中的key和value的位置便于后续排序
val reduceRDD02:RDD[(Integer, String)] = reduceRDD.map(tuple=>(tuple._2,tuple._1));
//根据key进行排序,第二个参数表示启动的Task数量,设置大了可能会抛出内存溢出的异常
val sortRDD:RDD[(Integer, String)] = reduceRDD02.sortByKey(false, 1);
//排好序之后将顺序换回来
val sortRDD02:RDD[(Integer, String)] = reduceRDD.map(tuple=>(tuple._2,tuple._1)); //指定一个输出目录,如果输出目录已经事先存在则应该将它删除掉
//val f:File=new File("/home/hadoop/test/scala/output");
//if(f.exists()&&f.isDirectory()) f.delete();
//将结果保存到指定路径下
reduceRDD.saveAsTextFile("/home/hadoop/test/scala/output/"); //停止使用spark上下文对象
sc.stop();
}
}

5、运行结果如下:

[hadoop@CloudDeskTop scala]$ ll
总用量 8
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:27 information
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:29 information02
[hadoop@CloudDeskTop scala]$ ll
总用量 12
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:27 information
-rw-rw-r-- 1 hadoop hadoop 156 2月 7 15:29 information02
drwxrwxr-x 2 hadoop hadoop 4096 2月 7 15:54 output
[hadoop@CloudDeskTop scala]$ cd output/
[hadoop@CloudDeskTop output]$ ls
part-00000 part-00001 _SUCCESS
[hadoop@CloudDeskTop output]$ cat part-00000
(scala,2)
(zuo,2)
(tian,4)
(shi,8)
(ce,4)
(zai,2)
(欢迎,4)
(wo,2)
(zhnag,2)
(san,2)
(welcome,2)
(yi,8)
(ge,8)
(hao,4)
(qi,2)
(yu,2)
[hadoop@CloudDeskTop output]$ cat part-00001
(guan,2)
(jin,2)
(ren,2)
(de,2)
(le,2)
(to,2)
(zhe,2)
(li,2)
(mmzs,2)

在Scala4.4的Eclipse的控制台显示信息如下:

 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/02/07 15:54:12 INFO SparkContext: Running Spark version 2.1.1
18/02/07 15:54:12 WARN SparkContext: Support for Java 7 is deprecated as of Spark 2.0.0
18/02/07 15:54:13 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/07 15:54:14 INFO SecurityManager: Changing view acls to: hadoop
18/02/07 15:54:14 INFO SecurityManager: Changing modify acls to: hadoop
18/02/07 15:54:14 INFO SecurityManager: Changing view acls groups to:
18/02/07 15:54:14 INFO SecurityManager: Changing modify acls groups to:
18/02/07 15:54:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/02/07 15:54:14 INFO Utils: Successfully started service 'sparkDriver' on port 37639.
18/02/07 15:54:14 INFO SparkEnv: Registering MapOutputTracker
18/02/07 15:54:14 INFO SparkEnv: Registering BlockManagerMaster
18/02/07 15:54:14 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/02/07 15:54:14 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/02/07 15:54:14 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b7a711cf-321e-43af-95ca-c504982e56e4
18/02/07 15:54:14 INFO MemoryStore: MemoryStore started with capacity 348.0 MB
18/02/07 15:54:15 INFO SparkEnv: Registering OutputCommitCoordinator
18/02/07 15:54:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/02/07 15:54:15 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.154.134:4040
18/02/07 15:54:15 INFO Executor: Starting executor ID driver on host localhost
18/02/07 15:54:15 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59713.
18/02/07 15:54:15 INFO NettyBlockTransferService: Server created on 192.168.154.134:59713
18/02/07 15:54:15 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/02/07 15:54:15 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.154.134:59713 with 348.0 MB RAM, BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:15 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.154.134, 59713, None)
18/02/07 15:54:17 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 193.9 KB, free 347.8 MB)
18/02/07 15:54:17 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 347.8 MB)
18/02/07 15:54:17 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.154.134:59713 (size: 22.9 KB, free: 348.0 MB)
18/02/07 15:54:17 INFO SparkContext: Created broadcast 0 from textFile at SparkRDD.scala:24
18/02/07 15:54:18 INFO FileInputFormat: Total input paths to process : 2
18/02/07 15:54:18 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/02/07 15:54:18 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/02/07 15:54:18 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/02/07 15:54:18 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/02/07 15:54:18 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/02/07 15:54:18 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:18 INFO SparkContext: Starting job: saveAsTextFile at SparkRDD.scala:42
18/02/07 15:54:18 INFO DAGScheduler: Registering RDD 3 (map at SparkRDD.scala:28)
18/02/07 15:54:18 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkRDD.scala:42) with 2 output partitions
18/02/07 15:54:18 INFO DAGScheduler: Final stage: ResultStage 1 (saveAsTextFile at SparkRDD.scala:42)
18/02/07 15:54:18 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/02/07 15:54:18 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/02/07 15:54:18 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkRDD.scala:28), which has no missing parents
18/02/07 15:54:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.4 KB, free 347.8 MB)
18/02/07 15:54:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.5 KB, free 347.8 MB)
18/02/07 15:54:18 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.154.134:59713 (size: 2.5 KB, free: 348.0 MB)
18/02/07 15:54:18 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
18/02/07 15:54:18 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at SparkRDD.scala:28)
18/02/07 15:54:18 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/02/07 15:54:18 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5984 bytes)
18/02/07 15:54:18 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/02/07 15:54:18 INFO HadoopRDD: Input split: file:/home/hadoop/test/scala/information:0+156
18/02/07 15:54:19 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1748 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5986 bytes)
18/02/07 15:54:19 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/02/07 15:54:19 INFO HadoopRDD: Input split: file:/home/hadoop/test/scala/information02:0+156
18/02/07 15:54:19 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 527 ms on localhost (executor driver) (1/2)
18/02/07 15:54:19 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1661 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 137 ms on localhost (executor driver) (2/2)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/02/07 15:54:19 INFO DAGScheduler: ShuffleMapStage 0 (map at SparkRDD.scala:28) finished in 0.649 s
18/02/07 15:54:19 INFO DAGScheduler: looking for newly runnable stages
18/02/07 15:54:19 INFO DAGScheduler: running: Set()
18/02/07 15:54:19 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/02/07 15:54:19 INFO DAGScheduler: failed: Set()
18/02/07 15:54:19 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at SparkRDD.scala:42), which has no missing parents
18/02/07 15:54:19 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 72.4 KB, free 347.7 MB)
18/02/07 15:54:19 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.1 KB, free 347.7 MB)
18/02/07 15:54:19 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.154.134:59713 (size: 26.1 KB, free: 347.9 MB)
18/02/07 15:54:19 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:996
18/02/07 15:54:19 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (MapPartitionsRDD[8] at saveAsTextFile at SparkRDD.scala:42)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/02/07 15:54:19 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 5757 bytes)
18/02/07 15:54:19 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 11 ms
18/02/07 15:54:19 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:19 INFO FileOutputCommitter: Saved output of task 'attempt_20180207155418_0001_m_000000_2' to file:/home/hadoop/test/scala/output/_temporary/0/task_20180207155418_0001_m_000000
18/02/07 15:54:19 INFO SparkHadoopMapRedUtil: attempt_20180207155418_0001_m_000000_2: Committed
18/02/07 15:54:19 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1977 bytes result sent to driver
18/02/07 15:54:19 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 5757 bytes)
18/02/07 15:54:19 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/02/07 15:54:19 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 298 ms on localhost (executor driver) (1/2)
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/02/07 15:54:19 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
18/02/07 15:54:19 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/02/07 15:54:19 INFO FileOutputCommitter: Saved output of task 'attempt_20180207155418_0001_m_000001_3' to file:/home/hadoop/test/scala/output/_temporary/0/task_20180207155418_0001_m_000001
18/02/07 15:54:19 INFO SparkHadoopMapRedUtil: attempt_20180207155418_0001_m_000001_3: Committed
18/02/07 15:54:19 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1890 bytes result sent to driver
18/02/07 15:54:19 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at SparkRDD.scala:42) finished in 0.437 s
18/02/07 15:54:19 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 143 ms on localhost (executor driver) (2/2)
18/02/07 15:54:19 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
18/02/07 15:54:19 INFO DAGScheduler: Job 0 finished: saveAsTextFile at SparkRDD.scala:42, took 1.566889 s
18/02/07 15:54:19 INFO SparkUI: Stopped Spark web UI at http://192.168.154.134:4040
18/02/07 15:54:19 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/02/07 15:54:19 INFO MemoryStore: MemoryStore cleared
18/02/07 15:54:19 INFO BlockManager: BlockManager stopped
18/02/07 15:54:19 INFO BlockManagerMaster: BlockManagerMaster stopped
18/02/07 15:54:19 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/02/07 15:54:19 INFO SparkContext: Successfully stopped SparkContext
18/02/07 15:54:19 INFO ShutdownHookManager: Shutdown hook called
18/02/07 15:54:19 INFO ShutdownHookManager: Deleting directory /tmp/spark-1ce79d12-6eb7-403c-b9e8-3d836d821c90

Console界面显示的信息:

6、说明

  如果master主机设置为local则表示本地运行,本地运行时Spark会启动一个集群模拟器来运行Job作业,本地运行只能在Eclipse、或直接使用java命令等运行它,不能使用spark-submit来提交运行,因为该命令会装载Spark环境配置并连接到Spark集群,对于本地模式下是不需要连接集群的,本地模式仅仅适合于在本地开发环境测试代码。

使用scala命令运行本地模式:

Scala使用命令行方式运行jar包还存在一些bug(不能使用java命令来运行scala开发的jar包),主要体现在无法协调处理类路径classpath的导入问题