1. Download the latest Scala IDE for Eclipse (choose the Windows 64-bit build) from http://scala-ide.org/download/sdk.html.
After downloading, unpack it to drive D:, launch Eclipse, and choose a workspace.
Then create a test project named ScalaDev. Right-click the project and choose Properties; in the dialog select Scala Compiler, check Use Project Settings on the right-hand tab, pick a Scala Installation, and click OK to save the configuration.
2. Add the Spark 1.6.0 jar dependency, spark-assembly-1.6.0-hadoop2.6.0.jar, to the project.
spark-assembly-1.6.0-hadoop2.6.0.jar is located under lib in the spark-1.6.0-bin-hadoop2.6.tgz distribution.
Right-click the ScalaDev project and choose Build Path -> Configure Build Path to add the jar.
Note: if you set Scala Installation to Latest 2.11 bundle (dynamic), the project will report the error below: a red cross appears on the ScalaDev project, and the Problems view shows that the cause is a mismatch between the project's Scala compiler version and the Scala version Spark was built with.
More than one scala library found in the build path (D:/eclipse/plugins/org.scala-lang.scala-library_2.11.7.v20150622-112736-1fbce4612c.jar, F:/IMF/Big_Data_Software/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar).At least one has an incompatible version. Please update the project build path so it contains only one compatible scala library.
Fix: right-click the Scala Library Container -> Properties, choose Latest 2.10 bundle (dynamic) in the dialog, and save.
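To confirm that the mismatch is resolved, a minimal sketch such as the following can be run as a Scala application once the build path is fixed (the object name VersionCheck is purely illustrative; org.apache.spark.SPARK_VERSION is assumed to be available as a constant from the assembly jar):
- package com.imf.spark
- // Prints the Scala version the project runs against and the Spark version on the classpath.
- // The Scala version should read 2.10.x to match spark-assembly-1.6.0-hadoop2.6.0.jar.
- object VersionCheck {
-   def main(args: Array[String]): Unit = {
-     println("Scala: " + scala.util.Properties.versionString)
-     println("Spark: " + org.apache.spark.SPARK_VERSION)
-   }
- }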
3. Create the Spark package under src and create the entry-point object.
Right-click the project and choose New -> Package to create the com.imf.spark package;
then right-click the com.imf.spark package and create a Scala Object named WordCount.
Before running the test, copy the README.md file from the spark-1.6.0-bin-hadoop2.6 directory into D://testspark//. The code is as follows:
- package com.imf.spark
- import org.apache.spark.SparkConf
- import org.apache.spark.SparkContext
- /**
- * A Spark WordCount program written in Scala for local testing
- */
- object WordCount {
- def main(args: Array[String]): Unit = {
- /**
- * 1. Create the Spark configuration object SparkConf and set the runtime configuration of the Spark program.
- * For example, setMaster specifies the URL of the Spark cluster's Master to connect to; setting it to local
- * runs the program locally, which is especially suitable when hardware resources are very limited.
- */
- // Create the SparkConf object
- val conf = new SparkConf()
- // Set the application name; it is shown in the monitoring UI while the program runs
- conf.setAppName("My First Spark App!")
- // Use local so the program runs on this machine, with no Spark cluster installation required
- conf.setMaster("local")
- /**
- * 2. Create the SparkContext object.
- * SparkContext is the single entry point to all Spark functionality; whether you use Scala, Java, Python, or R, there must be a SparkContext.
- * Its core role is to initialize the components a Spark application needs to run, including the DAGScheduler, TaskScheduler, and SchedulerBackend,
- * and it is also responsible for registering the program with the Master;
- * SparkContext is the single most important object in the entire application.
- */
- // Create the SparkContext, passing in the SparkConf instance to customize Spark's runtime parameters and configuration
- val sc = new SparkContext(conf)
- /**
- * 3. Use the SparkContext to create an RDD from the actual data source (HDFS, HBase, local FS, DB, S3, etc.).
- * There are basically three ways to create an RDD: from an external data source (e.g. HDFS), from a Scala collection, or by transforming another RDD.
- * The data is divided by the RDD into a series of partitions, and the data assigned to each partition is processed by one task.
- */
- // Read the local file, using a single partition
- val lines = sc.textFile("D://testspark//README.md",1)
- /**
- * 4. Apply Transformation-level operations to the initial RDD, such as the higher-order functions map and filter, to carry out the actual computation.
- * 4.1. Split each line into individual words.
- */
- // Split each line and flatten the per-line results into one large collection
- val words = lines.flatMap { line => line.split(" ") }
- /**
- * 4.2. On top of the split, count each word instance as 1, i.e. word => (word, 1).
- */
- val pairs = words.map{word =>(word,1)}
- /**
- * 4.3. On top of the per-instance counts, compute the total number of occurrences of each word in the file.
- */
- // Accumulate the values for identical keys (reducing locally on the map side as well as at the reducer level)
- val wordCounts = pairs.reduceByKey(_+_)
- // Print the output
- wordCounts.foreach(pair => println(pair._1+":"+pair._2))
- sc.stop()
- }
- }
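The foreach call at the end prints the (word, count) pairs in whatever order the shuffle delivers them, which is why the output below is unordered. If sorted output were preferred, a minimal variant (a sketch only, not part of the run whose output follows) would collect the counts back to the driver and sort them there:
- // Collect the counts to the driver and print them by descending frequency;
- // only appropriate when the result set is small enough to fit in driver memory.
- wordCounts.collect().sortBy { case (_, count) => -count }
-   .foreach { case (word, count) => println(word + ":" + count) }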
Run results:
- Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
- 16/01/26 08:23:37 INFO SparkContext: Running Spark version 1.6.0
- 16/01/26 08:23:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
- 16/01/26 08:23:42 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
- java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
- at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:355)
- at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:370)
- at org.apache.hadoop.util.Shell.<clinit>(Shell.java:363)
- at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
- at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:104)
- at org.apache.hadoop.security.Groups.<init>(Groups.java:86)
- at org.apache.hadoop.security.Groups.<init>(Groups.java:66)
- at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:280)
- at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:271)
- at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:248)
- at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:763)
- at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:748)
- at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:621)
- at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
- at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2136)
- at scala.Option.getOrElse(Option.scala:120)
- at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2136)
- at org.apache.spark.SparkContext.<init>(SparkContext.scala:322)
- at com.dt.spark.WordCount$.main(WordCount.scala:29)
- at com.dt.spark.WordCount.main(WordCount.scala)
- 16/01/26 08:23:42 INFO SecurityManager: Changing view acls to: vivi
- 16/01/26 08:23:42 INFO SecurityManager: Changing modify acls to: vivi
- 16/01/26 08:23:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vivi); users with modify permissions: Set(vivi)
- 16/01/26 08:23:43 INFO Utils: Successfully started service 'sparkDriver' on port 54663.
- 16/01/26 08:23:43 INFO Slf4jLogger: Slf4jLogger started
- 16/01/26 08:23:43 INFO Remoting: Starting remoting
- 16/01/26 08:23:43 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.100.102:54676]
- 16/01/26 08:23:43 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54676.
- 16/01/26 08:23:43 INFO SparkEnv: Registering MapOutputTracker
- 16/01/26 08:23:43 INFO SparkEnv: Registering BlockManagerMaster
- 16/01/26 08:23:43 INFO DiskBlockManager: Created local directory at C:\Users\vivi\AppData\Local\Temp\blockmgr-5f59f3c2-3b87-49c5-a1ae-e21847aac44b
- 16/01/26 08:23:43 INFO MemoryStore: MemoryStore started with capacity 1813.7 MB
- 16/01/26 08:23:43 INFO SparkEnv: Registering OutputCommitCoordinator
- 16/01/26 08:23:43 INFO Utils: Successfully started service 'SparkUI' on port 4040.
- 16/01/26 08:23:43 INFO SparkUI: Started SparkUI at http://192.168.100.102:4040
- 16/01/26 08:23:43 INFO Executor: Starting executor ID driver on host localhost
- 16/01/26 08:23:43 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 54683.
- 16/01/26 08:23:43 INFO NettyBlockTransferService: Server created on 54683
- 16/01/26 08:23:43 INFO BlockManagerMaster: Trying to register BlockManager
- 16/01/26 08:23:43 INFO BlockManagerMasterEndpoint: Registering block manager localhost:54683 with 1813.7 MB RAM, BlockManagerId(driver, localhost, 54683)
- 16/01/26 08:23:43 INFO BlockManagerMaster: Registered BlockManager
- 16/01/26 08:23:46 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 153.6 KB, free 153.6 KB)
- 16/01/26 08:23:46 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 13.9 KB, free 167.6 KB)
- 16/01/26 08:23:46 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:54683 (size: 13.9 KB, free: 1813.7 MB)
- 16/01/26 08:23:46 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:37
- 16/01/26 08:23:47 WARN : Your hostname, vivi-PC resolves to a loopback/non-reachable address: fe80:0:0:0:5937:95c4:86da:2f43%30, but we couldn't find any external IP address!
- 16/01/26 08:23:48 INFO FileInputFormat: Total input paths to process : 1
- 16/01/26 08:23:48 INFO SparkContext: Starting job: foreach at WordCount.scala:56
- 16/01/26 08:23:48 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:48)
- 16/01/26 08:23:48 INFO DAGScheduler: Got job 0 (foreach at WordCount.scala:56) with 1 output partitions
- 16/01/26 08:23:48 INFO DAGScheduler: Final stage: ResultStage 1 (foreach at WordCount.scala:56)
- 16/01/26 08:23:48 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
- 16/01/26 08:23:48 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
- 16/01/26 08:23:48 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:48), which has no missing parents
- 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.0 KB, free 171.6 KB)
- 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 173.9 KB)
- 16/01/26 08:23:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:54683 (size: 2.3 KB, free: 1813.7 MB)
- 16/01/26 08:23:48 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
- 16/01/26 08:23:48 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:48)
- 16/01/26 08:23:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
- 16/01/26 08:23:48 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2119 bytes)
- 16/01/26 08:23:48 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
- 16/01/26 08:23:48 INFO HadoopRDD: Input split: file:/D:/testspark/README.md:0+3359
- 16/01/26 08:23:48 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
- 16/01/26 08:23:48 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
- 16/01/26 08:23:48 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
- 16/01/26 08:23:48 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
- 16/01/26 08:23:48 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
- 16/01/26 08:23:48 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
- 16/01/26 08:23:48 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 177 ms on localhost (1/1)
- 16/01/26 08:23:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
- 16/01/26 08:23:48 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:48) finished in 0.186 s
- 16/01/26 08:23:48 INFO DAGScheduler: looking for newly runnable stages
- 16/01/26 08:23:48 INFO DAGScheduler: running: Set()
- 16/01/26 08:23:48 INFO DAGScheduler: waiting: Set(ResultStage 1)
- 16/01/26 08:23:48 INFO DAGScheduler: failed: Set()
- 16/01/26 08:23:48 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54), which has no missing parents
- 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.5 KB, free 176.4 KB)
- 16/01/26 08:23:48 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1581.0 B, free 177.9 KB)
- 16/01/26 08:23:48 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:54683 (size: 1581.0 B, free: 1813.7 MB)
- 16/01/26 08:23:48 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
- 16/01/26 08:23:48 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at WordCount.scala:54)
- 16/01/26 08:23:48 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
- 16/01/26 08:23:48 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
- 16/01/26 08:23:48 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
- 16/01/26 08:23:48 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
- 16/01/26 08:23:48 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 2 ms
- package:1
- For:2
- Programs:1
- processing.:1
- Because:1
- The:1
- cluster.:1
- its:1
- [run:1
- APIs:1
- have:1
- Try:1
- computation:1
- through:1
- several:1
- This:2
- graph:1
- Hive:2
- storage:1
- ["Specifying:1
- To:2
- page](http://spark.apache.org/documentation.html):1
- Once:1
- "yarn":1
- prefer:1
- SparkPi:2
- engine:1
- version:1
- file:1
- documentation,:1
- processing,:1
- the:21
- are:1
- systems.:1
- params:1
- not:1
- different:1
- refer:2
- Interactive:2
- R,:1
- given.:1
- if:4
- build:3
- when:1
- be:2
- Tests:1
- Apache:1
- ./bin/run-example:2
- programs,:1
- including:3
- Spark.:1
- package.:1
- 1000).count():1
- Versions:1
- HDFS:1
- Data.:1
- >>>:1
- programming:1
- Testing:1
- module,:1
- Streaming:1
- environment:1
- run::1
- clean:1
- 1000::2
- rich:1
- GraphX:1
- Please:3
- is:6
- run:7
- URL,:1
- threads.:1
- same:1
- MASTER=spark://host:7077:1
- on:5
- built:1
- against:1
- [Apache:1
- tests:2
- examples:2
- at:2
- optimized:1
- usage:1
- using:2
- graphs:1
- talk:1
- Shell:2
- class:2
- abbreviated:1
- directory.:1
- README:1
- computing:1
- overview:1
- `examples`:2
- example::1
- ##:8
- N:1
- set:2
- use:3
- Hadoop-supported:1
- tests](https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools).:1
- running:1
- find:1
- contains:1
- project:1
- Pi:1
- need:1
- or:3
- Big:1
- Java,:1
- high-level:1
- uses:1
- <class>:1
- Hadoop,:2
- available:1
- requires:1
- (You:1
- see:1
- Documentation:1
- of:5
- tools:1
- using::1
- cluster:2
- must:1
- supports:2
- built,:1
- system:1
- build/mvn:1
- Hadoop:3
- this:1
- Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version):1
- particular:2
- Python:2
- Spark:13
- general:2
- YARN,:1
- pre-built:1
- [Configuration:1
- locally:2
- library:1
- A:1
- locally.:1
- sc.parallelize(1:1
- only:1
- Configuration:1
- following:2
- basic:1
- #:1
- changed:1
- More:1
- which:2
- learning,:1
- first:1
- ./bin/pyspark:1
- also:4
- should:2
- for:11
- [params]`.:1
- documentation:3
- [project:2
- mesos://:1
- Maven](http://maven.apache.org/).:1
- setup:1
- <http://spark.apache.org/>:1
- latest:1
- your:1
- MASTER:1
- example:3
- scala>:1
- DataFrames,:1
- provides:1
- configure:1
- distributions.:1
- can:6
- About:1
- instructions.:1
- do:2
- easiest:1
- no:1
- how:2
- `./bin/run-example:1
- Note:1
- individual:1
- spark://:1
- It:2
- Scala:2
- Alternatively,:1
- an:3
- variable:1
- submit:1
- machine:1
- thread,:1
- them,:1
- detailed:2
- stream:1
- And:1
- distribution:1
- return:2
- Thriftserver:1
- ./bin/spark-shell:1
- "local":1
- start:1
- You:3
- Spark](#building-spark).:1
- one:2
- help:1
- with:3
- print:1
- Spark"](http://spark.apache.org/docs/latest/building-spark.html).:1
- data:1
- wiki](https://cwiki.apache.org/confluence/display/SPARK).:1
- in:5
- -DskipTests:1
- downloaded:1
- versions:1
- online:1
- Guide](http://spark.apache.org/docs/latest/configuration.html):1
- comes:1
- [building:1
- Python,:2
- Many:1
- building:2
- Running:1
- from:1
- way:1
- Online:1
- site,:1
- other:1
- Example:1
- analysis.:1
- sc.parallelize(range(1000)).count():1
- you:4
- runs.:1
- Building:1
- higher-level:1
- protocols:1
- guidance:2
- a:8
- guide,:1
- name:1
- fast:1
- SQL:2
- will:1
- instance::1
- to:14
- core:1
- :67
- web:1
- "local[N]":1
- programs:2
- package.):1
- that:2
- MLlib:1
- ["Building:1
- shell::2
- Scala,:1
- and:10
- command,:2
- ./dev/run-tests:1
- sample:1
- 16/01/26 08:23:48 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1165 bytes result sent to driver
- 16/01/26 08:23:48 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 61 ms on localhost (1/1)
- 16/01/26 08:23:48 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
- 16/01/26 08:23:48 INFO DAGScheduler: ResultStage 1 (foreach at WordCount.scala:56) finished in 0.061 s
- 16/01/26 08:23:48 INFO DAGScheduler: Job 0 finished: foreach at WordCount.scala:56, took 0.328012 s
- 16/01/26 08:23:48 INFO SparkUI: Stopped Spark web UI at http://192.168.100.102:4040
- 16/01/26 08:23:48 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
- 16/01/26 08:23:48 INFO MemoryStore: MemoryStore cleared
- 16/01/26 08:23:48 INFO BlockManager: BlockManager stopped
- 16/01/26 08:23:48 INFO BlockManagerMaster: BlockManagerMaster stopped
- 16/01/26 08:23:48 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
- 16/01/26 08:23:48 INFO SparkContext: Successfully stopped SparkContext
- 16/01/26 08:23:48 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
- 16/01/26 08:23:48 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
- 16/01/26 08:23:48 INFO ShutdownHookManager: Shutdown hook called
- 16/01/26 08:23:48 INFO ShutdownHookManager: Deleting directory C:\Users\vivi\AppData\Local\Temp\spark-56f9ed0a-5671-449a-955a-041c63569ff2
Note: the ERROR in the log above comes from Hadoop trying to locate its Windows support binary (winutils.exe), which cannot be found when running locally without a Hadoop installation; it does not affect this test.
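If you want to eliminate the winutils ERROR entirely, a common workaround on Windows (assuming you have downloaded a winutils.exe matching Hadoop 2.6 and placed it under a bin directory; the path D:\hadoop below is purely illustrative) is to point hadoop.home.dir at that directory before the SparkContext is created:
- // Hypothetical layout: D:\hadoop\bin\winutils.exe must exist.
- // Setting this system property (or the HADOOP_HOME environment variable) before
- // new SparkContext(conf) lets Hadoop's Shell utilities locate the Windows binaries.
- System.setProperty("hadoop.home.dir", "D:\\hadoop")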