Spark: starting Spark, starting spark-shell, and testing -- 7

Date: 2023-01-06 21:27:46
1. First start Hadoop (HDFS and YARN). For Spark development only HDFS is required:
cd /usr/local/hadoop-2.4.0
sbin/start-dfs.sh
sbin/start-yarn.sh
2. Start Spark:
cd /usr/local/spark-1.1.0-bin-hadoop2.4/sbin
./start-all.sh
3. Start spark-shell:
cd **bin
./spark-shell
To exit the shell:
exit

RDD: Resilient Distributed Dataset

scala> val rdd = sc.parallelize(List(1,2,3,4,5,6))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val mapRdd = rdd.map(2 * _)
mapRdd: org.apache.spark.rdd.RDD[Int] = MappedRDD[1] at map at <console>:14

scala> mapRdd.collect
15/02/05 11:52:01 INFO spark.SparkContext: Starting job: collect at <console>:17
15/02/05 11:52:01 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:17) with 1 output partitions (allowLocal=false)
15/02/05 11:52:01 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at <console>:17)
15/02/05 11:52:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/05 11:52:01 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/05 11:52:01 INFO scheduler.DAGScheduler: Submitting Stage 0 (MappedRDD[1] at map at <console>:14), which has no missing parents
15/02/05 11:52:02 INFO storage.MemoryStore: ensureFreeSpace(1656) called with curMem=0, maxMem=280248975
15/02/05 11:52:02 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1656.0 B, free 267.3 MB)
15/02/05 11:52:03 INFO storage.MemoryStore: ensureFreeSpace(1211) called with curMem=1656, maxMem=280248975
15/02/05 11:52:03 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1211.0 B, free 267.3 MB)
15/02/05 11:52:03 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:34086 (size: 1211.0 B, free: 267.3 MB)
15/02/05 11:52:03 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/05 11:52:03 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838
15/02/05 11:52:03 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[1] at map at <console>:14)
15/02/05 11:52:03 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/02/05 11:52:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1224 bytes)
15/02/05 11:52:03 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/05 11:52:03 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 618 bytes result sent to driver
15/02/05 11:52:03 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 405 ms on localhost (1/1)
15/02/05 11:52:03 INFO scheduler.DAGScheduler: Stage 0 (collect at <console>:17) finished in 0.506 s
15/02/05 11:52:04 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:17, took 2.923469 s
15/02/05 11:52:04 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
res0: Array[Int] = Array(2, 4, 6, 8, 10, 12)
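The `map` transformation above has the same semantics as `map` on an ordinary Scala collection. A minimal sketch, runnable without any Spark cluster:

```scala
// map over a plain Scala List -- same semantics as rdd.map, no cluster needed
object MapDemo extends App {
  val data = List(1, 2, 3, 4, 5, 6)
  val doubled = data.map(2 * _)
  println(doubled) // List(2, 4, 6, 8, 10, 12)
}
```

The difference is that the RDD version is lazy and distributed: nothing runs until an action such as `collect` is called.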

scala>
scala> mapRdd
res1: org.apache.spark.rdd.RDD[Int] = MappedRDD[1] at map at <console>:14

//*********************************
scala> val filterRdd = mapRdd.filter(_ > 5)
filterRdd: org.apache.spark.rdd.RDD[Int] = FilteredRDD[2] at filter at <console>:16

scala> filterRdd.collect
15/02/05 11:57:17 INFO spark.SparkContext: Starting job: collect at <console>:19
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Got job 1 (collect at <console>:19) with 1 output partitions (allowLocal=false)
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Final stage: Stage 1(collect at <console>:19)
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Submitting Stage 1 (FilteredRDD[2] at filter at <console>:16), which has no missing parents
15/02/05 11:57:17 INFO storage.MemoryStore: ensureFreeSpace(1856) called with curMem=2867, maxMem=280248975
15/02/05 11:57:17 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1856.0 B, free 267.3 MB)
15/02/05 11:57:17 INFO storage.MemoryStore: ensureFreeSpace(1342) called with curMem=4723, maxMem=280248975
15/02/05 11:57:17 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1342.0 B, free 267.3 MB)
15/02/05 11:57:17 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:34086 (size: 1342.0 B, free: 267.3 MB)
15/02/05 11:57:17 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/05 11:57:17 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (FilteredRDD[2] at filter at <console>:16)
15/02/05 11:57:17 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/02/05 11:57:17 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1224 bytes)
15/02/05 11:57:17 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
15/02/05 11:57:17 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 610 bytes result sent to driver
15/02/05 11:57:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 83 ms on localhost (1/1)
15/02/05 11:57:17 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Stage 1 (collect at <console>:19) finished in 0.086 s
15/02/05 11:57:17 INFO scheduler.DAGScheduler: Job 1 finished: collect at <console>:19, took 0.161116 s
res2: Array[Int] = Array(6, 8, 10, 12)

scala>
//************************************
Equivalent to: scala> val filterRdd = sc.parallelize(List(1,2,3,4,5,6)).map(2 * _).filter(_ > 5).collect
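The same map-then-filter chain can be checked on a plain Scala List, which produces the identical result:

```scala
// map + filter chained on a plain collection, mirroring the RDD pipeline
object ChainDemo extends App {
  val chained = List(1, 2, 3, 4, 5, 6).map(2 * _).filter(_ > 5)
  println(chained) // List(6, 8, 10, 12)
}
```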
//*********************************
// textFile loads a text file from HDFS or the local file system; the path below is a placeholder
scala> val rdd = sc.textFile("**wenjianlujing")
scala> rdd.count   // number of lines in the file
scala> rdd.cache   // mark the RDD to be kept in memory after first computation
//*********************************
scala> val wordcount = rdd.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
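The word-count pipeline can be sketched on plain collections; `groupBy` plus a sum stands in for `reduceByKey`, and the input lines are a hypothetical sample:

```scala
// word count on plain Scala collections; groupBy + sum plays the role of reduceByKey
object WordCountDemo extends App {
  val lines = List("hello world", "hello spark") // hypothetical sample input
  val counts = lines
    .flatMap(_.split(" "))          // split each line into words
    .map((_, 1))                    // pair each word with a count of 1
    .groupBy(_._1)                  // group pairs by word
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // sum the counts
  println(counts)
}
```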
//*********************************
scala> val rdd1 = sc.parallelize(List(("a",1),("a",2)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> val rdd2 = sc.parallelize(List(("b",1),("b",2)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> rdd1 union rdd2
res0: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[2] at union at <console>:17

scala> val result = rdd1 union rdd2
result: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[3] at union at <console>:16

scala> result.collect
15/02/05 12:36:14 INFO spark.SparkContext: Starting job: collect at <console>:19
15/02/05 12:36:14 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:19) with 2 output partitions (allowLocal=false)
15/02/05 12:36:14 INFO scheduler.DAGScheduler: Final stage: Stage 0(collect at <console>:19)
15/02/05 12:36:14 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/05 12:36:14 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/05 12:36:14 INFO scheduler.DAGScheduler: Submitting Stage 0 (UnionRDD[3] at union at <console>:16), which has no missing parents
15/02/05 12:36:15 INFO storage.MemoryStore: ensureFreeSpace(1792) called with curMem=0, maxMem=280248975
15/02/05 12:36:15 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1792.0 B, free 267.3 MB)
15/02/05 12:36:16 INFO storage.MemoryStore: ensureFreeSpace(1324) called with curMem=1792, maxMem=280248975
15/02/05 12:36:16 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1324.0 B, free 267.3 MB)
15/02/05 12:36:16 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:51633 (size: 1324.0 B, free: 267.3 MB)
15/02/05 12:36:16 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/05 12:36:16 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:838
15/02/05 12:36:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (UnionRDD[3] at union at <console>:16)
15/02/05 12:36:16 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/02/05 12:36:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1534 bytes)
15/02/05 12:36:16 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/02/05 12:36:17 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 774 bytes result sent to driver
15/02/05 12:36:17 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1534 bytes)
15/02/05 12:36:17 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/05 12:36:17 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 774 bytes result sent to driver
15/02/05 12:36:17 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 579 ms on localhost (1/2)
15/02/05 12:36:17 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 158 ms on localhost (2/2)
15/02/05 12:36:17 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/05 12:36:17 INFO scheduler.DAGScheduler: Stage 0 (collect at <console>:19) finished in 0.830 s
15/02/05 12:36:17 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:19, took 3.478383 s
res1: Array[(String, Int)] = Array((a,1), (a,2), (b,1), (b,2))
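`union` simply concatenates the two datasets without deduplicating, just like `++` on Scala collections:

```scala
// union = concatenation, duplicates are kept (no distinct applied)
object UnionDemo extends App {
  val left  = List(("a", 1), ("a", 2))
  val right = List(("b", 1), ("b", 2))
  val unioned = left ++ right
  println(unioned) // List((a,1), (a,2), (b,1), (b,2))
}
```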

scala>
//*********************************
scala> val rdd1 = sc.parallelize(List(("a",1),("a",2),("b",3),("b",4)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[4] at parallelize at <console>:12

scala> val rdd2 = sc.parallelize(List(("a",5),("a",6),("b",7),("b",8)))
rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[5] at parallelize at <console>:12

scala> rdd1 join rdd2
res2: org.apache.spark.rdd.RDD[(String, (Int, Int))] = FlatMappedValuesRDD[8] at join at <console>:17

scala> res2.collect
15/02/05 12:43:08 INFO spark.SparkContext: Starting job: collect at <console>:19
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Registering RDD 4 (parallelize at <console>:12)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Registering RDD 5 (parallelize at <console>:12)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Got job 1 (collect at <console>:19) with 1 output partitions (allowLocal=false)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Final stage: Stage 3(collect at <console>:19)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Parents of final stage: List(Stage 1, Stage 2)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Missing parents: List(Stage 1, Stage 2)
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Submitting Stage 1 (ParallelCollectionRDD[4] at parallelize at <console>:12), which has no missing parents
15/02/05 12:43:08 INFO storage.MemoryStore: ensureFreeSpace(1488) called with curMem=3116, maxMem=280248975
15/02/05 12:43:08 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 1488.0 B, free 267.3 MB)
15/02/05 12:43:08 INFO storage.MemoryStore: ensureFreeSpace(1077) called with curMem=4604, maxMem=280248975
15/02/05 12:43:08 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1077.0 B, free 267.3 MB)
15/02/05 12:43:08 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:51633 (size: 1077.0 B, free: 267.3 MB)
15/02/05 12:43:08 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/05 12:43:08 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 1 (ParallelCollectionRDD[4] at parallelize at <console>:12)
15/02/05 12:43:08 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/02/05 12:43:08 INFO scheduler.DAGScheduler: Submitting Stage 2 (ParallelCollectionRDD[5] at parallelize at <console>:12), which has no missing parents
15/02/05 12:43:08 INFO storage.MemoryStore: ensureFreeSpace(1488) called with curMem=5681, maxMem=280248975
15/02/05 12:43:08 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1488.0 B, free 267.3 MB)
15/02/05 12:43:08 INFO storage.MemoryStore: ensureFreeSpace(1079) called with curMem=7169, maxMem=280248975
15/02/05 12:43:08 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1079.0 B, free 267.3 MB)
15/02/05 12:43:08 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51633 (size: 1079.0 B, free: 267.3 MB)
15/02/05 12:43:09 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/05 12:43:09 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 2 (ParallelCollectionRDD[5] at parallelize at <console>:12)
15/02/05 12:43:09 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/02/05 12:43:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1455 bytes)
15/02/05 12:43:09 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 2)
15/02/05 12:43:09 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 2). 837 bytes result sent to driver
15/02/05 12:43:09 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 3, localhost, PROCESS_LOCAL, 1455 bytes)
15/02/05 12:43:09 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 3)
15/02/05 12:43:09 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 484 ms on localhost (1/1)
15/02/05 12:43:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Stage 1 (parallelize at <console>:12) finished in 0.602 s
15/02/05 12:43:09 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/02/05 12:43:09 INFO scheduler.DAGScheduler: running: Set(Stage 2)
15/02/05 12:43:09 INFO scheduler.DAGScheduler: waiting: Set(Stage 3)
15/02/05 12:43:09 INFO scheduler.DAGScheduler: failed: Set()
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Missing parents for Stage 3: List(Stage 2)
15/02/05 12:43:09 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 3). 837 bytes result sent to driver
15/02/05 12:43:09 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 3) in 166 ms on localhost (1/1)
15/02/05 12:43:09 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Stage 2 (parallelize at <console>:12) finished in 0.620 s
15/02/05 12:43:09 INFO scheduler.DAGScheduler: looking for newly runnable stages
15/02/05 12:43:09 INFO scheduler.DAGScheduler: running: Set()
15/02/05 12:43:09 INFO scheduler.DAGScheduler: waiting: Set(Stage 3)
15/02/05 12:43:09 INFO scheduler.DAGScheduler: failed: Set()
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Missing parents for Stage 3: List()
15/02/05 12:43:09 INFO scheduler.DAGScheduler: Submitting Stage 3 (FlatMappedValuesRDD[8] at join at <console>:17), which is now runnable
15/02/05 12:43:09 INFO storage.MemoryStore: ensureFreeSpace(2488) called with curMem=8248, maxMem=280248975
15/02/05 12:43:09 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.4 KB, free 267.3 MB)
15/02/05 12:43:10 INFO storage.MemoryStore: ensureFreeSpace(1676) called with curMem=10736, maxMem=280248975
15/02/05 12:43:10 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1676.0 B, free 267.3 MB)
15/02/05 12:43:10 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:51633 (size: 1676.0 B, free: 267.3 MB)
15/02/05 12:43:10 INFO storage.BlockManagerMaster: Updated info of block broadcast_3_piece0
15/02/05 12:43:10 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:838
15/02/05 12:43:10 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 3 (FlatMappedValuesRDD[8] at join at <console>:17)
15/02/05 12:43:10 INFO scheduler.TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/02/05 12:43:10 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 3.0 (TID 4, localhost, PROCESS_LOCAL, 1902 bytes)
15/02/05 12:43:10 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 4)
15/02/05 12:43:11 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/02/05 12:43:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 321 ms
15/02/05 12:43:12 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/02/05 12:43:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/02/05 12:43:12 INFO storage.BlockManager: Removing broadcast 2
15/02/05 12:43:12 INFO storage.BlockManager: Removing block broadcast_2
15/02/05 12:43:12 INFO storage.MemoryStore: Block broadcast_2 of size 1488 dropped from memory (free 280238051)
15/02/05 12:43:12 INFO storage.BlockManager: Removing block broadcast_2_piece0
15/02/05 12:43:12 INFO storage.MemoryStore: Block broadcast_2_piece0 of size 1079 dropped from memory (free 280239130)
15/02/05 12:43:13 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:51633 in memory (size: 1079.0 B, free: 267.3 MB)
15/02/05 12:43:13 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/05 12:43:13 INFO spark.ContextCleaner: Cleaned broadcast 2
15/02/05 12:43:13 INFO storage.BlockManager: Removing broadcast 1
15/02/05 12:43:13 INFO storage.BlockManager: Removing block broadcast_1_piece0
15/02/05 12:43:13 INFO storage.MemoryStore: Block broadcast_1_piece0 of size 1077 dropped from memory (free 280240207)
15/02/05 12:43:13 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:51633 in memory (size: 1077.0 B, free: 267.3 MB)
15/02/05 12:43:13 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/02/05 12:43:13 INFO storage.BlockManager: Removing block broadcast_1
15/02/05 12:43:13 INFO storage.MemoryStore: Block broadcast_1 of size 1488 dropped from memory (free 280241695)
15/02/05 12:43:13 INFO spark.ContextCleaner: Cleaned broadcast 1
15/02/05 12:43:13 INFO storage.BlockManager: Removing broadcast 0
15/02/05 12:43:13 INFO storage.BlockManager: Removing block broadcast_0
15/02/05 12:43:13 INFO storage.MemoryStore: Block broadcast_0 of size 1792 dropped from memory (free 280243487)
15/02/05 12:43:13 INFO storage.BlockManager: Removing block broadcast_0_piece0
15/02/05 12:43:13 INFO storage.MemoryStore: Block broadcast_0_piece0 of size 1324 dropped from memory (free 280244811)
15/02/05 12:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on localhost:51633 in memory (size: 1324.0 B, free: 267.3 MB)
15/02/05 12:43:13 INFO storage.BlockManagerMaster: Updated info of block broadcast_0_piece0
15/02/05 12:43:13 INFO spark.ContextCleaner: Cleaned broadcast 0
15/02/05 12:43:13 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 4). 1203 bytes result sent to driver
15/02/05 12:43:13 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 4) in 3187 ms on localhost (1/1)
15/02/05 12:43:13 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/02/05 12:43:13 INFO scheduler.DAGScheduler: Stage 3 (collect at <console>:19) finished in 3.191 s
15/02/05 12:43:13 INFO scheduler.DAGScheduler: Job 1 finished: collect at <console>:19, took 4.484745 s
res3: Array[(String, (Int, Int))] = Array((a,(1,5)), (a,(1,6)), (a,(2,5)), (a,(2,6)), (b,(3,7)), (b,(3,8)), (b,(4,7)), (b,(4,8)))
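The join result above pairs every left value with every right value sharing the same key. A minimal nested-loop sketch of that semantics (the helper `joinSeq` is illustrative, not Spark's implementation):

```scala
// nested-loop join: every left pair matches every right pair with the same key
object JoinDemo extends App {
  def joinSeq[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  val joined = joinSeq(
    List(("a", 1), ("a", 2), ("b", 3), ("b", 4)),
    List(("a", 5), ("a", 6), ("b", 7), ("b", 8)))
  println(joined)
}
```

For keys with m and n matching values, the join emits m * n output pairs, which is why joins on skewed keys can blow up in size.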

scala>
//******************************
scala> val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:12

scala> rdd.reduce(_+_)
15/02/05 12:46:23 INFO spark.SparkContext: Starting job: reduce at <console>:15
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Got job 2 (reduce at <console>:15) with 1 output partitions (allowLocal=false)
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Final stage: Stage 4(reduce at <console>:15)
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Submitting Stage 4 (ParallelCollectionRDD[9] at parallelize at <console>:12), which has no missing parents
15/02/05 12:46:23 INFO storage.MemoryStore: ensureFreeSpace(1184) called with curMem=4164, maxMem=280248975
15/02/05 12:46:23 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 1184.0 B, free 267.3 MB)
15/02/05 12:46:23 INFO storage.MemoryStore: ensureFreeSpace(912) called with curMem=5348, maxMem=280248975
15/02/05 12:46:23 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 912.0 B, free 267.3 MB)
15/02/05 12:46:23 INFO storage.BlockManagerInfo: Added broadcast_4_piece0 in memory on localhost:51633 (size: 912.0 B, free: 267.3 MB)
15/02/05 12:46:23 INFO storage.BlockManagerMaster: Updated info of block broadcast_4_piece0
15/02/05 12:46:23 INFO spark.SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:838
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 4 (ParallelCollectionRDD[9] at parallelize at <console>:12)
15/02/05 12:46:23 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
15/02/05 12:46:23 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 5, localhost, PROCESS_LOCAL, 1220 bytes)
15/02/05 12:46:23 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 5)
15/02/05 12:46:23 INFO executor.Executor: Finished task 0.0 in stage 4.0 (TID 5). 727 bytes result sent to driver
15/02/05 12:46:23 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 36 ms on localhost (1/1)
15/02/05 12:46:23 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 4.0, whose tasks have all completed, from pool
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Stage 4 (reduce at <console>:15) finished in 0.038 s
15/02/05 12:46:23 INFO scheduler.DAGScheduler: Job 2 finished: reduce at <console>:15, took 0.110935 s
res4: Int = 15
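`reduce(_+_)` folds the elements pairwise with the given associative function, the same as on a plain List:

```scala
// reduce with an associative function; for RDDs the function must be associative
// and commutative because partitions are combined in arbitrary order
object ReduceDemo extends App {
  val sum = List(1, 2, 3, 4, 5).reduce(_ + _)
  println(sum) // 15
}
```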

scala>
//*********************************
scala> val rdd1 = sc.parallelize(List(("a",1),("a",2),("b",3),("b",4)))
rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[10] at parallelize at <console>:12

scala> rdd1.lookup("a")
15/02/05 12:49:40 INFO spark.SparkContext: Starting job: lookup at <console>:15
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Got job 3 (lookup at <console>:15) with 1 output partitions (allowLocal=false)
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Final stage: Stage 5(lookup at <console>:15)
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Missing parents: List()
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Submitting Stage 5 (MappedRDD[12] at lookup at <console>:15), which has no missing parents
15/02/05 12:49:40 INFO storage.MemoryStore: ensureFreeSpace(1984) called with curMem=6260, maxMem=280248975
15/02/05 12:49:40 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 1984.0 B, free 267.3 MB)
15/02/05 12:49:40 INFO storage.MemoryStore: ensureFreeSpace(1443) called with curMem=8244, maxMem=280248975
15/02/05 12:49:40 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1443.0 B, free 267.3 MB)
15/02/05 12:49:40 INFO storage.BlockManagerInfo: Added broadcast_5_piece0 in memory on localhost:51633 (size: 1443.0 B, free: 267.3 MB)
15/02/05 12:49:40 INFO storage.BlockManagerMaster: Updated info of block broadcast_5_piece0
15/02/05 12:49:40 INFO spark.SparkContext: Created broadcast 5 from broadcast at DAGScheduler.scala:838
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from Stage 5 (MappedRDD[12] at lookup at <console>:15)
15/02/05 12:49:40 INFO scheduler.TaskSchedulerImpl: Adding task set 5.0 with 1 tasks
15/02/05 12:49:40 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 5.0 (TID 6, localhost, PROCESS_LOCAL, 1466 bytes)
15/02/05 12:49:40 INFO executor.Executor: Running task 0.0 in stage 5.0 (TID 6)
15/02/05 12:49:40 INFO executor.Executor: Finished task 0.0 in stage 5.0 (TID 6). 602 bytes result sent to driver
15/02/05 12:49:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 5.0 (TID 6) in 51 ms on localhost (1/1)
15/02/05 12:49:40 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Stage 5 (lookup at <console>:15) finished in 0.052 s
15/02/05 12:49:40 INFO scheduler.DAGScheduler: Job 3 finished: lookup at <console>:15, took 0.246055 s
res5: Seq[Int] = WrappedArray(1, 2)
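`lookup(key)` on a pair RDD returns the values of all pairs whose key matches. A sketch of the same behavior on a plain sequence (the helper `lookupSeq` is illustrative only):

```scala
// lookup(key): collect the values of every pair whose key matches
object LookupDemo extends App {
  def lookupSeq[K, V](pairs: Seq[(K, V)], key: K): Seq[V] =
    pairs.collect { case (k, v) if k == key => v }

  val vs = lookupSeq(List(("a", 1), ("a", 2), ("b", 3), ("b", 4)), "a")
  println(vs) // List(1, 2)
}
```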

scala>