Spark Learning 7: Setting up a local Spark build environment in IDEA and running the job on a cluster

Time: 2022-04-21 12:45:20

More code: https://github.com/xubo245/SparkLearning



Environment:

Local: Windows 7 64-bit + IDEA 15.0.4 + Scala 2.10.5

Cluster: Ubuntu + Spark 1.5.2


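1. Install Scala 2.10.5 and configure its environment variables. JDK 1.7 must also be installed, again with its environment variables set. Many tutorials cover this, so it is not described in detail here.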
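To confirm that step 1 took effect, a small check along the following lines can be run from the command line or from IDEA. This is only a sketch: the object name EnvCheck and the variable names JAVA_HOME and SCALA_HOME are assumptions based on common convention, not something the original setup specifies.

```scala
object EnvCheck {
  // Collects the values this setup depends on; unset variables are reported as such.
  def report(env: Map[String, String]): Seq[(String, String)] = Seq(
    "Scala version" -> scala.util.Properties.versionNumberString,
    "Java version"  -> System.getProperty("java.version"),
    "JAVA_HOME"     -> env.getOrElse("JAVA_HOME", "<not set>"),   // assumed variable name
    "SCALA_HOME"    -> env.getOrElse("SCALA_HOME", "<not set>")   // assumed variable name
  )

  def main(args: Array[String]): Unit =
    report(sys.env).foreach { case (name, value) => println(name + ": " + value) }
}
```

For the environment described here, the Scala version line should start with 2.10 and the Java version line with 1.7.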


2. Install IDEA 15.0.4 locally:

https://www.jetbrains.com/idea/download/#section=windows


3. Install the Scala plugin:

http://plugins.jetbrains.com/plugin/?idea&id=1347

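Searching for scala directly under File -> Settings -> Plugins in IDEA 15.0.4 returned nothing, probably because of a network problem. You can download the plugin from the URL above and put it in the plugins directory of the IDEA installation. After restarting IDEA the Scala plugin shows up, but there is still no Scala option under New Project.

So I deleted it and added http://www.jetbrains.net/confluence/display/SCA/Scala+Plugin+for+IntelliJ+IDEA under Plugins in Settings.

After that, searching under Install JetBrains plugin made it possible to install scala 2.2.0 (the plugin version).

Spark 1.5.2 uses Scala 2.10, and spark-assembly-1.5.2-hadoop2.6.0.jar is also built for Scala 2.10.

So I went to the directory where the plugin had just been installed, C:\Users\xubo\.IdeaIC15\config\plugins (my IDEA's default plugin location), saved the existing scala folder as scala2, and unpacked the scala2.10 download from http://plugins.jetbrains.com/plugin/?idea&id=1347 into that directory.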


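4. Restart IDEA. You can now create a Scala project and import spark-assembly-1.5.2-hadoop2.6.0.jar to compile Spark programs locally:

The hadoop-2.6.0 runtime files must also be installed and their environment variables configured, and you also need the correct ...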
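Before running a Spark program, it can help to check that the jar and the Hadoop files from step 4 are visible to the JVM. The following is a minimal sketch under stated assumptions: the object name SetupCheck is hypothetical, and HADOOP_HOME is assumed to be the environment variable used for the Hadoop runtime files (the original does not name it).

```scala
import java.io.File
import scala.util.Try

object SetupCheck {
  // True if the named class can be loaded, i.e. its jar is on the classpath.
  def isOnClasspath(className: String): Boolean =
    Try(Class.forName(className)).isSuccess

  // Returns the Hadoop directory if the (assumed) HADOOP_HOME variable is set and non-empty.
  def hadoopHome(env: Map[String, String]): Option[String] =
    env.get("HADOOP_HOME").filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    println("SparkContext on classpath: " + isOnClasspath("org.apache.spark.SparkContext"))
    hadoopHome(sys.env) match {
      case Some(dir) =>
        println("HADOOP_HOME = " + dir)
        println("bin directory exists: " + new File(dir, "bin").isDirectory)
      case None =>
        println("HADOOP_HOME is not set")
    }
  }
}
```

If the first line prints false, the spark-assembly jar has probably not been added to the project's dependencies.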



Example: SparkPi.scala, copied from the Spark source, with setMaster added

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package scalaTest

import scala.math.random

import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Spark Pi ").setMaster("local")
    val spark = new SparkContext(conf)
    val slices = if (args.length > 0) args(0).toInt else 2
    println("slices:\n" + slices)
    println("args.length:\n" + args.length)
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
// scalastyle:on println
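The estimate in SparkPi is a Monte Carlo one: points are drawn uniformly from the square [-1, 1] x [-1, 1], and the fraction that falls inside the unit circle approaches pi/4. The same logic can be sketched without Spark, which makes it easy to try outside a cluster. The names LocalPi and estimatePi are hypothetical, and a seeded scala.util.Random is used here in place of scala.math.random so that runs are repeatable.

```scala
import scala.util.Random

object LocalPi {
  // Samples n - 1 points (mirroring `1 until n` in SparkPi) and counts those inside the unit circle.
  def estimatePi(n: Int, rng: Random): Double = {
    val inside = (1 until n).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y < 1
    }
    4.0 * inside / n
  }

  def main(args: Array[String]): Unit =
    println("Pi is roughly " + estimatePi(200000, new Random(42)))
}
```

SparkPi does the same counting, but splits the range across `slices` partitions and sums the per-partition results with reduce.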

Local run result:

D:\1win7\java\jdk\bin\java -Didea.launcher.port=7534 "-Didea.launcher.bin.path=D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\bin" -Dfile.encoding=UTF-8 -classpath "D:\1win7\java\jdk\jre\lib\charsets.jar;D:\1win7\java\jdk\jre\lib\deploy.jar;D:\1win7\java\jdk\jre\lib\ext\access-bridge-64.jar;D:\1win7\java\jdk\jre\lib\ext\dnsns.jar;D:\1win7\java\jdk\jre\lib\ext\jaccess.jar;D:\1win7\java\jdk\jre\lib\ext\localedata.jar;D:\1win7\java\jdk\jre\lib\ext\sunec.jar;D:\1win7\java\jdk\jre\lib\ext\sunjce_provider.jar;D:\1win7\java\jdk\jre\lib\ext\sunmscapi.jar;D:\1win7\java\jdk\jre\lib\ext\zipfs.jar;D:\1win7\java\jdk\jre\lib\javaws.jar;D:\1win7\java\jdk\jre\lib\jce.jar;D:\1win7\java\jdk\jre\lib\jfr.jar;D:\1win7\java\jdk\jre\lib\jfxrt.jar;D:\1win7\java\jdk\jre\lib\jsse.jar;D:\1win7\java\jdk\jre\lib\management-agent.jar;D:\1win7\java\jdk\jre\lib\plugin.jar;D:\1win7\java\jdk\jre\lib\resources.jar;D:\1win7\java\jdk\jre\lib\rt.jar;D:\1win7\scala;D:\1win7\scala\lib;D:\all\idea\scala2\out\production\scala2;G:\149\spark-assembly-1.5.2-hadoop2.6.0.jar;D:\1win7\scala\lib\scala-actors-migration.jar;D:\1win7\scala\lib\scala-actors.jar;D:\1win7\scala\lib\scala-library.jar;D:\1win7\scala\lib\scala-reflect.jar;D:\1win7\scala\lib\scala-swing.jar;D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain scalaTest.SparkPi
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/03/03 17:19:19 INFO SparkContext: Running Spark version 1.5.2
16/03/03 17:19:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/03 17:19:21 INFO SecurityManager: Changing view acls to: xubo
16/03/03 17:19:21 INFO SecurityManager: Changing modify acls to: xubo
16/03/03 17:19:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xubo); users with modify permissions: Set(xubo)
16/03/03 17:19:22 INFO Slf4jLogger: Slf4jLogger started
16/03/03 17:19:22 INFO Remoting: Starting remoting
16/03/03 17:19:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@202.38.84.241:52826]
16/03/03 17:19:22 INFO Utils: Successfully started service 'sparkDriver' on port 52826.
16/03/03 17:19:22 INFO SparkEnv: Registering MapOutputTracker
16/03/03 17:19:22 INFO SparkEnv: Registering BlockManagerMaster
16/03/03 17:19:22 INFO DiskBlockManager: Created local directory at C:\Users\xubo\AppData\Local\Temp\blockmgr-193ae298-f771-488a-92ee-60c4e94ca9d1
16/03/03 17:19:22 INFO MemoryStore: MemoryStore started with capacity 730.6 MB
16/03/03 17:19:22 INFO HttpFileServer: HTTP File server directory is C:\Users\xubo\AppData\Local\Temp\spark-4b618306-ea29-4c02-a891-754af4d84648\httpd-0a2aa0cd-b7f2-453b-983c-482852013882
16/03/03 17:19:22 INFO HttpServer: Starting HTTP Server
16/03/03 17:19:22 INFO Utils: Successfully started service 'HTTP file server' on port 52827.
16/03/03 17:19:22 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/03 17:19:22 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/03 17:19:22 INFO SparkUI: Started SparkUI at http://202.38.84.241:4040
16/03/03 17:19:23 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
16/03/03 17:19:23 INFO Executor: Starting executor ID driver on host localhost
16/03/03 17:19:23 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52834.
16/03/03 17:19:23 INFO NettyBlockTransferService: Server created on 52834
16/03/03 17:19:23 INFO BlockManagerMaster: Trying to register BlockManager
16/03/03 17:19:23 INFO BlockManagerMasterEndpoint: Registering block manager localhost:52834 with 730.6 MB RAM, BlockManagerId(driver, localhost, 52834)
16/03/03 17:19:23 INFO BlockManagerMaster: Registered BlockManager
slices:
2
args.length:
0
16/03/03 17:19:24 INFO SparkContext: Starting job: main at NativeMethodAccessorImpl.java:-2
16/03/03 17:19:24 INFO DAGScheduler: Got job 0 (main at NativeMethodAccessorImpl.java:-2) with 2 output partitions
16/03/03 17:19:24 INFO DAGScheduler: Final stage: ResultStage 0(main at NativeMethodAccessorImpl.java:-2)
16/03/03 17:19:24 INFO DAGScheduler: Parents of final stage: List()
16/03/03 17:19:24 INFO DAGScheduler: Missing parents: List()
16/03/03 17:19:24 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at main at NativeMethodAccessorImpl.java:-2), which has no missing parents
16/03/03 17:19:24 INFO MemoryStore: ensureFreeSpace(1856) called with curMem=0, maxMem=766075207
16/03/03 17:19:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1856.0 B, free 730.6 MB)
16/03/03 17:19:24 INFO MemoryStore: ensureFreeSpace(1198) called with curMem=1856, maxMem=766075207
16/03/03 17:19:24 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1198.0 B, free 730.6 MB)
16/03/03 17:19:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:52834 (size: 1198.0 B, free: 730.6 MB)
16/03/03 17:19:24 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:861
16/03/03 17:19:24 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at main at NativeMethodAccessorImpl.java:-2)
16/03/03 17:19:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
16/03/03 17:19:24 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 2085 bytes)
16/03/03 17:19:24 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
16/03/03 17:19:24 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1031 bytes result sent to driver
16/03/03 17:19:24 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2085 bytes)
16/03/03 17:19:24 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
16/03/03 17:19:24 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 190 ms on localhost (1/2)
16/03/03 17:19:24 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1031 bytes result sent to driver
16/03/03 17:19:24 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 33 ms on localhost (2/2)
16/03/03 17:19:24 INFO DAGScheduler: ResultStage 0 (main at NativeMethodAccessorImpl.java:-2) finished in 0.230 s
16/03/03 17:19:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/03/03 17:19:24 INFO DAGScheduler: Job 0 finished: main at NativeMethodAccessorImpl.java:-2, took 0.545201 s
Pi is roughly 3.14548
16/03/03 17:19:24 INFO SparkUI: Stopped Spark web UI at http://202.38.84.241:4040
16/03/03 17:19:24 INFO DAGScheduler: Stopping DAGScheduler
16/03/03 17:19:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/03/03 17:19:24 INFO MemoryStore: MemoryStore cleared
16/03/03 17:19:24 INFO BlockManager: BlockManager stopped
16/03/03 17:19:24 INFO BlockManagerMaster: BlockManagerMaster stopped
16/03/03 17:19:24 INFO SparkContext: Successfully stopped SparkContext
16/03/03 17:19:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/03/03 17:19:24 INFO ShutdownHookManager: Shutdown hook called
16/03/03 17:19:24 INFO ShutdownHookManager: Deleting directory C:\Users\xubo\AppData\Local\Temp\spark-4b618306-ea29-4c02-a891-754af4d84648

Process finished with exit code 0


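5. Package the code as a jar and upload it to the cluster. For details, see the book "Spark大数据应用", p. 123.

Roughly: File -> Project Structure -> Artifacts, then choose JAR -> From modules with dependencies...

Select the class. The scala and spark packages can be removed from the artifact, otherwise the jar becomes very large. Finally, choose Build -> Build Artifacts in IDEA to generate the jar, copy it to the cluster, and run it there.

Run script: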

#!/usr/bin/env bash
spark-submit --name SparkPi \
--class scalaTest.SparkPi \
--master spark://219.219.220.149:7077 \
--executor-memory 512M \
--total-executor-cores 22 scala2.jar
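One thing worth noting about this script: the Spark configuration documentation states that properties set directly on SparkConf take precedence over flags passed to spark-submit. With setMaster("local") hardcoded in SparkPi, the --master value is therefore likely overridden, and the job may run in local mode even when submitted on the cluster. A minimal sketch of one way around this follows. It assumes that spark-submit exposes its --master value as the spark.master JVM system property (my understanding of client mode, not something the original states); MasterChoice and chooseMaster are hypothetical names.

```scala
object MasterChoice {
  // Uses the master supplied from outside (e.g. by spark-submit) if present,
  // and falls back to the given value only when none was supplied.
  def chooseMaster(props: scala.collection.Map[String, String], fallback: String): String =
    props.getOrElse("spark.master", fallback)

  def main(args: Array[String]): Unit =
    println("master = " + chooseMaster(sys.props, "local"))
}
```

In SparkPi this would replace the hardcoded call, for example `.setMaster(MasterChoice.chooseMaster(sys.props, "local"))`, so that the program still runs inside IDEA but picks up the cluster master when submitted.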


Location: /home/hadoop/cloud/testByXubo/spark/backupSuccess/ideaSparkPi/1


Execution result:

hadoop@Master:~/cloud/testByXubo/spark/backupSuccess/ideaSparkPi/1$ ./submitJob.sh 
slices:
2
args.length:
0
Pi is roughly 3.14344