FOUR spark-shell Interactive Programming

Writing a Standalone Application for Data Deduplication
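
The program reads a text input, drops duplicate (and blank) lines, and writes the distinct lines out in sorted order. As a purely hypothetical illustration (the actual contents of /home/charles/data are not shown here), an input containing the lines

20170101 x
20170102 y
20170101 x

would yield just the two unique lines 20170101 x and 20170102 y in the output.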
 
 
The project directory is /usr/local/spark/mycode/remdup. Inside it, create the source directory:
$ mkdir -p src/main/scala
Then, under /usr/local/spark/mycode/remdup/src/main/scala, create a file named remdup.scala with the following code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.HashPartitioner

object RemDup {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("RemDup")
    val sc = new SparkContext(conf)
    val dataFile = "file:///home/charles/data"
    // Read the input as 2 partitions
    val data = sc.textFile(dataFile, 2)
    // Drop blank lines, key each line by its trimmed text, hash all
    // identical lines into the same (single) partition, group the
    // duplicates together, sort, and keep one copy of each line
    val res = data.filter(_.trim().length > 0)
                  .map(line => (line.trim, ""))
                  .partitionBy(new HashPartitioner(1))
                  .groupByKey()
                  .sortByKey()
                  .keys
    res.saveAsTextFile("result")
  }
}
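
The partitionBy/groupByKey/keys pipeline deduplicates by hashing every identical line into the same partition and then keeping one key per group. For comparison, here is a minimal sketch of the same job using the built-in RDD.distinct() (equivalent logic, not the program packaged below):

val res = data.filter(_.trim.nonEmpty)
              .map(_.trim)
              .distinct(1)      // deduplicate into a single partition
              .sortBy(identity) // sort the unique lines
res.saveAsTextFile("result")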


Next, in the same directory /usr/local/spark/mycode/remdup, create a file named simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
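
A quick note on versions: the scalaVersion and the spark-core version in simple.sbt must match the installed Spark; the prebuilt Spark 2.1.x distributions are built against Scala 2.11. If unsure, both can be checked from spark-shell (assuming spark-shell is on the PATH):

scala> sc.version                    // Spark version, e.g. 2.1.0
scala> util.Properties.versionString // Scala version of the shell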


Then, in /usr/local/spark/mycode/remdup, run the following command to package the program:
$ sudo /usr/local/sbt/sbt package
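
If packaging succeeds, sbt writes the jar to target/scala-2.11/simple-project_2.11-1.0.jar under the project directory (the artifact name follows the name, version, and scalaVersion set in simple.sbt); this is the path used when submitting in the next step.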


Finally, still in /usr/local/spark/mycode/remdup, run the following command to submit the program:
$ /usr/local/spark/bin/spark-submit --class "RemDup" /usr/local/spark/mycode/remdup/target/scala-2.11/simple-project_2.11-1.0.jar


The result files can then be found under /usr/local/spark/mycode/remdup/result.
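
Note that saveAsTextFile writes a directory, not a single file: result/ contains one part-0000x file per partition (a single part-00000 here, since the data was repartitioned to one partition) plus a _SUCCESS marker. The deduplicated lines can be inspected with:
$ cat /usr/local/spark/mycode/remdup/result/part-00000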