Writing a standalone application to remove duplicate data
Create the project directory /usr/local/spark/mycode/remdup and, inside it, the standard sbt source tree:
$ mkdir -p src/main/scala
Then, under /usr/local/spark/mycode/remdup/src/main/scala, create a file named remdup.scala with the following contents:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.HashPartitioner

object RemDup {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("RemDup")
    val sc = new SparkContext(conf)
    val dataFile = "file:///home/charles/data"
    val data = sc.textFile(dataFile, 2)
    val res = data.filter(_.trim().length > 0)   // drop blank lines
      .map(line => (line.trim, ""))              // key each record by its trimmed text
      .partitionBy(new HashPartitioner(1))       // one partition, so the output is a single sorted file
      .groupByKey()                              // collapse duplicate keys
      .sortByKey()                               // sort the distinct lines
      .keys
    res.saveAsTextFile("result")
    sc.stop()
  }
}
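The pipeline above implements deduplication by hand (key by line, group, take the keys). As a sanity check of that logic, here is a minimal plain-Scala sketch of the same transformation on an ordinary collection; `RemDupSketch` and the sample lines are hypothetical, not part of the lab:

```scala
// Plain-Scala sketch of the RemDup pipeline (no Spark needed):
// filter blank lines, trim, drop duplicates, sort -- the same effect
// the RDD chain achieves with groupByKey().sortByKey().keys.
object RemDupSketch {
  def dedup(lines: Seq[String]): Seq[String] =
    lines.filter(_.trim.nonEmpty) // like data.filter(_.trim().length > 0)
      .map(_.trim)                // like map(line => (line.trim, ""))
      .distinct                   // like groupByKey().keys
      .sorted                     // like sortByKey()

  def main(args: Array[String]): Unit = {
    val sample = Seq("20170101 x", "20170102 y", "20170101 x", "  ")
    RemDupSketch.dedup(sample).foreach(println)
  }
}
```

On an RDD, the same effect could also be had with `distinct()` followed by `sortBy(identity)`; the lab's version is written out step by step to show the partitioning and grouping explicitly.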
Next, in the /usr/local/spark/mycode/remdup directory, create a file named simple.sbt with the following contents:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
In the /usr/local/spark/mycode/remdup directory, run the following command to package the program:
$ sudo /usr/local/sbt/sbt package
Finally, still in /usr/local/spark/mycode/remdup, run the following command to submit the program:
$ /usr/local/spark/bin/spark-submit --class "RemDup" /usr/local/spark/mycode/remdup/target/scala-2.11/simple-project_2.11-1.0.jar
Because saveAsTextFile("result") uses a relative path, the output lands in the directory from which spark-submit was run: the result files can then be found under /usr/local/spark/mycode/remdup/result.
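To sanity-check the job, the distinct lines Spark writes to result/part-* should match what coreutils produces for the same input. The sample file below is hypothetical; substitute your actual data file:

```shell
# Hypothetical sample input (two distinct lines, one duplicate, one blank).
printf '20170101 x\n20170102 y\n20170101 x\n\n' > /tmp/remdup_sample.txt

# Coreutils baseline: drop blank lines, then take sorted unique lines.
# This should agree line-for-line with the Spark job's output:
#   cat /usr/local/spark/mycode/remdup/result/part-*
grep -v '^[[:space:]]*$' /tmp/remdup_sample.txt | sort -u
```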