Spark installation and configuration:
We write and run Spark in Scala, so the Scala environment has to be set up first.
1. Set up the Scala environment
Install the Scala plugin in your IDE, create a Maven project, and add the Scala dependencies and build plugin.
Scala dependencies
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.12</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-compiler</artifactId>
    <version>2.11.12</version>
</dependency>
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-reflect</artifactId>
    <version>2.11.12</version>
</dependency>
Scala build plugin
<build>
    <plugins>
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
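To confirm that the Scala dependencies and build plugin are wired up correctly, a plain Scala object can be compiled and run before touching Spark (a minimal sketch; the object name Demo0Hello is just for illustration):

import scala.util.Properties

object Demo0Hello {
  def main(args: Array[String]): Unit = {
    // Print the Scala version bundled with the project to confirm the setup works
    println(s"Hello, ${Properties.versionString}")
  }
}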
2. Add the spark-core dependency (the _2.11 suffix must match the Scala major version used above, here 2.11)
<!-- spark-core dependency -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
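To verify that spark-core resolves and runs before writing the word-count program, a throwaway local job on an in-memory collection is enough (a sketch under the same local-mode setup; the object name Demo0SparkCheck is assumed, not part of the original steps):

import org.apache.spark.{SparkConf, SparkContext}

object Demo0SparkCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("check")
    val sc = new SparkContext(conf)
    // Print the Spark version and run a trivial job on an in-memory collection
    println(s"Spark version: ${sc.version}")
    println(sc.parallelize(1 to 10).sum())  // expected: 55.0
    sc.stop()
  }
}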
3. Use Spark (code walkthrough)
The following program uses Spark to implement a word-count task:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Demo1WordCount {
  def main(args: Array[String]): Unit = {
    // 1. Create the Spark execution environment
    val conf = new SparkConf()
    // Set the run mode (local = run inside the current JVM, no cluster needed)
    conf.setMaster("local")
    conf.setAppName("wc")
    val sc = new SparkContext(conf)

    // 2. Read the data
    // RDD: Resilient Distributed Dataset (conceptually similar to a List)
    val linesRDD: RDD[String] = sc.textFile("data/lines.txt")

    // Split each line into words (one line becomes multiple records)
    val wordsRDD: RDD[String] = linesRDD.flatMap(_.split(","))
    val kvRDD: RDD[(String, Int)] = wordsRDD.map(word => (word, 1))

    // Count the occurrences of each word
    val countRDD: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y)

    // Save the result (the output directory must not already exist)
    countRDD.saveAsTextFile("data/word_count")
  }
}
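During development it is often handier to print the counts to the console than to write them to disk. The sample contents of data/lines.txt below are assumed purely for illustration:

// data/lines.txt (assumed sample content):
//   java,spark,scala
//   spark,java

// Replace saveAsTextFile with collect + print to inspect the result on the driver
countRDD.collect().foreach(println)
// Expected output for the sample input above (order may vary):
//   (scala,1)
//   (spark,2)
//   (java,2)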