Author: 过往记忆 | Sina Weibo: 左手牵右手TEL | Blog: http://www.iteblog.com/
Title: 《Spark Standalone模式应用程序开发》 (Developing Standalone Spark Applications)
Permalink: http://www.iteblog.com/archives/1041
This article may be reposted, provided the original source and author information are credited with a hyperlink.
In an earlier post on this blog, 《Spark快速入门指南(Quick Start Spark)》, I briefly showed how to explore the Spark API interactively from the Spark shell. This article shows how to use the API to develop standalone (self-contained) applications. Spark applications can be written in three languages: Scala (built with sbt), Java (built with Maven), and Python. Below I implement the same program in each of the three languages:
I. Scala version

The program is as follows:
package scala

/**
 * User: 过往记忆
 * Date: 14-6-10
 * Time: 11:37 PM
 * Blog: http://www.iteblog.com
 * Post URL: http://www.iteblog.com/archives/1041
 * 过往记忆's blog: articles on Hadoop, Hive, Spark, Shark and Flume
 * WeChat account: iteblog_hadoop
 */
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object Test {
  def main(args: Array[String]) {
    val logFile = "file:///spark-bin-0.9.1/README.md"
    val conf = new SparkConf().setAppName("Spark Application in Scala")
    val sc = new SparkContext(conf)
    // Read the file into an RDD with 2 partitions and cache it,
    // since it is scanned twice below.
    val logData = sc.textFile(logFile, 2).cache()

    // Count the lines containing the letters "a" and "b" respectively.
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
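The master is intentionally not set in the code above; it is supplied later via spark-submit. If you would rather run the job straight from sbt or an IDE while experimenting, you can set it on the SparkConf instead. A minimal sketch (the object name LocalTest and the local[4] master are illustrative assumptions, not part of the original program):

import org.apache.spark.{SparkConf, SparkContext}

object LocalTest {
  def main(args: Array[String]) {
    // Assumption: run locally with 4 worker threads instead of relying on
    // the --master flag of spark-submit.
    val conf = new SparkConf()
      .setAppName("Spark Application in Scala")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val logData = sc.textFile("file:///spark-bin-0.9.1/README.md", 2).cache()
      println("Lines with a: " + logData.filter(_.contains("a")).count())
    } finally {
      sc.stop() // release the context when running repeatedly from an IDE
    }
  }
}

When the jar is submitted with spark-submit, leave setMaster out so that the --master flag on the command line (or the cluster's default) takes effect.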
To build Test.scala we need an .sbt build definition, which plays a role similar to Maven's pom.xml. Here we create a file named scala.sbt with the following contents:
name := "Spark application in Scala"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
Build it:
# sbt/sbt package
[info] Done packaging.
[success] Total time: 270 s, completed Jun 11, 2014 1:05:54 AM
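A note on project layout: by default sbt picks up .scala source files both from the project's base directory and from src/main/scala, so for this one-file example it is enough to place Test.scala next to scala.sbt in the project root (this describes stock sbt behavior, nothing Spark-specific); the packaged jar then ends up under target/scala-2.10/.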
II. Java version
/**
 * User: 过往记忆
 * Date: 14-6-10
 * Time: 11:37 PM
 * Blog: http://www.iteblog.com
 * Post URL: http://www.iteblog.com/archives/1041
 * 过往记忆's blog: articles on Hadoop, Hive, Spark, Shark and Flume
 * WeChat account: iteblog_hadoop
 */
/* SimpleApp.java */
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;

public class SimpleApp {
    public static void main(String[] args) {
        String logFile = "file:///spark-bin-0.9.1/README.md";
        SparkConf conf = new SparkConf().setAppName("Spark Application in Java");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        // Count the lines containing the letters "a" and "b" respectively.
        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("a"); }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) { return s.contains("b"); }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);
    }
}
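Because Java 7 has no lambdas, the filter predicate is written as an anonymous implementation of org.apache.spark.api.java.function.Function; on Java 8 the same call can be written more compactly as a lambda.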
This program counts how many lines of README.md contain the letter a and how many contain the letter b. The pom.xml for the project is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">

    <modelVersion>4.0.0</modelVersion>

    <groupId>spark</groupId>
    <artifactId>spark</artifactId>
    <version>1.0</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.0.0</version>
        </dependency>
    </dependencies>
</project>
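The _2.10 suffix on the spark-core artifact is the Scala binary version the Spark jars were built against; it should line up with the scalaVersion (2.10.4) used in the sbt build above.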
Build the project with Maven:
# mvn install
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 5.815s
[INFO] Finished at: Wed Jun 11 00:01:57 CST 2014
[INFO] Final Memory: 13M/32M
[INFO] ------------------------------------------------------------------------
III. Python version
#
# User: 过往记忆
# Date: 14-6-10
# Time: 11:37 PM
# Blog: http://www.iteblog.com
# Post URL: http://www.iteblog.com/archives/1041
# 过往记忆's blog: articles on Hadoop, Hive, Spark, Shark and Flume
# WeChat account: iteblog_hadoop
#
from pyspark import SparkContext

logFile = "file:///spark-bin-0.9.1/README.md"
sc = SparkContext("local", "Spark Application in Python")
logData = sc.textFile(logFile).cache()

# Count the lines containing the letters 'a' and 'b' respectively.
numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print "Lines with a: %i, lines with b: %i" % (numAs, numBs)
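Note that this script hard-codes the master as "local" in the SparkContext constructor. Configuration set directly in application code normally takes precedence over the --master flag passed to spark-submit, so if you want to control the master from the command line it is cleaner to construct the context with only an application name (for example SparkContext(appName="Spark Application in Python")) and let spark-submit supply the master.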
IV. Running and testing the programs

The programs were tested against Spark 1.0.0 in local (single-machine) mode, as follows:
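In the commands below, local[4] tells Spark to run the job locally with 4 worker threads.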
1. Testing the Scala program
# bin/spark-submit --class "scala.Test" \
    --master local[4] \
    target/scala-2.10/simple-project_2.10-1.0.jar

14/06/11 01:07:53 INFO spark.SparkContext: Job finished:
count at Test.scala:18, took 0.019705 s
Lines with a: 62, Lines with b: 35
2. Testing the Java program
# bin/spark-submit --class "SimpleApp" \
    --master local[4] \
    target/spark-1.0-SNAPSHOT.jar

14/06/11 00:49:14 INFO spark.SparkContext: Job finished:
count at SimpleApp.java:22, took 0.019374 s
Lines with a: 62, lines with b: 35
3. Testing the Python program
# bin/spark-submit --master local[4] \
    simple.py

Lines with a: 62, lines with b: 35