[Repost] Spark Quick Start Guide

Out of respect for copyright, the original article is here: http://blog.csdn.net/macyang/article/details/7100523

- What is Spark?

Spark is a MapReduce-like cluster computing framework designed to support
low-latency iterative jobs and interactive use from an interpreter. It is
written in Scala, a high-level language for the JVM, and exposes a clean
language-integrated syntax that makes it easy to write parallel jobs.
Spark runs on top of the Mesos cluster manager.

- Where can Spark be downloaded?

git clone git://github.com/mesos/spark.git

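- How do you build and run Spark?

1) Scala 2.9 or later (either add its bin directory to PATH or set the SCALA_HOME environment variable)

2) Spark supports a local mode and a cluster mode; local mode does not require Mesos

3) To run Spark on a cluster, Mesos must be installed

4) Build and package with the sbt launcher bundled with Spark: sbt/sbt compile, then sbt/sbt assembly

5) Run Spark programs with the run script bundled with Spark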

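- How do you verify that the Spark environment works?

Run the following from the Spark directory:

1) local, single thread: ./run spark.examples.SparkPi local

2) local, multi-core (here with 2 threads): ./run spark.examples.SparkPi local[2]

3) Mesos master on the local machine: ./run spark.examples.SparkPi master@localhost:5050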

- What does the Spark Programming Guide cover?

1) Put the Spark jar (built with sbt/sbt assembly) on the CLASSPATH

2) A Spark application can run in local mode or on Mesos

3) Spark provides two kinds of RDD: Parallelized Collections and Hadoop Datasets. RDDs
are fault-tolerant and can recover partitions lost when a node crashes

4) RDDs support two kinds of operations: transformations and actions. Transformations are
lazy; they are only executed when an action is actually called

5) Spark provides two kinds of shared variables: Broadcast Variables and Accumulators. A
broadcast variable is shipped once to each machine, rather than once per task as happens by default.
An accumulator only supports additive operations such as counts and sums, and only the driver program can read its value
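
Points 3) to 5) can be sketched in a few lines of Scala. This is only an illustration, not code from the Programming Guide: it assumes the early `spark` package API that matches the `spark.examples.SparkPi` class name used above (later Apache releases moved it to `org.apache.spark` and changed some calls), and the object name `QuickStartSketch`, its helper methods, and the sample data are all made up for this example.

```scala
import spark.SparkContext
import spark.SparkContext._

// Hypothetical example object; the names are illustrative only.
object QuickStartSketch {

  // Transformations (filter, map) are lazy; the reduce action triggers them.
  def sumOfEvenSquares(sc: SparkContext): Int = {
    // Parallelized collection: an RDD built from a local Scala collection, in 2 slices.
    val numbers = sc.parallelize(1 to 10, 2)
    // Nothing is computed yet: this only describes the work to be done.
    val evenSquares = numbers.filter(_ % 2 == 0).map(n => n * n)
    // Action: runs the filter and map above and returns the result to the driver.
    evenSquares.reduce(_ + _)
  }

  // Broadcast variable for read-only shared data, accumulator for a counter.
  def countUnnamed(sc: SparkContext): Int = {
    // Shipped once per machine instead of once per task.
    val names = sc.broadcast(Map(1 -> "one", 2 -> "two"))
    // Tasks may only add to it; only the driver program reads the value.
    val unnamed = sc.accumulator(0)
    sc.parallelize(1 to 10, 2).foreach { n =>
      if (!names.value.contains(n)) unnamed += 1
    }
    unnamed.value
  }

  def main(args: Array[String]): Unit = {
    // Master string as passed to ./run above: "local", "local[2]", or a Mesos master.
    val master = if (args.length > 0) args(0) else "local"
    val sc = new SparkContext(master, "QuickStartSketch")
    println("Sum of even squares: " + sumOfEvenSquares(sc))
    println("Numbers without a name: " + countUnnamed(sc))
  }
}
```

Once compiled against the Spark jar from point 1), it should be launchable the same way as the bundled examples, for instance with ./run QuickStartSketch local[2], assuming the compiled class is on the CLASSPATH.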

- How do you write Spark applications?

Read through more Spark examples, such as: http://www.spark-project.org/examples.html

https://github.com/mesos/spark/tree/master/examples

- What if you run into problems?

Search Google for the problem first; if that does not resolve it, ask the authors on the Spark Google group:
http://groups.google.com/group/spark-users?hl=en

- How do you understand Spark in more depth?

Read the Spark paper: http://www.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-82.pdf

Read the Spark source code: https://github.com/mesos/spark
