Introduction
Apache Spark is a general-purpose cluster computing system for processing big data workloads. What sets Spark apart from its predecessors, such as MapReduce, is its speed, ease of use, and sophisticated analytics.
Apache Spark was originally developed at AMPLab, UC Berkeley, in 2009. It was made open source in 2010 under the BSD license and switched to the Apache 2.0 license in 2013. Toward the later part of 2013, the creators of Spark founded Databricks to focus on Spark's development and future releases.
In terms of speed, Spark can achieve sub-second latency on big data workloads. To achieve such low latency, Spark uses memory for storage as well as computation: whereas MapReduce uses memory primarily for the actual computation, Spark uses it both to compute and to store objects.
Spark also provides a unified runtime connecting to various big data storage sources, such as HDFS, Cassandra, HBase, and S3. It also provides a rich set of higher-level libraries for different big data compute tasks, such as machine learning, SQL processing, graph processing, and real-time streaming. These libraries make development faster and can be combined in an arbitrary fashion.
Though Spark is written in Scala, and this book focuses only on recipes in Scala, Spark also supports Java and Python.
Spark is an open source community project and, unlike Hadoop, which has multiple distributions available with vendor enhancements, deployments use the pure open source Apache distribution.
The Spark runtime runs on top of a variety of cluster managers, including YARN (Hadoop's compute framework), Mesos, and Spark's own cluster manager, called standalone mode. Tachyon is a memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks; in short, it is an off-heap, in-memory storage layer that helps share data across jobs and users. Mesos is a cluster manager that is evolving into a data center operating system. YARN is Hadoop's compute framework and has a robust resource management feature that Spark can seamlessly use.
Installing Spark from binaries
Spark binaries can be downloaded from http://spark.apache.org/downloads.html.
1. download binaries
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.4.tgz
2. unpack binaries
tar -zxf spark-1.4.0-bin-hadoop2.4.tgz
3. rename the folder
sudo mv spark-1.4.0-bin-hadoop2.4 spark
4. move the configuration folder to /etc/spark
sudo mv spark/conf /etc/spark
5. create installation directory under /opt
sudo mkdir -p /opt/infoobjects
6. move the spark directory to /opt/infoobjects
sudo mv spark /opt/infoobjects/
7. change ownership of the spark home to root
sudo chown -R root:root /opt/infoobjects/spark
8. change permission for the spark home
sudo chmod -R 755 /opt/infoobjects/spark
9. move to the spark home
cd /opt/infoobjects/spark
10. create the symbolic link
sudo ln -s /etc/spark conf
11. append to PATH in .bashrc
echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc
12. open a new terminal
13. create a log directory in /var
sudo mkdir -p /var/log/spark
14. make hduser the owner of the spark log
sudo chown -R hduser:hduser /var/log/spark
15. create the spark tmp directory
mkdir /tmp/spark
16. configure spark
cd /etc/spark
echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop" >> spark-env.sh
echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
Building the Spark source code with Maven
Java 1.6 & Maven 3.x
1. increase MaxPermSize so that the build does not run out of PermGen space
echo "export _JAVA_OPTIONS=\"-XX:MaxPermSize=1G\"" >> /home/hduser/.bashrc
2. open a new terminal and download the spark source code from GitHub
wget https://github.com/apache/spark/archive/branch-1.4.zip
3. unpack the archive
unzip branch-1.4.zip
4. rename the extracted folder and move to it
mv spark-branch-1.4 spark
cd spark
5. compile the sources
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
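If you prefer a distributable tarball similar to the pre-built binaries from the previous recipe, the source tree also ships a make-distribution.sh script; the following is a sketch using the same Maven profiles as above (an optional alternative, not required for the next steps):
# build a binary distribution tarball with the same profiles as the mvn command
./make-distribution.sh --tgz -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive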
6. move the conf folder to /etc/spark
cd ..
sudo mv spark/conf /etc/spark
7. move the spark directory to /opt
sudo mv spark /opt/infoobjects/spark
8. change ownership of the spark home to root
sudo chown -R root:root /opt/infoobjects/spark
9. change permission for the spark home
sudo chmod -R 755 /opt/infoobjects/spark
10. move to the spark home
cd /opt/infoobjects/spark
11. create the symbolic link
sudo ln -s /etc/spark conf
12. append to PATH in .bashrc
echo "export PATH=$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc
13. open a new terminal
14. create a log directory in /var
sudo mkdir -p /var/log/spark
15. make hduser the owner of the spark log
sudo chown -R hduser:hduser /var/log/spark
16. create the spark tmp directory
mkdir /tmp/spark
17. configure spark
cd /etc/spark
echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/Hadoop" >> spark-env.sh
echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
Launching Spark on Amazon EC2
Getting ready
1. log in to your Amazon AWS account (http://aws.amazon.com)
2. click on Security Credentials under your account name in the top-right corner
3. click on Access Keys and Create New Access Key
4. get access key id and secret access key
5. go to Services | EC2
6. click on Key Pairs in left-hand menu under NETWORK & SECURITY
7. click on Create Key Pair and enter kp-spark as key-pair name
8. download the private key file and copy it to the /home/hduser/keypairs folder
9. set permissions on the key file to 600
chmod 600 /home/hduser/keypairs/kp-spark.pem
10. set environment variables to reflect access key ID and secret access key
echo "export AWS_ACCESS_KEY_ID=\"{ACCESS_KEY_ID}\"" >> /home/hduser/.bashrc
echo "export AWS_SECRET_ACESS_KEY=\"{AWS_SECRET_ACESS_KEY}\"" >> /home/hduser/.bashrc
echo "export PATH=$PATH:/opt/infoobject/spark/ec2" >> /home/hduser/.bashrc
1. launch the cluster
cd /home/hduser
spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>
2. launch the cluster with example values
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 -s 3 launch spark-cluster
3. specify the availability zone if the default availability zones are not available
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem -z us-east-1b --hadoop-major-version 2 -s 3 launch spark-cluster
4. attach an EBS volume if you need to retain data after the instance shuts down
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --hadoop-major-version 2 --ebs-vol-size 10 -s 3 launch spark-cluster
5. use Amazon spot instances
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem --spot-price=0.15 --hadoop-major-version 2 -s 3 launch spark-cluster
6. check the status of the cluster
the web UI URL is printed at the end of the launch output
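The master hostname can also be retrieved later with the get-master action (a sketch using the same key pair as above; the standalone master web UI listens on port 8080 by default):
# prints the master's public hostname; then open http://<master-hostname>:8080 in a browser
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem get-master spark-cluster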
7. connect to the master node
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem login spark-cluster
8. check the HDFS version in the ephemeral HDFS instance
ephemeral-hdfs/bin/hadoop version
9. check the HDFS version in the persistent HDFS instance
persistent-hdfs/bin/hadoop version
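When the cluster is no longer needed, the same script can stop, restart, or permanently destroy it (a sketch using the same key pair and cluster name as above; destroy terminates the instances, so any data on ephemeral storage is lost):
# stop the instances without terminating them
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem stop spark-cluster
# bring a stopped cluster back up
spark-ec2 -k kp-spark -i /home/hduser/keypairs/kp-spark.pem start spark-cluster
# terminate the cluster permanently
spark-ec2 destroy spark-cluster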
Deploying on a cluster in standalone mode
Compute resources in a distributed environment need to be managed so that resource utilization is efficient and every job gets a fair chance to run. Spark comes along with its own cluster manager conveniently called standalone mode. Spark also supports working with YARN and Mesos cluster managers.
Which cluster manager to choose is mostly driven by legacy concerns and by whether other frameworks, such as MapReduce, share the same compute resource pool. If your cluster has legacy MapReduce jobs running and not all of them can be converted to Spark jobs, it is a good idea to use YARN as the cluster manager. Mesos is emerging as a data center operating system that conveniently manages jobs across frameworks, and it works very well with Spark.
If Spark is the only framework in your cluster, standalone mode is good enough. As Spark evolves as a technology, you will see more and more use cases of Spark being used as the standalone framework serving all big data compute needs. For example, some jobs may be using Apache Mahout at present because MLlib does not yet have the specific machine-learning algorithm that the job needs; as soon as MLlib provides that algorithm, the job can be moved to Spark.
The example cluster has one master and five slaves:
Master
m1.zettabytes.com
Slaves
s1.zettabytes.com
s2.zettabytes.com
s3.zettabytes.com
s4.zettabytes.com
s5.zettabytes.com
1. install the Spark binaries on both the master and slave machines, and add /opt/infoobjects/spark/sbin to the PATH on every node
echo "export PATH=$PATH:/opt/infoobjects/spark/sbin" >> /home/hduser/.bashrc
2. ssh to master and start the standalone master server
start-master.sh
3. SSH to each slave and start a worker, pointing it at the master
spark-class org.apache.spark.deploy.worker.Worker spark://m1.zettabytes.com:7077
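When starting a worker by hand like this, the resources it offers can optionally be capped (a sketch; the core and memory values are placeholders):
# offer only 2 cores and 6 GB of memory to the cluster from this node
spark-class org.apache.spark.deploy.worker.Worker --cores 2 --memory 6g spark://m1.zettabytes.com:7077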
4. create conf/slaves file on a master node and add one line per slave hostname
echo "s1.zettabytes.com" >> conf/slaves
echo "s2.zettabytes.com" >> conf/slaves
echo "s3.zettabytes.com" >> conf/slaves
echo "s4.zettabytes.com" >> conf/slaves
echo "s5.zettabytes.com" >> conf/slaves
Once the slaves file is in place, the following scripts (which use SSH, so passwordless SSH from the master to the slaves should be set up) start and stop the cluster:
start-master.sh: starts a master instance on the machine it is run on
start-slaves.sh: starts a worker instance on each machine listed in conf/slaves
start-all.sh: starts both the master and the workers
stop-master.sh: stops the master
stop-slaves.sh: stops the workers on all machines listed in conf/slaves
stop-all.sh: stops both the master and the workers
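Per-node defaults used by these scripts can also be set in spark-env.sh (/etc/spark/spark-env.sh in this setup) before they are run; a sketch with placeholder values:
cd /etc/spark
# bind the master to its hostname and cap what each worker offers
echo "export SPARK_MASTER_IP=m1.zettabytes.com" >> spark-env.sh
echo "export SPARK_WORKER_CORES=2" >> spark-env.sh
echo "export SPARK_WORKER_MEMORY=6g" >> spark-env.sh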
5. connect an application to the cluster through Scala code (a SparkConf also needs an application name; "myapp" here is a placeholder)
val conf = new SparkConf().setMaster("spark://m1.zettabytes.com:7077").setAppName("myapp")
val sparkContext = new SparkContext(conf)
6. connect to the cluster through the Spark shell
spark-shell --master spark://m1.zettabytes.com:7077
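The amount of cluster resources an application grabs can also be controlled from the command line (a sketch; the values are placeholders):
# use at most 4 cores across the cluster and 2 GB per executor
spark-shell --master spark://m1.zettabytes.com:7077 --total-executor-cores 4 --executor-memory 2g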
Deploying on a cluster with Mesos
Mesos is slowly emerging as a data center operating system to manage all compute resources across a data center. Mesos runs on any machine running the Linux operating system and is built using the same principles as the Linux kernel.
1. add the Mesosphere repository key and repository on Ubuntu (trusty)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF
DISTRO=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
CODENAME=$(lsb_release -cs)
echo "deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main" | sudo tee /etc/apt/sources.list.d/mesosphere.list
2. install mesos
sudo apt-get -y update
sudo apt-get -y install mesos
3. make the Spark binaries available to Mesos and configure the Spark driver to connect to Mesos, as shown in the following steps
4. upload spark binaries to HDFS
hdfs dfs -put spark-1.4.0-bin-hadoop2.4.tgz spark-1.4.0-bin-hadoop2.4.tgz
5. the master URL for a single-master Mesos cluster is mesos://host:5050, and for a ZooKeeper-managed Mesos cluster, it is mesos://zk://host:2181
6. set variables in spark-env.sh
sudo vi spark-env.sh
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://localhost:9000/user/hduser/spark-1.4.0-bin-hadoop2.4.tgz
7. run from a Scala program (again, "myapp" is a placeholder application name)
val conf = new SparkConf().setMaster("mesos://host:5050").setAppName("myapp")
val sparkContext = new SparkContext(conf)
8. run from the Spark shell
spark-shell --master mesos://host:5050
Mesos has two run modes:
Fine-grained: In fine-grained (default) mode, every Spark task runs as a separate Mesos task
Coarse-grained: This mode will launch only one long-running Spark task on each Mesos machine
9. set to run in the coarse-grained mode
conf.set("spark.mesos.coarse", "true")
Deploying on a cluster with YARN
Yet Another Resource Negotiator (YARN) is Hadoop's compute framework that runs on top of HDFS, which is Hadoop's storage layer.
YARN follows the master-slave architecture. The master daemon is called the ResourceManager and the slave daemon is called the NodeManager. Besides these, application life cycle management is handled by the ApplicationMaster, which can be spawned on any slave node and stays alive for the lifetime of an application.
When Spark is run on YARN, ResourceManager performs the role of Spark master and NodeManagers work as executor nodes.
While running Spark with YARN, each Spark executor runs as a YARN container.
1. set the configuration
HADOOP_CONF_DIR: to write to HDFS
YARN_CONF_DIR: to connect to YARN ResourceManager
cd /opt/infoobjects/spark/conf (or /etc/spark)
sudo vi spark-env.sh
export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop
2. launch a Spark application in the yarn-client mode (the driver runs locally on the machine that submits the job)
spark-submit --class path.to.your.Class --master yarn-client [options] <app jar> [app options]
spark-submit --class com.infoobjects.TwitterFireHose --master yarn-client --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
3. launch spark shell in the yarn-client mode
spark-shell --master yarn-client
4. launch in the yarn-cluster mode (the driver runs inside the ApplicationMaster on the cluster, which is better suited for production jobs)
spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
spark-submit --class com.infoobjects.TwitterFireHose --master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 target/sparkio.jar 10
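In the yarn-cluster mode, the driver output goes to the YARN container logs rather than your terminal. Assuming YARN log aggregation is enabled, the logs can be fetched after the application finishes (the application ID is printed by spark-submit and shown in the ResourceManager UI):
# replace <application ID> with the ID reported by spark-submit, e.g. application_..._...
yarn logs -applicationId <application ID>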