I am not a system administrator, but I may need to do some administrative tasks and hence need some help.
We have a (remote) Hadoop cluster and people usually run map-reduce jobs on the cluster.
I am planning to install Apache Spark on the cluster so that all the machines in the cluster may be utilized. This should be possible; I have read at http://spark.apache.org/docs/latest/spark-standalone.html: "You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines..."
If you have done this before, please give me the detailed steps for setting up the Spark cluster.
1 Answer
#1
If you already have Hadoop installed on your cluster and want to run Spark on YARN, it's very easy:
Step 1: Find the YARN master node (i.e. the one that runs the ResourceManager). The following steps are to be performed on the master node only.
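If you are not sure which node that is, one way to check is to look up the ResourceManager address in the YARN configuration (a sketch, assuming a standard layout where the Hadoop configs live under $HADOOP_HOME/etc/hadoop and yarn.resourcemanager.hostname is set explicitly):
# Print the configured ResourceManager host from yarn-site.xml
grep -A1 'yarn.resourcemanager.hostname' $HADOOP_HOME/etc/hadoop/yarn-site.xml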
Step 2: Download the Spark tgz package and extract it somewhere.
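For example (a sketch; the exact URL and version are assumptions, so pick the release that matches your Hadoop version from the Spark downloads page):
# Download a pre-built Spark package (1.5.1 for Hadoop 2.6 assumed here)
wget https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz
# Extract it under /opt (or any directory you have write access to)
tar -xzf spark-1.5.1-bin-hadoop2.6.tgz -C /opt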
Step 3: Define these environment variables, in .bashrc for example:
# Spark variables
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_HOME=<extracted_spark_package>
export PATH=$PATH:$SPARK_HOME/bin
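After editing .bashrc, reload it and check that the variables resolve (this assumes $HADOOP_HOME is already set on the node, since YARN_CONF_DIR is derived from it):
# Reload the shell configuration and verify the new variables
source ~/.bashrc
echo $SPARK_HOME $YARN_CONF_DIR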
Step 4: Run your Spark job with the --master option set to yarn-client or yarn-cluster:
spark-submit \
--master yarn-client \
--class org.apache.spark.examples.JavaSparkPi \
$SPARK_HOME/lib/spark-examples-1.5.1-hadoop2.6.0.jar \
100
This particular example uses a pre-compiled example job which comes with the Spark installation.
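To confirm the job is actually running on YARN rather than locally, you can list the applications known to the ResourceManager (a sketch; the ResourceManager web UI is also usually reachable on port 8088 unless your cluster overrides the default):
# The submitted Spark job should show up in the YARN application list
yarn application -list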
You can read this blog post I wrote for more details on Hadoop and Spark installation on a cluster.
You can read the post which follows to see how to compile and run your own Spark job in Java. If you want to code jobs in Python or Scala, it's convenient to use a notebook like IPython or Zeppelin. Read more about how to use those with your Hadoop-Spark cluster here.
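For completeness, a Python script is submitted the same way; a minimal sketch (my_pi.py here is a hypothetical script of your own, not something shipped with Spark):
# Submit a Python job to YARN in client mode
spark-submit \
--master yarn-client \
my_pi.py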