I am trying to run the following piece of Spark code written in Scala on Amazon EMR:
import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    println(wordCounts)
  }
}
This is the script I am using to deploy the above code to EMR:
#!/usr/bin/env bash
set -euxo pipefail
cluster_id='j-XXXXXXXXXX'
app_name="HelloWorld"
main_class="TestRunner"
jar_name="HelloWorld-assembly-0.0.1-SNAPSHOT.jar"
jar_path="target/scala-2.11/${jar_name}"
s3_jar_dir="s3://jars/"
s3_jar_path="${s3_jar_dir}${jar_name}"
###################################################
sbt assembly
aws s3 cp ${jar_path} ${s3_jar_dir}
aws emr add-steps --cluster-id ${cluster_id} --steps Type=spark,Name=${app_name},Args=[--deploy-mode,cluster,--master,yarn-cluster,--class,${main_class},${s3_jar_path}],ActionOnFailure=CONTINUE
But this exits after a few minutes without producing any output at all in AWS!
Here's my controller's output:
2016-10-20T21:03:17.043Z INFO Ensure step 3 jar file command-runner.jar
2016-10-20T21:03:17.043Z INFO StepRunner: Created Runner for step 3
INFO startExec 'hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/mscheiber/HelloWorld-assembly-0.0.1-SNAPSHOT.jar'
INFO Environment:
PATH=/sbin:/usr/sbin:/bin:/usr/bin:/usr/local/sbin:/opt/aws/bin
LESS_TERMCAP_md=[01;38;5;208m
LESS_TERMCAP_me=[0m
HISTCONTROL=ignoredups
LESS_TERMCAP_mb=[01;31m
AWS_AUTO_SCALING_HOME=/opt/aws/apitools/as
UPSTART_JOB=rc
LESS_TERMCAP_se=[0m
HISTSIZE=1000
HADOOP_ROOT_LOGGER=INFO,DRFA
JAVA_HOME=/etc/alternatives/jre
AWS_DEFAULT_REGION=us-east-1
AWS_ELB_HOME=/opt/aws/apitools/elb
LESS_TERMCAP_us=[04;38;5;111m
EC2_HOME=/opt/aws/apitools/ec2
TERM=linux
XFILESEARCHPATH=/usr/dt/app-defaults/%L/Dt
runlevel=3
LANG=en_US.UTF-8
AWS_CLOUDWATCH_HOME=/opt/aws/apitools/mon
MAIL=/var/spool/mail/hadoop
LESS_TERMCAP_ue=[0m
LOGNAME=hadoop
PWD=/
LANGSH_SOURCED=1
HADOOP_CLIENT_OPTS=-Djava.io.tmpdir=/mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3/tmp
_=/etc/alternatives/jre/bin/java
CONSOLETYPE=serial
RUNLEVEL=3
LESSOPEN=||/usr/bin/lesspipe.sh %s
previous=N
UPSTART_EVENTS=runlevel
AWS_PATH=/opt/aws
USER=hadoop
UPSTART_INSTANCE=
PREVLEVEL=N
HADOOP_LOGFILE=syslog
HOSTNAME=ip-10-17-186-102
NLSPATH=/usr/dt/lib/nls/msg/%L/%N.cat
HADOOP_LOG_DIR=/mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3
EC2_AMITOOL_HOME=/opt/aws/amitools/ec2
SHLVL=5
HOME=/home/hadoop
HADOOP_IDENT_STRING=hadoop
INFO redirectOutput to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stdout
INFO redirectError to /mnt/var/log/hadoop/steps/s-3UAS8JQ0KEOV3/stderr
INFO Working dir /mnt/var/lib/hadoop/steps/s-3UAS8JQ0KEOV3
INFO ProcessRunner started child process 24549 :
hadoop 24549 4780 0 21:03 ? 00:00:00 bash /usr/lib/hadoop/bin/hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-mode cluster --class TestRunner s3://jars/TestRunner-assembly-0.0.1-SNAPSHOT.jar
2016-10-20T21:03:21.050Z INFO HadoopJarStepRunner.Runner: startRun() called for s-3UAS8JQ0KEOV3 Child Pid: 24549
INFO Synchronously wait child process to complete : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO waitProcessCompletion ended with exit code 0 : hadoop jar /var/lib/aws/emr/step-runner/hadoop-...
INFO total process run time: 44 seconds
2016-10-20T21:04:03.102Z INFO Step created jobs:
2016-10-20T21:04:03.103Z INFO Step succeeded with exitCode 0 and took 44 seconds
The syslog and stdout are empty, and this is in my stderr:
16/10/20 21:03:20 INFO RMProxy: Connecting to ResourceManager at ip-10-17-186-102.ec2.internal/10.17.186.102:8032
16/10/20 21:03:21 INFO Client: Requesting a new application from cluster with 2 NodeManagers
16/10/20 21:03:21 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (53248 MB per container)
16/10/20 21:03:21 INFO Client: Will allocate AM container, with 53247 MB memory including 4840 MB overhead
16/10/20 21:03:21 INFO Client: Setting up container launch context for our AM
16/10/20 21:03:21 INFO Client: Setting up the launch environment for our AM container
16/10/20 21:03:21 INFO Client: Preparing resources for our AM container
16/10/20 21:03:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
16/10/20 21:03:22 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_libs__5484581201997889110.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_libs__5484581201997889110.zip
16/10/20 21:03:24 INFO Client: Uploading resource s3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/DataScience-assembly-0.0.1-SNAPSHOT.jar
16/10/20 21:03:24 INFO S3NativeFileSystem: Opening 's3://jars/HelloWorld-assembly-0.0.1-SNAPSHOT.jar' for reading
16/10/20 21:03:26 INFO Client: Uploading resource file:/mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4/__spark_conf__5724047842379101980.zip -> hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002/__spark_conf__.zip
16/10/20 21:03:26 INFO SecurityManager: Changing view acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls to: hadoop
16/10/20 21:03:26 INFO SecurityManager: Changing view acls groups to:
16/10/20 21:03:26 INFO SecurityManager: Changing modify acls groups to:
16/10/20 21:03:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
16/10/20 21:03:26 INFO Client: Submitting application application_1476995377469_0002 to ResourceManager
16/10/20 21:03:26 INFO YarnClientImpl: Submitted application application_1476995377469_0002
16/10/20 21:03:27 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:27 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1476997406896
final status: UNDEFINED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:03:28 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:29 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:30 INFO Client: Application report for application_1476995377469_0002 (state: ACCEPTED)
16/10/20 21:03:31 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:31 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.17.181.184
ApplicationMaster RPC port: 0
queue: default
start time: 1476997406896
final status: UNDEFINED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:03:32 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:33 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:34 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:35 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:36 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:37 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:38 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:39 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:40 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:41 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:42 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:43 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:44 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:45 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:46 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:47 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:48 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:49 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:50 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:51 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:52 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:53 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:54 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:55 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:56 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:57 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:58 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:03:59 INFO Client: Application report for application_1476995377469_0002 (state: RUNNING)
16/10/20 21:04:00 INFO Client: Application report for application_1476995377469_0002 (state: FINISHED)
16/10/20 21:04:00 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.17.181.184
ApplicationMaster RPC port: 0
queue: default
start time: 1476997406896
final status: SUCCEEDED
tracking URL: http://ip-10-17-186-102.ec2.internal:20888/proxy/application_1476995377469_0002/
user: hadoop
16/10/20 21:04:00 INFO Client: Deleting staging directory hdfs://ip-10-17-186-102.ec2.internal:8020/user/hadoop/.sparkStaging/application_1476995377469_0002
16/10/20 21:04:00 INFO ShutdownHookManager: Shutdown hook called
16/10/20 21:04:00 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-6fceeedf-0ad5-4df1-a63e-c1d7eb1b95b4
Command exiting with ret '0'
What am I missing?
1 Answer
Looks like your application succeeded just fine. However, there are two reasons why you don't see any output in the step's stdout logs.
1) You ran the application in yarn-cluster mode, which means that the driver runs on a random cluster node rather than on the master node. If you specified an S3 log URI when creating the cluster, you should see the logs for this application in the containers directory of your S3 bucket. The logs for the driver will be in container #0's logs.
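For example, assuming a log URI was configured for the cluster (the bucket and prefix below are placeholders, not taken from your setup), something like this would locate the container logs:

#!/usr/bin/env bash
# Hypothetical log URI; substitute the one configured at cluster creation.
log_uri='s3://my-emr-logs/elasticmapreduce'
cluster_id='j-XXXXXXXXXX'
# List the per-container logs for the application; in cluster mode the
# driver's stdout/stderr live under the first container's directory.
aws s3 ls --recursive "${log_uri}/${cluster_id}/containers/"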
2) You did not call anything like collect() to bring the data from the Spark executors back to the driver, so the println() at the end is not printing the data anyway, just a toString() representation of the RDD. You probably want to do something like .collect().foreach(println) instead.
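A minimal rewrite of the posted driver applying that suggestion (adding sc.stop() is optional but tidy):

import org.apache.spark.{SparkConf, SparkContext}

object TestRunner {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hello World")
    val sc = new SparkContext(conf)
    val words = sc.parallelize(Seq("a", "b", "c", "d", "e"))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    // collect() brings the results back to the driver, so println
    // prints the actual data instead of the RDD's toString().
    wordCounts.collect().foreach(println)
    sc.stop()
  }
}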