Running a job on Spark 0.9.0 throws an error

Time: 2022-04-18 23:11:41

I have an Apache Spark 0.9.0 cluster installed, on which I am trying to deploy code that reads a file from HDFS. This piece of code throws a warning and eventually the job fails. Here is the code:

/**
 * running the code would fail 
 * with a warning 
 * Initial job has not accepted any resources; check your cluster UI to ensure that 
 * workers are registered and have sufficient memory
 */

import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
    val sconf = new SparkConf()
      .setMaster("spark://labscs1:7077")
      .setAppName("spark scala")
    val sctx = new SparkContext(sconf)
    sctx.parallelize(1 to 100).count
}

Below is the WARNING message:

Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

How do I get rid of this? Or am I missing some configuration?

7 solutions

#1


5  

You get this when either the number of cores or the amount of RAM (per node) you request, via spark.cores.max and spark.executor.memory respectively, exceeds what is available. So even if no one else is using the cluster, if you ask for, say, 100 GB of RAM per node but your nodes can only provide 90 GB, you will get this error message.

To be fair, the message is vague in this situation; it would be more helpful if it said you are exceeding the maximum.

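If that is the cause, one way to avoid it is to explicitly cap what the application requests at what the workers actually offer. Below is a minimal sketch based on the question's code; the values 4 cores and 2g are placeholders you would replace with whatever the master UI reports as available.

import org.apache.spark.{SparkConf, SparkContext}

object Main extends App {
    val sconf = new SparkConf()
      .setMaster("spark://labscs1:7077")
      .setAppName("spark scala")
      // Placeholder values: read the real free cores / free memory off the master UI first.
      .set("spark.cores.max", "4")          // total cores this application may claim
      .set("spark.executor.memory", "2g")   // per-executor memory; must fit inside a worker's free RAM
    val sctx = new SparkContext(sconf)
    println(sctx.parallelize(1 to 100).count)
    sctx.stop()
}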

#2


2  

It looks like the Spark master can't assign any workers to this task. Either the workers aren't started or they are all busy.

Check the Spark UI on the master node (the port is specified by SPARK_MASTER_WEBUI_PORT in spark-env.sh, 8080 by default). It should look like this: [screenshot of the Spark master web UI showing registered workers]

For the cluster to function properly, the following must hold (a quick programmatic check is sketched after this list):

  • There must be some workers with state "Alive"
  • There must be some cores available (for example, if all cores are busy with a frozen task, the cluster won't accept new tasks)
  • There must be sufficient memory available
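
If you prefer to check these conditions from code rather than by eye, the standalone master's web UI usually also serves its status as JSON under /json (this endpoint and the default port 8080 are assumptions; verify them against your installation). A rough sketch:

import scala.io.Source

// Assumes the master's web UI also exposes its status as JSON under /json;
// adjust host and port to match your SPARK_MASTER_WEBUI_PORT.
object MasterStatus extends App {
    val src = Source.fromURL("http://labscs1:8080/json")
    try {
        // Look at the worker entries: their state should be ALIVE and they
        // should report free cores and free memory.
        println(src.mkString)
    } finally {
        src.close()
    }
}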

#3


2  

Also make sure your Spark workers can communicate with the driver in both directions. Check for firewalls, etc.

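Since executors open connections back to the driver, it can help to pin the driver's host and port to known values so the corresponding firewall rules can be opened. A small sketch using the standard spark.driver.host and spark.driver.port properties; the address and port below are placeholders for your own setup:

import org.apache.spark.SparkConf

// Placeholders: use the driver machine's address as seen from the workers
// and a port you have opened in the firewall.
val conf = new SparkConf()
  .setMaster("spark://labscs1:7077")
  .setAppName("spark scala")
  .set("spark.driver.host", "192.168.1.10")  // address the workers can reach
  .set("spark.driver.port", "51000")         // fixed port instead of a random one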

#4


2  

I had this exact issue. I had a simple 1-node Spark cluster and was getting this error when trying to run my Spark app.

I ran through some of the suggestions above, and it was when I tried to run the Spark shell against the cluster and couldn't see it in the UI that I became suspicious that my cluster was not working correctly.

In my hosts file I had an entry, let's say SparkNode, that referenced the correct IP Address.

I had inadvertently put the wrong IP Address in the conf/spark-env.sh file against the SPARK_MASTER_IP variable. I changed this to SparkNode and I also changed SPARK_LOCAL_IP to point to SparkNode.

To test this I opened up the UI using SparkNode:7077 in the browser and I could see an instance of Spark running.

I then used Wildfire's suggestion of running the Spark shell, as follows:

MASTER=spark://SparkNode:7077 bin/spark-shell

Going back to the UI I could now see the Spark shell application running, which I couldn't before.

So I exited the Spark shell and ran my app using Spark Submit and it now works correctly.

It is definitely worth checking all of your IP and host entries; this was the root cause of my problem.

#5


0  

You need to specify the right SPARK_HOME and your driver program's IP address, in case Spark is not able to locate your Netty jar server. Be aware that your Spark master should listen on the correct IP address, the one you intend to use. This can be done by setting SPARK_MASTER_IP=yourIP in the file spark-env.sh.

val conf = new SparkConf()
  .setAppName("test")
  .setMaster("spark://yourSparkMaster:7077")
  .setSparkHome("YourSparkHomeDir")
  .set("spark.driver.host", "YourIPAddr")

#6


0  

Check for errors regarding hostname, IP address, and loopback. Make sure to set SPARK_LOCAL_IP and SPARK_MASTER_IP.

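One quick way to spot the loopback problem described here is to check what the local hostname resolves to. The sketch below (AddressCheck is just an illustrative name) prints the resolved address and warns if it is a loopback address, in which case fixing /etc/hosts or setting SPARK_LOCAL_IP (and SPARK_MASTER_IP on the master) to a real interface address should help:

import java.net.InetAddress

// If the local hostname resolves to a loopback address, remote workers cannot
// reach the driver/master under that name.
object AddressCheck extends App {
    val local = InetAddress.getLocalHost
    println(s"${local.getHostName} resolves to ${local.getHostAddress}")
    if (local.isLoopbackAddress)
        println("WARNING: hostname resolves to a loopback address; check /etc/hosts or set SPARK_LOCAL_IP")
}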

#7


0  

I had a similar issue ("Initial job has not accepted any resources") and fixed it by specifying the correct Spark download URL in spark-env.sh, or by installing Spark on all the slaves.

export SPARK_EXECUTOR_URI=http://mirror.fibergrid.in/apache/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
