Few executors running on the cluster

Time: 2022-04-18 23:11:47

I have a CDH5 lab environment with 6 worker nodes, node[1-6], plus node7 as the NameNode. node[1-5]: 8 GB RAM, 2 cores; node[6]: 32 GB RAM, 8 cores. I am new to Spark and am simply trying to count the number of lines in our data. I have uploaded the data to HDFS (5.3 GB). When I submit my Spark job, it only runs 2 executors, and I can see it splitting the work into 161 tasks (there are 161 files in the directory).


In the code, I read all the files and count the lines in them.


data_raw = sc.textFile(path)  # read every file under the HDFS directory as an RDD of lines
print(data_raw.count())       # total line count across all files

On the CLI: spark-submit --master yarn-client file_name.py --num-executors 6 --executor-cores 1


It should run with 6 executors, each running 1 task at a time, but I only see 2 executors running. I cannot figure out the cause.


Any help would be greatly appreciated.


2 solutions

#1


The correct way to submit the job is: spark-submit --num-executors 6 --executor-cores 1 --master yarn-client file_name.py. All spark-submit options must come before the application file; anything placed after file_name.py is passed as arguments to the script itself, so --num-executors and --executor-cores were being silently ignored. Now it shows all the other executors.
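
As a script-side alternative, the same settings can be applied via SparkConf (a minimal sketch, assuming CDH5-era PySpark 1.x where the yarn-client master string is valid, dynamic allocation is disabled, and path is the HDFS directory from the question):

from pyspark import SparkConf, SparkContext

# Request 6 executors with 1 core each from inside the script;
# spark.executor.instances is the config key behind --num-executors.
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("line_count")
        .set("spark.executor.instances", "6")
        .set("spark.executor.cores", "1"))
sc = SparkContext(conf=conf)

data_raw = sc.textFile(path)
print(data_raw.count())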


#2


I suspect only 2 nodes are running Spark. Go to Cloudera Manager -> Clusters -> Spark -> Instances to confirm.
