I have a CDH5 lab environment with 6 worker nodes, node[1-6], and node7 as the NameNode.
node[1-5]: 8 GB RAM, 2 cores
node[6]: 32 GB RAM, 8 cores
I am new to Spark and am simply trying to count the number of lines in our data. I have uploaded the data (5.3 GB) to HDFS. When I submit my Spark job, it only runs 2 executors, and I can see it splits the work into 161 tasks (there are 161 files in the directory).
In the code, I read all the files and count the lines:
# Read every file under the HDFS directory into a single RDD of lines
data_raw = sc.textFile(path)
# Count the total number of lines across all files
print data_raw.count()
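To confirm where the 161 tasks come from, you can check how many partitions textFile created; with 161 files that are each smaller than an HDFS block, one partition per file is expected. A small sketch, assuming the same sc and path as above:

data_raw = sc.textFile(path)
# Each input file smaller than an HDFS block becomes one partition,
# so this should print 161, matching the 161 tasks shown in the Spark UI.
print data_raw.getNumPartitions()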
On the CLI: spark-submit --master yarn-client file_name.py --num-executors 6 --executor-cores 1
I expected it to run with 6 executors, each running one task at a time, but I only see 2 executors running. I am not able to figure out the cause.
Any help would be greatly appreciated.
2 Answers
#1
The correct way to submit the job is: spark-submit --num-executors 6 --executor-cores 1 --master yarn-client file_name.py. Now it is showing all the other executors.
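The reason the original command only got 2 executors is that spark-submit treats everything after the application file as arguments to the application itself, so --num-executors 6 --executor-cores 1 never reached YARN and the default of 2 executors was used. As a hypothetical check you could drop into file_name.py to see this:

import sys
# With the original ordering, this prints
# ['file_name.py', '--num-executors', '6', '--executor-cores', '1'],
# showing that the flags were handed to the script instead of to spark-submit.
print sys.argv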
#2
I suspect only 2 nodes are running Spark. Go to Cloudera Manager -> Clusters -> Spark -> Instances to confirm.
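If you cannot get to Cloudera Manager, a rough way to see which hosts actually run executors is to run a small job from the PySpark shell and collect the hostnames the tasks ran on. A minimal sketch, assuming an existing SparkContext sc:

import socket
# Run 100 tiny tasks and record the hostname each one executes on;
# the distinct list shows which nodes are actually running Spark executors.
hosts = sc.parallelize(range(100), 100).map(lambda _: socket.gethostname()).distinct().collect()
print(hosts)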