YARN Fair Scheduler

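By default, the fair scheduler makes its fairness decisions based on memory only. It can be configured to take both memory and CPU into account.
When only one job is running, that job uses the resources of the whole cluster. When new jobs are submitted, resources are given to them as they are freed, so that eventually every job gets roughly the same share of resources.
This lets short jobs finish in a reasonable time without starving long-running jobs.
The fair scheduler also supports priorities. A priority acts as a weight that determines the share of cluster resources a job receives.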
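To illustrate how weights become shares, the Python sketch below splits one resource (memory) among queues in proportion to their weights. Each queue's share is clamped to its minimum and maximum share. The sketch binary-searches a common weight-to-resource ratio. This approach is loosely modeled on the idea behind the ComputeFairShares helper in the Hadoop source. However, it is a simplified single-resource model, not the scheduler's actual code, and the function name fair_shares is hypothetical.

```python
import math


def fair_shares(total, weights, min_shares=None, max_shares=None, iterations=60):
    """Split `total` among queues in proportion to `weights` (simplified model).

    Each queue gets ratio * weight, clamped into [min_share, max_share];
    the common ratio is binary-searched so the shares add up to `total`.
    """
    n = len(weights)
    mins = list(min_shares) if min_shares is not None else [0.0] * n
    maxs = list(max_shares) if max_shares is not None else [math.inf] * n

    def shares_at(ratio):
        # Weighted share for every queue, clamped to its min/max share.
        return [min(max(ratio * w, lo), hi) for w, lo, hi in zip(weights, mins, maxs)]

    # Double an upper bound for the ratio until the shares cover `total`
    # (or every queue is already capped by its max share).
    upper = 1.0
    while sum(shares_at(upper)) < total and upper < 1e12:
        upper *= 2.0
    lower = 0.0
    for _ in range(iterations):
        mid = (lower + upper) / 2.0
        if sum(shares_at(mid)) < total:
            lower = mid
        else:
            upper = mid
    return shares_at(upper)
```

Here are some example results. With 90 units and weights [2, 1], the shares come out at about [60, 30]. With three equal-weight queues and the first one capped at 10, splitting 100 units gives [10, 45, 45].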

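A queue can be given a guaranteed minimum share of resources, so that particular users or jobs always get enough resources.
The number of jobs running at the same time can be limited per user and per queue. This is mainly useful in two cases: a user submits many jobs at once, or running too many jobs simultaneously would create large amounts of intermediate data or excessive context switching. Limiting the number of running jobs improves overall performance in these cases.
Limiting the number of running jobs does not make submissions fail. Jobs over the limit wait until some of the user's earlier jobs finish, and only then become eligible for scheduling.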
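The running-app limit works as admission control, not rejection. Below is a minimal Python sketch of it. It assumes applications arrive as (app_id, user, queue) tuples in submission order, and all names are hypothetical. An app is runnable only if both its queue and its user are under their maxRunningApps limits. Users without an explicit limit fall back to a default, mirroring userMaxAppsDefault in the allocation file. The sketch is a one-shot snapshot. In the real scheduler, waiting apps become runnable as earlier ones finish.

```python
import math
from collections import Counter


def split_runnable(apps, queue_max, user_max, user_max_default):
    """Split apps into (runnable, waiting) lists (hypothetical helper).

    apps: iterable of (app_id, user, queue) tuples in submission order.
    queue_max / user_max: dicts of per-queue / per-user maxRunningApps.
    """
    running_in_queue = Counter()
    running_for_user = Counter()
    runnable, waiting = [], []
    for app_id, user, queue in apps:
        queue_limit = queue_max.get(queue, math.inf)
        user_limit = user_max.get(user, user_max_default)
        if running_in_queue[queue] < queue_limit and running_for_user[user] < user_limit:
            running_in_queue[queue] += 1
            running_for_user[user] += 1
            runnable.append(app_id)
        else:
            # Over a limit: the app is kept (not rejected) and waits its turn.
            waiting.append(app_id)
    return runnable, waiting
```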

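Queues are hierarchical. The hierarchy is configured in the allocation file.
All queues descend from the root queue. A parent queue's resources are divided among its child queues in the fair-scheduling manner, and only leaf queues can run jobs.
A queue's full name is formed by joining the names of its ancestors with dots, e.g. root.parent1.queue1.
You can extend org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.SchedulingPolicy to implement your own scheduling policy.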
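The small Python sketch below illustrates the naming rule and the leaf-only restriction. It walks a queue tree and returns the full dotted names of the leaf queues, which are the only queues that accept applications. As an assumption, the tree is modeled as nested dicts that map each child name to its own subtree. The function name is hypothetical.

```python
def leaf_queue_names(children, prefix="root"):
    """Return full dotted names of the leaf queues under `prefix`.

    children: dict mapping a child queue name to its own children dict;
    an empty dict marks a leaf queue (hypothetical model of the hierarchy).
    """
    if not children:
        return [prefix]
    names = []
    for name, grandchildren in children.items():
        names.extend(leaf_queue_names(grandchildren, f"{prefix}.{name}"))
    return names
```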

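Administration:
1. The allocation file can be edited while the cluster is running. The scheduler reloads it automatically 10-15 seconds after it changes. Settings that can be changed this way include minimum shares, limits, weights, preemption timeouts, etc.
2. The page http://ResourceManager URL/cluster/scheduler shows the current scheduling state, including:
the resources used by containers in each queue; the number of active applications (those with at least one container, roughly comparable to the slot concept?); the number of pending applications (those without any container yet);
each queue's guaranteed minimum resources; the maximum resources each queue may use; etc.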

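Allocation file configuration
minResources: the minimum resources guaranteed to the queue, in the format "X mb, Y vcores". In single-resource (memory-only) fair scheduling mode, vcores is ignored.
If a queue's memory usage is below this minimum, the queue is offered resources before other queues.
Under dominant resource fairness, a queue is considered unsatisfied if its usage for its dominant resource with respect to the cluster capacity is below its minimum share for that resource.
When several queues are unsatisfied, the one with the smallest ratio of used resources to minimum resources is served first.
maxResources: the maximum resources the queue may use.
maxRunningApps: the maximum number of applications that may run concurrently in the queue.
schedulingPolicy: the queue's scheduling policy, one of fifo, fair or drf. With fifo, applications are scheduled in submission order, and later applications get resources only when the cluster has capacity left over.
aclSubmitApps: which users may submit applications to the queue, as a comma-separated list. Groups, if any, follow after a space, in the same format as yarn.admin.acl. Child queues inherit their parent's ACL.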
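aclAdministerApps: which users/groups may administer the queue's applications (for example, kill them). The format is the same as aclSubmitApps. According to the yarn.admin.acl notes below, this setting has no effect while yarn.admin.acl is left at *.
minSharePreemptionTimeout: the number of seconds a queue may stay below its minimum share before it tries to preempt containers from other queues.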
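The minResources ordering rules can be expressed as a sort key. Below is a minimal Python sketch. It assumes memory is the only resource, and all names are hypothetical. Queues below their minimum share come first, ordered by usage / minimum share. The remaining queues follow, ordered by usage / weight, which is the weighted fair-share comparison. The tie-break on queue name is an assumption added only for determinism. The scheduler's real comparator differs in details.

```python
from dataclasses import dataclass


@dataclass
class QueueUsage:
    name: str
    used_mb: int
    min_share_mb: int
    weight: float = 1.0  # assumed positive


def scheduling_order(queues):
    """Return queue names in the order they would be offered free resources."""
    def sort_key(q):
        if q.used_mb < q.min_share_mb:
            # Unsatisfied queue: the smaller used/minShare ratio goes first.
            return (0, q.used_mb / q.min_share_mb, q.name)
        # Satisfied queue: compare usage relative to weight.
        return (1, q.used_mb / q.weight, q.name)
    return [q.name for q in sorted(queues, key=sort_key)]
```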

User elements, which represent settings governing the behavior of individual users. They can contain a single property: maxRunningApps, a limit on the number of running apps for a particular user.
A userMaxAppsDefault element, which sets the default running app limit for any users whose limit is not otherwise specified.
A fairSharePreemptionTimeout element, number of seconds a queue is under its fair share before it will try to preempt containers to take resources from other queues.
A defaultQueueSchedulingPolicy element, which sets the default scheduling policy for queues; overridden by the schedulingPolicy element in each queue if specified. Defaults to "fair".
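minSharePreemptionTimeout and fairSharePreemptionTimeout both measure how long a queue has been starved. Below is a minimal Python sketch with hypothetical names. It assumes the queue records the last time it was at or above each share. Once either timeout has elapsed, the queue becomes a candidate for preempting containers elsewhere. Preemption only happens when yarn.scheduler.fair.preemption is true. Also, some Hadoop versions compare against a fraction of the fair share rather than the full fair share. Treat this sketch as the idea, not the exact rule.

```python
from dataclasses import dataclass


@dataclass
class StarvationTracker:
    min_share_timeout_s: float   # minSharePreemptionTimeout
    fair_share_timeout_s: float  # fairSharePreemptionTimeout
    # Assumption: the queue counts as satisfied at time 0.
    last_at_min_share: float = 0.0
    last_at_fair_share: float = 0.0

    def update(self, now, used, min_share, fair_share):
        """Call periodically; remembers when the queue was last satisfied."""
        if used >= min_share:
            self.last_at_min_share = now
        if used >= fair_share:
            self.last_at_fair_share = now

    def wants_to_preempt(self, now):
        """True once the queue has been starved for longer than either timeout."""
        return (now - self.last_at_min_share > self.min_share_timeout_s
                or now - self.last_at_fair_share > self.fair_share_timeout_s)
```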

Installation

Configure yarn-site.xml as follows:

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>{HADOOP_CONF_DIR}/fair-allocation.xml</value>
  <description>fair-allocation.xml holds the per-queue configuration</description>
</property>
<property>
  <name>yarn.scheduler.fair.user-as-default-queue</name>
  <description>Whether to use the submitting user's name as the queue name. The default is true, so if you do not want one queue per user you must explicitly set this to false.</description>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.preemption</name>
  <description>Whether preemption is enabled. If a queue has not reached its configured minimum resources, applications submitted to it may preempt resources from tasks running in other queues. Setting this to true is not recommended: it wastes cluster resources, and the feature is still being refined. Defaults to false.</description>
  <value>false</value>
</property>
<property>
  <name>yarn.scheduler.fair.sizebasedweight</name>
  <description>Use application size as the scheduling weight. Defaults to false.</description>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.assignmultiple</name>
  <description>Whether a single heartbeat may assign multiple containers. Defaults to false.</description>
  <value>true</value>
</property>
<property>
  <name>yarn.scheduler.fair.max.assign</name>
  <description>If multiple containers may be assigned per heartbeat, the maximum number assigned in one heartbeat. Defaults to -1, meaning no limit.</description>
  <value>-1</value>
</property>
<property>
  <name>yarn.scheduler.fair.locality.threshold.node</name>
  <description>How many scheduling opportunities the fair scheduler may pass up while waiting for a node-local assignment, as a fraction of the cluster size. For example, with 100 nodes and 0.1, an application may pass up 10 opportunities while waiting for a data-local assignment; after more than 10 it is scheduled anyway.</description>
  <value>0.1</value>
</property>
<property>
  <name>yarn.scheduler.fair.locality.threshold.rack</name>
  <description>Like yarn.scheduler.fair.locality.threshold.node, but for rack-local assignment.</description>
  <value>0.1</value>
</property>
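The locality thresholds are fractions of the cluster size. The simplified Python sketch below shows the arithmetic behind the "100 nodes, 0.1, wait 10 times" example. The function name is hypothetical. The sketch assumes a single counter of missed scheduling opportunities and treats the two thresholds as consecutive waiting budgets. This approximates the scheduler's per-priority bookkeeping but is not identical to it. A negative threshold (Hadoop's default is -1.0) means no waiting at that level.

```python
def allowed_locality(missed, num_nodes, node_threshold, rack_threshold):
    """Most relaxed locality level a request may accept after `missed` skips."""
    # Negative thresholds mean "do not wait at this level".
    node_wait = node_threshold * num_nodes if node_threshold >= 0 else -1
    rack_wait = rack_threshold * num_nodes if rack_threshold >= 0 else -1
    if missed <= node_wait:
        return "NODE_LOCAL"
    if missed <= max(node_wait, 0) + rack_wait:
        return "RACK_LOCAL"
    return "OFF_SWITCH"
```

With the values configured above on a 100-node cluster, a request waits for a node-local slot for up to 10 missed opportunities, then for a rack-local slot for up to 10 more.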

http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/FairScheduler.html

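yarn.nodemanager.resource.memory-mb: the total physical memory on this node that YARN may use, default 8192 (MB). YARN does not detect the node's physical memory by itself, so this must be set manually.
yarn.nodemanager.resource.cpu-vcores: the number of virtual CPUs on this node that YARN may use, default 8. It is recommended to set this to the number of physical CPU cores.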

A simple fair-allocation.xml example:

<?xml version="1.0"?>
<allocations>
  <queue name="cdh5">
    <minResources>100 mb,1vcores</minResources>
    <maxResources>90000 mb,8vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>
  <queue name="ocnosql">
    <minResources>100 mb,1vcores</minResources>
    <maxResources>300 mb,2vcores</maxResources>
    <maxRunningApps>10</maxRunningApps>
    <weight>1.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
  </queue>

  <user name="cdh5">
    <maxRunningApps>30</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>

  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="primaryGroup" create="false" />
    <rule name="default" />
  </queuePlacementPolicy>
</allocations>

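This defines two queues, cdh5 and ocnosql, with different resource limits. With the queuePlacementPolicy above, an application goes to the queue it explicitly requests, if any. Otherwise it goes to the queue named after the submitting user's primary group, if such a queue exists. Failing both, it goes to the default queue.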
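The three placement rules can be sketched in Python as follows. The function name is hypothetical, and the rule semantics follow the reading given in the paragraph above. An unspecified queue is treated as "default". A "specified" queue name is prefixed with root. when needed and passed through as-is. The real scheduler also checks ACLs and may create the requested queue (the create attribute of "specified" is not set in the example, so its default applies). The sketch does not model either of those behaviors.

```python
def place_application(requested_queue, primary_group, existing_queues):
    """Return the full queue name chosen by the rules, evaluated in order.

    requested_queue: queue named at submission ("" or "default" if none).
    existing_queues: set of full queue names defined in the allocation file.
    """
    # Rule "specified": honour an explicitly requested queue.
    if requested_queue and requested_queue != "default":
        if requested_queue.startswith("root."):
            return requested_queue
        return "root." + requested_queue
    # Rule "primaryGroup" with create="false": only an existing queue counts.
    group_queue = "root." + primary_group
    if group_queue in existing_queues:
        return group_queue
    # Rule "default": everything else goes to root.default.
    return "root.default"
```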

In Hive, choose the queue with set mapred.job.queue.name=<queue name> (or start Hive with hive --hiveconf mapreduce.job.queuename=queue1).

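About ACLs:
A child queue inherits its parent's ACL. Even if the child queue sets its own ACL, the effective ACL is the union of its own ACL and its parent's. The root queue is the parent of every queue, and by default its ACL lets all users submit and administer applications. As a result, ACLs set on child queues have no effect until the root queue's ACL is set explicitly.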

<?xml version="1.0"?>
<allocations>
  <queue name="root">
    <minResources>10000mb,10vcores</minResources>
    <maxResources>90000mb,100vcores</maxResources>
    <maxRunningApps>50</maxRunningApps>
    <weight>2.0</weight>
    <schedulingPolicy>fair</schedulingPolicy>
    <aclSubmitApps> </aclSubmitApps>
    <aclAdministerApps> </aclAdministerApps>

    <queue name="cdh5">
      <minResources>100 mb,1vcores</minResources>
      <maxResources>5000 mb,5vcores</maxResources>
      <maxRunningApps>50</maxRunningApps>
      <weight>2.0</weight>
      <schedulingPolicy>fair</schedulingPolicy>
      <aclAdministerApps>cdh5 cdh5</aclAdministerApps>
      <aclSubmitApps>cdh5 cdh5</aclSubmitApps>
      <!--
      <queue name="sample_sub_queue">
        <aclSubmitApps>cdh5</aclSubmitApps>
        <minResources>5000 mb,0vcores</minResources>
        <maxResources>9000 mb,4vcores</maxResources>
      </queue>
      -->
    </queue>
    <queue name="ocnosql">
      <minResources>100 mb,1vcores</minResources>
      <maxResources>2000 mb,2vcores</maxResources>
      <maxRunningApps>10</maxRunningApps>
      <weight>1.0</weight>
      <schedulingPolicy>fair</schedulingPolicy>
      <aclAdministerApps>ocnosql</aclAdministerApps>
      <aclSubmitApps>ocnosql</aclSubmitApps>
    </queue>
    <queue name="default">
      <minResources>100 mb,1vcores</minResources>
      <maxResources>6000 mb,2vcores</maxResources>
      <maxRunningApps>10</maxRunningApps>
      <weight>1.0</weight>
      <schedulingPolicy>fair</schedulingPolicy>
    </queue>
  </queue>

  <user name="cdh5">
    <maxRunningApps>30</maxRunningApps>
  </user>
  <userMaxAppsDefault>5</userMaxAppsDefault>

  <queuePlacementPolicy>
    <rule name="specified" />
    <rule name="primaryGroup" create="false" />
    <rule name="default" />
  </queuePlacementPolicy>
</allocations>

Running hadoop queue -showacls as user cdh5 gives:
Queue acls for user : cdh5

Queue Operations =====================
root
root.cdh5 ADMINISTER_QUEUE,SUBMIT_APPLICATIONS
root.default
root.ocnosql
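The union rule from the ACL notes can be expressed as a walk from the queue up to root. Below is a minimal Python sketch with hypothetical names. It assumes ACLs are already parsed into (users, groups) sets. It also assumes that a queue with no ACL configured grants nothing, except root, which grants everyone; this matches the behavior described above. The sketch shows why root's ACL must be narrowed: with no ACLs configured at all, any user passes through root. With root set to a single space (no one), as in the example, only cdh5 gets access to root.cdh5, matching the showacls output.

```python
EVERYONE = (frozenset({"*"}), frozenset())
NOBODY = (frozenset(), frozenset())


def has_queue_access(queue, user, user_groups, acls):
    """True if `user` matches the ACL of `queue` or of any ancestor queue.

    acls: dict of full queue name -> (users, groups) sets; "*" in users means anyone.
    """
    parts = queue.split(".")
    for depth in range(len(parts), 0, -1):
        name = ".".join(parts[:depth])
        # Assumption: unset ACLs grant everyone on root and no one elsewhere.
        users, groups = acls.get(name, EVERYONE if name == "root" else NOBODY)
        if "*" in users or user in users or set(groups) & set(user_groups):
            return True
    return False
```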

Issues:
The user who submits a job must also exist on the ResourceManager host. Otherwise the submission fails with:
WARN security.UserGroupInformation: No groups available for user

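Used resources (the per-container settings are in mapred-site.xml):
memory = mapreduce.map.memory.mb * (number of map containers) + mapreduce.reduce.memory.mb * (number of reduce containers) + yarn.app.mapreduce.am.resource.mb * (number of ApplicationMasters)
CPU cores = mapreduce.map.cpu.vcores * (number of map containers) + mapreduce.reduce.cpu.vcores * (number of reduce containers) + yarn.app.mapreduce.am.resource.cpu-vcores * (number of ApplicationMasters)
(Note these two parameters:
yarn.scheduler.increment-allocation-mb is the memory normalization unit, default 1024. A container request of 1.5 GB is therefore normalized by the scheduler to ceiling(1.5 GB / 1 GB) * 1 GB = 2 GB.
yarn.scheduler.minimum-allocation-mb is the minimum allocation. If any of the memory settings above is smaller than this value, this value is used instead.)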
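The memory formula and the two normalization rules can be checked with a small Python sketch. Only memory is modeled, and the function names are hypothetical. The sketch assumes the minimum is applied before rounding up. When the minimum is a multiple of the increment, as with the defaults, this gives the same result as the other order.

```python
def normalize_mb(request_mb, minimum_mb=1024, increment_mb=1024):
    """Raise a request to minimum_mb, then round up to a multiple of increment_mb."""
    value = max(request_mb, minimum_mb)
    return -(-value // increment_mb) * increment_mb  # integer ceiling division


def job_memory_mb(maps, reduces, ams, map_mb, reduce_mb, am_mb,
                  minimum_mb=1024, increment_mb=1024):
    """Total memory of a MapReduce job's containers after normalization."""
    return (maps * normalize_mb(map_mb, minimum_mb, increment_mb)
            + reduces * normalize_mb(reduce_mb, minimum_mb, increment_mb)
            + ams * normalize_mb(am_mb, minimum_mb, increment_mb))
```

For example, 10 maps at 1536 MB, 2 reduces at 3072 MB and one ApplicationMaster at 1536 MB count as 10 * 2048 + 2 * 3072 + 1 * 2048 = 28672 MB.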

yarn.admin.acl defaults to *. It must be changed to a specific user (or perhaps left empty?); otherwise the aclAdministerApps settings have no effect.

<property>
  <name>yarn.acl.enable</name>
  <value>true</value>
  <description>Enable ACLs? Defaults to false.</description>
</property>
<property>
  <name>yarn.admin.acl</name>
  <value>cdh5</value>
  <description>ACL to set admins on the cluster. ACLs are of the form comma-separated-users space comma-separated-groups. Defaults to special value of * which means anyone. Special value of just space means no one has access.</description>
</property>
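The ACL string format in the description above can be parsed as in the minimal Python sketch below. The function name is hypothetical. The sketch follows only the rules stated in that description: users come before the first space, groups come after it, * means anyone, and a lone space means no one. Hadoop's own AccessControlList class handles more edge cases.

```python
def parse_acl(acl):
    """Split 'user1,user2 group1,group2' into (users, groups) sets."""
    if acl.strip() == "*":
        return {"*"}, set()  # special value: anyone
    users_part, _, groups_part = acl.partition(" ")
    users = {u.strip() for u in users_part.split(",") if u.strip()}
    groups = {g.strip() for g in groups_part.split(",") if g.strip()}
    return users, groups  # a lone space yields two empty sets: no one
```

For example, the value "cdh5 cdh5" used for the cdh5 queue means user cdh5 plus group cdh5.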

Updated by 杨启虎

秒客网

YARN Fair Scheduler [reposted from the AIMP platform wiki]
