OpenStack nova-scheduler Source Code Analysis: Filters/Weighting

Date: 2021-01-03 02:11:28

Preface

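This article documents how nova-scheduler works as the scheduler when OpenStack creates instances, and how that logic is implemented in code.
In OpenStack, multiple instances share one host rather than each occupying a host exclusively, so a scheduler is needed to coordinate and manage how resources are allocated among instances.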

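Scheduler

The scheduler decides which host an instance runs on.
Nova currently implements the following schedulers:

  • ChanceScheduler (random scheduler): picks a host at random from all Host Nodes on which the nova-compute service is running normally.

  • FilterScheduler (filter scheduler): chooses the best Host Node for the instance according to the configured filters and weights.

  • CachingScheduler (caching scheduler): a variant of FilterScheduler that additionally caches host resource information in local memory and refreshes it from the database through a periodic background task.

To make extension easier, Nova extracts the interfaces every scheduler must implement into nova.scheduler.driver.Scheduler. A custom scheduler only needs to inherit from this class and implement those interfaces.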
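As a rough illustration of that extension point, the sketch below mimics the pattern with a local stand-in base class. The `Scheduler` class here is a simplified stand-in, not the real nova.scheduler.driver.Scheduler, and `FirstHostScheduler` and its `hosts` argument are hypothetical names used only for this example.

```python
import abc


class Scheduler(abc.ABC):
    """Simplified stand-in for nova.scheduler.driver.Scheduler (assumption)."""

    @abc.abstractmethod
    def select_destinations(self, context, request_spec, filter_properties):
        """Return a list of dicts with 'host', 'nodename' and 'limits' keys."""


class FirstHostScheduler(Scheduler):
    """Hypothetical custom scheduler: always takes the first known hosts."""

    def __init__(self, hosts):
        self.hosts = list(hosts)

    def select_destinations(self, context, request_spec, filter_properties):
        num = request_spec.get('num_instances', 1)
        if num > len(self.hosts):
            raise RuntimeError('There are not enough hosts available.')
        return [{'host': h, 'nodename': h, 'limits': {}}
                for h in self.hosts[:num]]


driver = FirstHostScheduler(['node1', 'node2'])
dests = driver.select_destinations(None, {'num_instances': 1}, {})
```

Because the base class declares the method as abstract, a subclass that forgets to implement it cannot be instantiated, which is one simple way to enforce the "must implement these interfaces" rule.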

Note that different schedulers cannot coexist. The scheduler_driver option in /etc/nova/nova.conf selects which one is used; the default is FilterScheduler.

vim /etc/nova/nova.conf

scheduler_driver = nova.scheduler.filter_scheduler.FilterScheduler

FilterScheduler workflow

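FilterScheduler first applies the configured Filters to keep only the hosts that meet the conditions (e.g., memory usage below 2%). It then computes a weight for each remaining host and sorts the list by weight to obtain the best host.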
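The two-step flow can be sketched in a few lines. This is a minimal illustration and not Nova code: hosts are plain dicts, and the `schedule` function, the lambda filters, and the field names are all assumptions made for the example.

```python
def schedule(hosts, filters, weighers):
    """Filter hosts, then rank the survivors by summed weight, best first."""
    candidates = [h for h in hosts if all(f(h) for f in filters)]
    return sorted(candidates,
                  key=lambda h: sum(w(h) for w in weighers),
                  reverse=True)


hosts = [
    {'name': 'node1', 'free_ram_mb': 2048, 'enabled': True},
    {'name': 'node2', 'free_ram_mb': 8192, 'enabled': True},
    {'name': 'node3', 'free_ram_mb': 16384, 'enabled': False},
]
# Step 1: conditions a host must satisfy.
filters = [lambda h: h['enabled'], lambda h: h['free_ram_mb'] >= 1024]
# Step 2: how to score the hosts that remain.
weighers = [lambda h: h['free_ram_mb']]

ranked = schedule(hosts, filters, weighers)
best = ranked[0]['name']
```

Here node3 has the most free memory but is removed by the first filter, so the ranking only ever considers node1 and node2.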

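Filters

Filtering first removes the hosts whose currently available resources cannot satisfy the instance's requirements, and then applies the Filters named in the configuration file to remove hosts that fail their conditions. The result of filtering is a list of hosts.

To do this, nova-scheduler needs the latest resource usage of every host from the database. Collecting and storing that data is handled by the database synchronization mechanism defined in nova-compute. However, nova-compute updates the database only periodically, while nova-scheduler needs up-to-date data when choosing the best host. nova-scheduler therefore keeps its own copy in nova.scheduler.host_manager:HostState. This copy lives only in the memory of the current process and includes the changes to host resources since the last database update, so it reflects the latest host resource data. To keep it current, nova-scheduler updates this copy every time it schedules an instance, subtracting the resources the new virtual machine uses from the host's available resources.
Note: the data nova-scheduler maintains is never synchronized back to the database. It only flows from the database into nova-scheduler, so nova-scheduler has no database write path.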
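The one-way data flow described above can be sketched as follows. `HostStateSketch` is a hypothetical, trimmed-down stand-in for HostState, and the `db_row` dict stands in for a database record; neither is Nova code.

```python
class HostStateSketch(object):
    """Hypothetical, trimmed-down stand-in for HostState (not Nova code)."""

    def __init__(self, host):
        self.host = host
        self.free_ram_mb = 0
        self.free_disk_mb = 0
        self.num_instances = 0

    def update_from_db(self, row):
        """Overwrite the cached values with a fresh database row."""
        self.free_ram_mb = row['free_ram_mb']
        self.free_disk_mb = row['free_disk_mb']

    def consume(self, ram_mb, disk_mb):
        """Subtract a new instance's resources from the in-memory copy only."""
        self.free_ram_mb -= ram_mb
        self.free_disk_mb -= disk_mb
        self.num_instances += 1


db_row = {'free_ram_mb': 4096, 'free_disk_mb': 20480}  # stand-in for the DB
state = HostStateSketch('node1')
state.update_from_db(db_row)                 # database -> scheduler memory
state.consume(ram_mb=1024, disk_mb=10240)    # never written back
```

After `consume()` the in-memory view is ahead of the database record, which stays untouched until nova-compute performs its next periodic update.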

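Filter types

  • AllHostsFilter: performs no filtering
  • RamFilter: filters by available memory
  • ComputeFilter: selects all hosts that are active
  • TrustedFilter: selects all trusted hosts
  • PciPassthroughFilter: selects hosts that provide PCI SR-IOV support

All Filter implementations live in the nova/scheduler/filters directory, and every Filter must inherit from nova.scheduler.filters.BaseHostFilter. To write a custom Filter, inherit from this class and implement a single function, host_passes(), which returns only True or False.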
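A custom filter might look roughly like the sketch below. To keep it runnable without Nova installed, `BaseHostFilter` is a simplified local stand-in for the real base class, `MinFreeRamFilter` is a hypothetical filter, and host states are plain dicts rather than HostState objects.

```python
class BaseHostFilter(object):
    """Simplified stand-in for nova.scheduler.filters.BaseHostFilter."""

    def host_passes(self, host_state, filter_properties):
        raise NotImplementedError()


class MinFreeRamFilter(BaseHostFilter):
    """Hypothetical filter: pass hosts with enough free RAM for the flavor."""

    def host_passes(self, host_state, filter_properties):
        requested = filter_properties['instance_type']['memory_mb']
        return host_state['free_ram_mb'] >= requested


host_states = [
    {'host': 'node1', 'free_ram_mb': 512},
    {'host': 'node2', 'free_ram_mb': 4096},
]
filter_properties = {'instance_type': {'memory_mb': 2048}}

ram_filter = MinFreeRamFilter()
passing = [h['host'] for h in host_states
           if ram_filter.host_passes(h, filter_properties)]
```

The filter itself stays tiny because the decision is a single boolean per host; iterating over hosts is left to the caller.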

Specifying Filters in the configuration file

scheduler_available_filters=
scheduler_default_filters=

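Weighting

Weighting computes a weight for every host that passed the Filters and sorts the hosts by it to find the best one. Computing the weights invokes each configured Weigher module to obtain a weight value per host.

All Weigher implementations live in the nova/scheduler/weights directory.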

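Source code walkthrough

Key files and what they do

  • /nova/scheduler/driver.py: the most important item in this file is the Scheduler class, the base class that every scheduler implementation must inherit from. It contains all the interfaces a scheduler must implement.

  • /nova/scheduler/manager.py: mainly implements the SchedulerManager class, which defines host management operations, for example delete_instance_info, which removes an instance from a host's records.

  • /nova/scheduler/host_manager.py: implements two classes that describe scheduler-related host operations. HostState maintains an up-to-date copy of host resource data. HostManager provides the scheduler-related operations, e.g. _choose_host_filters/get_filtered_hosts/get_weighed_hosts.

  • /nova/scheduler/chance.py: contains only the ChanceScheduler class (random scheduler), which inherits from Scheduler and picks a Host Node at random.

  • /nova/scheduler/client: the entry point used by callers of the scheduler.

  • /nova/scheduler/filter_scheduler.py: contains only the FilterScheduler class (filter scheduler), which inherits from Scheduler and picks a Host Node according to the configured filters.

  • /nova/scheduler/filters and /nova/scheduler/weights: these two directories hold the implementations of the filters and the weights, respectively.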

Stage 1: nova-scheduler receives the RPC call from build_instances

nova-conductor ==> RPC scheduler_client.select_destinations() ==> nova-scheduler

#nova.conductor.manager.ComputeTaskManager:build_instances()

def build_instances(self, context, instances, image, filter_properties,
        admin_password, injected_files, requested_networks,
        security_groups, block_device_mapping=None, legacy_bdm=True):
    # TODO(ndipanov): Remove block_device_mapping and legacy_bdm in version
    #                 2.0 of the RPC API.

    # Build the request spec describing the instances to be created
    request_spec = scheduler_utils.build_request_spec(context, image,
                                                      instances)

    # TODO(danms): Remove this in version 2.0 of the RPC API
    if (requested_networks and
            not isinstance(requested_networks,
                           objects.NetworkRequestList)):
        # Convert the requested networks into a NetworkRequestList
        requested_networks = objects.NetworkRequestList(
            objects=[objects.NetworkRequest.from_tuple(t)
                     for t in requested_networks])
    # TODO(melwitt): Remove this in version 2.0 of the RPC API

    # Get the flavor information
    flavor = filter_properties.get('instance_type')
    if flavor and not isinstance(flavor, objects.Flavor):
        # Code downstream may expect extra_specs to be populated since it
        # is receiving an object, so lookup the flavor to ensure this.
        flavor = objects.Flavor.get_by_id(context, flavor['id'])
        filter_properties = dict(filter_properties, instance_type=flavor)

    try:
        scheduler_utils.setup_instance_group(context, request_spec,
                                             filter_properties)
        # check retry policy. Rather ugly use of instances[0]...
        # but if we've exceeded max retries... then we really only
        # have a single instance.
        scheduler_utils.populate_retry(filter_properties,
                                       instances[0].uuid)

        # Ask nova-scheduler for the list of hosts
        hosts = self.scheduler_client.select_destinations(context,
                request_spec, filter_properties)

    except Exception as exc:
        updates = {'vm_state': vm_states.ERROR, 'task_state': None}
        for instance in instances:
            self._set_vm_state_and_notify(
                context, instance.uuid, 'build_instances', updates,
                exc, request_spec)
        return

    for (instance, host) in itertools.izip(instances, hosts):
        try:
            instance.refresh()
        except (exception.InstanceNotFound,
                exception.InstanceInfoCacheNotFound):
            LOG.debug('Instance deleted during build', instance=instance)
            continue
        local_filter_props = copy.deepcopy(filter_properties)
        scheduler_utils.populate_filter_properties(local_filter_props,
                                                   host)
        # The block_device_mapping passed from the api doesn't contain
        # instance specific information
        bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
            context, instance.uuid)

        self.compute_rpcapi.build_and_run_instance(context,
                instance=instance, host=host['host'], image=image,
                request_spec=request_spec,
                filter_properties=local_filter_props,
                admin_password=admin_password,
                injected_files=injected_files,
                requested_networks=requested_networks,
                security_groups=security_groups,
                block_device_mapping=bdms, node=host['nodename'],
                limits=host['limits'])

While calling nova-scheduler to obtain the hosts that can run the instances, nova-conductor also gathers information such as requested_networks and flavor.

The code that obtains the host list is:

        # Ask nova-scheduler for the list of hosts
        hosts = self.scheduler_client.select_destinations(context,
                request_spec, filter_properties)

The chain of calls made to obtain the host list is shown below.

# nova.scheduler.client.query.SchedulerQueryClient:select_destinations()

from nova.scheduler import rpcapi as scheduler_rpcapi

class SchedulerQueryClient(object):
    """Client class for querying to the scheduler."""

    def __init__(self):
        self.scheduler_rpcapi = scheduler_rpcapi.SchedulerAPI()

    def select_destinations(self, context, request_spec, filter_properties):
        """Returns destinations(s) best suited for this request_spec and
        filter_properties.

        The result should be a list of dicts with 'host', 'nodename' and
        'limits' as keys.
        """
        return self.scheduler_rpcapi.select_destinations(
            context, request_spec, filter_properties)


# nova.scheduler.rpcapi.SchedulerAPI:select_destinations()

def select_destinations(self, ctxt, request_spec, filter_properties):
    cctxt = self.client.prepare(version='4.0')
    return cctxt.call(ctxt, 'select_destinations',
        request_spec=request_spec, filter_properties=filter_properties)

Stage 2: from scheduler.rpcapi.SchedulerAPI to scheduler.manager.SchedulerManager

The interface functions in rpcapi.py are backed by the functions in manager.py that do the actual work.
So we jump to nova.scheduler.manager.SchedulerManager:select_destinations().

# nova.scheduler.manager.SchedulerManager:select_destinations()
class SchedulerManager(manager.Manager):
    """Chooses a host to run instances on."""

    target = messaging.Target(version='4.2')

    def __init__(self, scheduler_driver=None, *args, **kwargs):
        if not scheduler_driver:
            scheduler_driver = CONF.scheduler_driver
        # The driver is an instance of the class named by the configuration
        # option, e.g. nova.scheduler.filter_scheduler.FilterScheduler
        self.driver = importutils.import_object(scheduler_driver)
        super(SchedulerManager, self).__init__(service_name='scheduler',
                                               *args, **kwargs)

    def select_destinations(self, context, request_spec, filter_properties):
        """Returns destinations(s) best suited for this request_spec and
        filter_properties.

        The result should be a list of dicts with 'host', 'nodename' and
        'limits' as keys.
        """
        dests = self.driver.select_destinations(context, request_spec,
                                                filter_properties)
        return jsonutils.to_primitive(dests)
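The driver is loaded from a dotted class path held in configuration. The sketch below approximates that idea with the standard library only; the local `import_object` helper is written for this example and is not the oslo.utils implementation, and a standard-library class stands in for the scheduler driver.

```python
import importlib


def import_object(dotted_path, *args, **kwargs):
    """Import 'package.module.ClassName' and return an instance of the class."""
    module_name, _, class_name = dotted_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)(*args, **kwargs)


# In Nova the string would come from CONF.scheduler_driver; here a
# standard-library class is used so the example runs anywhere.
driver = import_object('collections.OrderedDict')
```

Loading the class by name is what lets the scheduler be swapped by editing nova.conf, without touching the manager code.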

Stage 3: from scheduler.manager.SchedulerManager to the FilterScheduler

vim /etc/nova/nova.conf

scheduler_driver = nova.scheduler.filter_scheduler.FilterScheduler

From the value of the scheduler_driver option we can tell that nova.scheduler.manager.SchedulerManager:driver is an instance of nova.scheduler.filter_scheduler.FilterScheduler.
So we jump to nova.scheduler.filter_scheduler.FilterScheduler:select_destinations().

# nova.scheduler.filter_scheduler.FilterScheduler:select_destinations()

class FilterScheduler(driver.Scheduler):
    """Scheduler that can be used for filtering and weighing."""
    def __init__(self, *args, **kwargs):
        super(FilterScheduler, self).__init__(*args, **kwargs)
        self.options = scheduler_options.SchedulerOptions()
        self.notifier = rpc.get_notifier('scheduler')

    def select_destinations(self, context, request_spec, filter_properties):
        """Selects a filtered set of hosts and nodes."""
        self.notifier.info(context, 'scheduler.select_destinations.start',
                           dict(request_spec=request_spec))

        # Number of instances to create
        num_instances = request_spec['num_instances']

        # Get the list of hosts chosen by filtering and weighing
        # (see the FilterScheduler workflow described earlier)
        # nova.scheduler.filter_scheduler.FilterScheduler:_schedule() ==> return selected_hosts
        selected_hosts = self._schedule(context, request_spec,
                                        filter_properties)

        # Couldn't fulfill the request_spec
        # If more instances were requested than suitable hosts were found,
        # no instance is created and NoValidHost is raised with
        # 'There are not enough hosts available.'
        if len(selected_hosts) < num_instances:
            # NOTE(Rui Chen): If multiple creates failed, set the updated time
            # of selected HostState to None so that these HostStates are
            # refreshed according to database in next schedule, and release
            # the resource consumed by instance in the process of selecting
            # host.
            for host in selected_hosts:
                host.obj.updated = None

            # Log the details but don't put those into the reason since
            # we don't want to give away too much information about our
            # actual environment.
            LOG.debug('There are %(hosts)d hosts available but '
                      '%(num_instances)d instances requested to build.',
                      {'hosts': len(selected_hosts),
                       'num_instances': num_instances})

            reason = _('There are not enough hosts available.')
            raise exception.NoValidHost(reason=reason)

        dests = [dict(host=host.obj.host, nodename=host.obj.nodename,
                      limits=host.obj.limits) for host in selected_hosts]

        self.notifier.info(context, 'scheduler.select_destinations.end',
                           dict(request_spec=request_spec))
        return dests


    def _schedule(self, context, request_spec, filter_properties):
        # NOTE: this excerpt omits the setup code at the top of the method,
        # where elevated, instance_properties and update_group_hosts
        # are defined.

        # Get the state of all hosts
        hosts = self._get_all_host_states(elevated)

        selected_hosts = []

        # Number of instances to create
        num_instances = request_spec.get('num_instances', 1)

        # Loop num_instances times, choosing a suitable host for each instance
        for num in range(num_instances):
            # Filter local hosts based on requirements ...

            # The two key operations inside the loop are
            # get_filtered_hosts() and get_weighed_hosts()
            hosts = self.host_manager.get_filtered_hosts(hosts,
                    filter_properties, index=num)
            if not hosts:
                # Can't get any more locally.
                break

            LOG.debug("Filtered %(hosts)s", {'hosts': hosts})

            weighed_hosts = self.host_manager.get_weighed_hosts(hosts,
                    filter_properties)

            LOG.debug("Weighed %(hosts)s", {'hosts': weighed_hosts})

            scheduler_host_subset_size = CONF.scheduler_host_subset_size

            # The two ifs below clamp the subset size so the slice passed
            # to random.choice is never empty or out of range
            if scheduler_host_subset_size > len(weighed_hosts):
                scheduler_host_subset_size = len(weighed_hosts)
            if scheduler_host_subset_size < 1:
                scheduler_host_subset_size = 1

            # Pick at random among the top-weighted hosts that passed
            chosen_host = random.choice(
                weighed_hosts[0:scheduler_host_subset_size])
            LOG.debug("Selected host: %(host)s", {'host': chosen_host})
            selected_hosts.append(chosen_host)

            # Now consume the resources so the filter/weights
            # will change for the next instance.
            chosen_host.obj.consume_from_instance(instance_properties)
            if update_group_hosts is True:
                if isinstance(filter_properties['group_hosts'], list):
                    filter_properties['group_hosts'] = set(
                        filter_properties['group_hosts'])
                filter_properties['group_hosts'].add(chosen_host.obj.host)
        # Once a host has been chosen for every instance, return the list
        return selected_hosts

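The function above relies on three key operations:

  • _get_all_host_states: gets the state of all hosts and performs a preliminary screening of them.
  • get_filtered_hosts: applies the Filters to the hosts returned by the first function, filtering them a second time.
  • get_weighed_hosts: selects the best host by weight.

These three functions are covered in more detail below.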
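The host choice inside the loop does not always take the single top-weighted host: it picks at random from the best few, with the subset size clamped to a valid range. A minimal sketch of just that step follows; `pick_host` is a hypothetical helper name, and the hosts are plain strings already sorted best first.

```python
import random


def pick_host(weighed_hosts, subset_size, rng=random):
    """Pick at random among the top `subset_size` hosts of a sorted list.

    Assumes weighed_hosts is non-empty; the scheduler loop breaks out
    before this point when no host passed the filters.
    """
    subset_size = min(subset_size, len(weighed_hosts))  # not past the end
    subset_size = max(subset_size, 1)                   # at least one host
    return rng.choice(weighed_hosts[0:subset_size])


weighed = ['node2', 'node1', 'node3']   # already sorted, best first
always_best = pick_host(weighed, subset_size=1)
one_of_top_two = pick_host(weighed, subset_size=2)
clamped = pick_host(weighed, subset_size=0)   # clamped up to 1
```

With a subset size of 1 the choice is deterministic. A larger subset spreads concurrent requests across several good hosts, which reduces the chance that multiple schedulers pick the same host at the same moment.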

First, in host_manager.get_filtered_hosts(), host_manager is a member variable of nova.scheduler.driver.Scheduler, as shown below:

# nova.scheduler.driver.Scheduler:__init__()

# nova.scheduler.filter_scheduler.FilterScheduler inherits from nova.scheduler.driver.Scheduler
class Scheduler(object):
    """The base class that all Scheduler classes should inherit from."""

    def __init__(self):
        # host_manager is imported dynamically according to the configuration file
        self.host_manager = importutils.import_object(
            CONF.scheduler_host_manager)
        self.servicegroup_api = servicegroup.API()

Also note how host states are obtained: _get_all_host_states() in scheduler.filter_scheduler.FilterScheduler:_schedule() is backed by HostManager.get_all_host_states(), which is implemented as follows:

# nova.scheduler.host_manager.HostManager:get_all_host_states()

def get_all_host_states(self, context):

    service_refs = {service.host: service
                    for service in objects.ServiceList.get_by_binary(
                        context, 'nova-compute')}

    # Get the compute node resources
    compute_nodes = objects.ComputeNodeList.get_all(context)
    # nova.objects.__init__()
    #   ==> nova.objects.compute_node.ComputeNodeList:get_all
    seen_nodes = set()
    for compute in compute_nodes:
        service = service_refs.get(compute.host)

        if not service:
            LOG.warning(_LW(
                "No compute service record found for host %(host)s"),
                {'host': compute.host})
            continue
        host = compute.host
        node = compute.hypervisor_hostname
        state_key = (host, node)
        host_state = self.host_state_map.get(state_key)

        # Update the host information
        if host_state:
            host_state.update_from_compute_node(compute)
        else:
            host_state = self.host_state_cls(host, node, compute=compute)
            self.host_state_map[state_key] = host_state
        # We force to update the aggregates info each time a new request
        # comes in, because some changes on the aggregates could have been
        # happening after setting this field for the first time
        host_state.aggregates = [self.aggs_by_id[agg_id] for agg_id in
                                 self.host_aggregates_map[
                                     host_state.host]]
        host_state.update_service(dict(service))
        self._add_instance_info(context, compute, host_state)
        seen_nodes.add(state_key)

    # remove compute nodes from host_state_map if they are not active
    dead_nodes = set(self.host_state_map.keys()) - seen_nodes

    for state_key in dead_nodes:
        host, node = state_key
        LOG.info(_LI("Removing dead compute node %(host)s:%(node)s "
                     "from scheduler"), {'host': host, 'node': node})
        del self.host_state_map[state_key]

    return six.itervalues(self.host_state_map)
# In short, get_all_host_states refreshes the cached host states and
# drops nodes that are no longer active
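The refresh-and-prune pattern in this function can be isolated into a small sketch. It is a simplified illustration, not Nova code: host states and compute node records are plain dicts, and `refresh_host_state_map` is a hypothetical name.

```python
def refresh_host_state_map(host_state_map, compute_nodes):
    """Update cached states from fresh records and drop unreported nodes."""
    seen = set()
    for node in compute_nodes:
        key = (node['host'], node['hypervisor_hostname'])
        # Create the cache entry on first sight, then refresh its values.
        host_state_map.setdefault(key, {}).update(node)
        seen.add(key)
    # Anything cached but absent from the latest records is considered dead.
    for key in set(host_state_map) - seen:
        del host_state_map[key]
    return host_state_map


host_state_map = {
    ('node1', 'node1'): {'free_ram_mb': 1024},
    ('node9', 'node9'): {'free_ram_mb': 512},   # no longer reported
}
compute_nodes = [
    {'host': 'node1', 'hypervisor_hostname': 'node1', 'free_ram_mb': 2048},
    {'host': 'node2', 'hypervisor_hostname': 'node2', 'free_ram_mb': 4096},
]
refresh_host_state_map(host_state_map, compute_nodes)
```

Keying the map by the (host, node) pair mirrors the `state_key` tuple in the excerpt, and the set difference is what identifies entries to remove.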

Next, look at how objects.ComputeNodeList.get_all(context), the function that fetches compute node resource information, is implemented.

# nova.objects.compute_node:get_all()

@base.remotable_classmethod
def get_all(cls, context):
    # Calls through to nova.db.api.compute_node_get_all()
    db_computes = db.compute_node_get_all(context)

    return base.obj_make_list(context, cls(context), objects.ComputeNode,
                              db_computes)


# nova.db.api:compute_node_get_all()

def compute_node_get_all(context):
    """Get all computeNodes.

    :param context: The security context

    :returns: List of dictionaries each containing compute node properties
    """
    return IMPL.compute_node_get_all(context)

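At this point it is clear that the Liberty version of nova-scheduler can still access the database.

Question: how does nova-scheduler update host information? Can it write to the database directly?
Answer: no. nova-scheduler cannot write to the database, but it can read host resource data from the database and cache it in process memory, as follows: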

# nova.scheduler.host_manager.HostState:__init__()
class HostState(object):
    """Mutable and immutable information tracked for a host.
    This is an attempt to remove the ad-hoc data structures
    previously used and lock down access.
    """

    def __init__(self, host, node, compute=None):
        self.host = host
        self.nodename = node

        # Mutable available resources.
        # These will change as resources are virtually "consumed".
        self.total_usable_ram_mb = 0
        self.total_usable_disk_gb = 0
        self.disk_mb_used = 0
        self.free_ram_mb = 0
        self.free_disk_mb = 0
        self.vcpus_total = 0
        self.vcpus_used = 0
        self.pci_stats = None
        self.numa_topology = None

        # Additional host information from the compute node stats:
        self.num_instances = 0
        self.num_io_ops = 0

        # Other information
        self.host_ip = None
        self.hypervisor_type = None
        self.hypervisor_version = None
        self.hypervisor_hostname = None
        self.cpu_info = None
        self.supported_instances = None

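nova-scheduler has no functions that write to the database, but it does cache database data in process memory. This lets nova-scheduler work with the latest host resource information while also reducing the I/O load on the database.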

Stage 4: from the FilterScheduler to the Filters

The code above calls the filtering function get_filtered_hosts(), which is implemented as follows:

# nova.scheduler.host_manager.HostManager:get_filtered_hosts()
def get_filtered_hosts(self, hosts, filter_properties,
        filter_class_names=None, index=0):
    """Filter hosts and return only ones passing all filters."""
    # Several local helper functions are defined here; omitted for now
    def _strip_ignore_hosts(host_map, hosts_to_ignore):
        ignored_hosts = []
        for host in hosts_to_ignore:
            ...

    ...
    # Return the validated, usable filters
    filter_classes = self._choose_host_filters(filter_class_names)
    ...
    # Delegate to get_filtered_objects
    return self.filter_handler.get_filtered_objects(filters,
            hosts, filter_properties, index)



# Continue into get_filtered_objects()
def get_filtered_objects(self, filters, objs, filter_properties, index=0):
    list_objs = list(objs)
    LOG.debug("Starting with %d host(s)", len(list_objs))
    part_filter_results = []
    full_filter_results = []
    log_msg = "%(cls_name)s: (start: %(start)s, end: %(end)s)"
    for filter_ in filters:
        if filter_.run_filter_for_index(index):
            cls_name = filter_.__class__.__name__
            start_count = len(list_objs)
            # The key line: apply this filter to the remaining hosts
            objs = filter_.filter_all(list_objs, filter_properties)
            if objs is None:
                LOG.debug("Filter %s says to stop filtering", cls_name)
                return
            list_objs = list(objs)
            end_count = len(list_objs)
            part_filter_results.append(log_msg % {"cls_name": cls_name,
                    "start": start_count, "end": end_count})
            if list_objs:
                remaining = [(getattr(obj, "host", obj),
                              getattr(obj, "nodename", ""))
                             for obj in list_objs]
                full_filter_results.append((cls_name, remaining))

    return list_objs



# objs above comes from filter_.filter_all(list_objs, filter_properties)
def filter_all(self, filter_obj_list, filter_properties):
    for obj in filter_obj_list:
        if self._filter_one(obj, filter_properties):
            # The object passes the filter, so yield it
            yield obj



# Which in turn calls _filter_one()
def _filter_one(self, obj, filter_properties):
    # Return True if the host passes this Filter, otherwise False
    return self.host_passes(obj, filter_properties)

After this chain of calls, the Filters have finished their work.
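To see the whole chain run end to end, here is a condensed, self-contained sketch of it. The class names `BaseFilterSketch`, `EnabledFilter` and `RamFilterSketch` are hypothetical, hosts are plain dicts, and the handler is reduced to its core loop, so this illustrates the call pattern rather than reproducing Nova's code.

```python
class BaseFilterSketch(object):
    """Simplified stand-in for the filter base class (not Nova code)."""

    def host_passes(self, host, filter_properties):
        raise NotImplementedError()

    def filter_all(self, hosts, filter_properties):
        # Generator: yields only the hosts for which host_passes() is true.
        for host in hosts:
            if self.host_passes(host, filter_properties):
                yield host


class EnabledFilter(BaseFilterSketch):
    def host_passes(self, host, filter_properties):
        return host['enabled']


class RamFilterSketch(BaseFilterSketch):
    def host_passes(self, host, filter_properties):
        return host['free_ram_mb'] >= filter_properties['memory_mb']


def get_filtered_objects(filters, objs, filter_properties):
    """Apply each filter to the survivors of the previous one."""
    list_objs = list(objs)
    counts = []
    for filter_ in filters:
        start = len(list_objs)
        # list() drains the generator so the next filter sees a real list.
        list_objs = list(filter_.filter_all(list_objs, filter_properties))
        counts.append((type(filter_).__name__, start, len(list_objs)))
        if not list_objs:
            break  # nothing left to filter
    return list_objs, counts


hosts = [
    {'name': 'node1', 'enabled': True, 'free_ram_mb': 512},
    {'name': 'node2', 'enabled': True, 'free_ram_mb': 4096},
    {'name': 'node3', 'enabled': False, 'free_ram_mb': 8192},
]
passed, counts = get_filtered_objects(
    [EnabledFilter(), RamFilterSketch()], hosts, {'memory_mb': 2048})
```

The per-filter start and end counts play the same role as `part_filter_results` in the excerpt: they show which filter eliminated how many hosts, which is useful when a request ends with no valid host.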

Stage 5: from the Filters to weight calculation and sorting

# nova.scheduler.host_manager.HostManager:get_weighed_hosts()
def get_weighed_hosts(self, hosts, weight_properties):
    """Weigh the hosts."""
    return self.weight_handler.get_weighed_objects(self.weighers,
            hosts, weight_properties)


# nova.weights.BaseWeightHandler:get_weighed_objects()
class BaseWeightHandler(loadables.BaseLoader):
    object_class = WeighedObject

    def get_weighed_objects(self, weighers, obj_list, weighing_properties):
        """Return a sorted (descending), normalized list of WeighedObjects."""
        weighed_objs = [self.object_class(obj, 0.0) for obj in obj_list]

        if len(weighed_objs) <= 1:
            return weighed_objs

        for weigher in weighers:
            weights = weigher.weigh_objects(weighed_objs, weighing_properties)

            # Normalize the weights
            weights = normalize(weights,
                                minval=weigher.minval,
                                maxval=weigher.maxval)

            for i, weight in enumerate(weights):
                obj = weighed_objs[i]
                obj.weight += weigher.weight_multiplier() * weight

        # Sort by total weight, highest first
        return sorted(weighed_objs, key=lambda x: x.weight, reverse=True)
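The arithmetic in get_weighed_objects() is easier to follow with concrete numbers. The sketch below is an assumption-laden approximation: `normalize` here is a simple min-max scaling written for this example (the real helper is only referenced, not shown, in the excerpt), `weigh` is a hypothetical function, and hosts are plain dicts.

```python
def normalize(weights, minval=None, maxval=None):
    """Scale weights linearly into [0, 1]; all-equal input maps to zeros."""
    weights = list(weights)
    if not weights:
        return []
    minval = float(min(weights) if minval is None else minval)
    maxval = float(max(weights) if maxval is None else maxval)
    if minval == maxval:
        return [0.0] * len(weights)
    return [(w - minval) / (maxval - minval) for w in weights]


def weigh(hosts, weighers):
    """weighers: list of (multiplier, function(host) -> raw weight)."""
    totals = [0.0] * len(hosts)
    for multiplier, raw in weighers:
        for i, w in enumerate(normalize([raw(h) for h in hosts])):
            totals[i] += multiplier * w
    # Highest total weight first, as in the excerpt above.
    return sorted(zip(hosts, totals), key=lambda p: p[1], reverse=True)


hosts = [
    {'name': 'node1', 'free_ram_mb': 2048, 'num_instances': 5},
    {'name': 'node2', 'free_ram_mb': 8192, 'num_instances': 1},
    {'name': 'node3', 'free_ram_mb': 4096, 'num_instances': 3},
]
weighers = [
    (1.0, lambda h: h['free_ram_mb']),      # prefer more free RAM
    (-1.0, lambda h: h['num_instances']),   # prefer fewer instances
]
ranked = weigh(hosts, weighers)
order = [h['name'] for h, _ in ranked]
```

Normalizing each weigher's output before applying its multiplier keeps quantities with very different units, such as megabytes and instance counts, on a comparable scale, so the multipliers alone decide how much each weigher matters.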