scheduler source code analysis: preempt (preemption)

Date: 2022-09-26 00:13:01

Preface

When the scheduler's scheduling flow was covered earlier, the preempt mechanism was mentioned: it kicks in when the filtering (predicate) phase fails to find a node for a pod. Due to length constraints it was not covered in detail at the time.

As a recap, the main logic of the preemption flow lives in the DefaultPreemption.preempt method, and its steps are:

  1. Get the latest version of the pod, refreshing the lister cache
  2. Make sure the preemptor is eligible to preempt other Pods
  3. Find the preemption candidates
  4. Interact with registered extenders to filter out some candidates if needed
  5. Pick the best candidate
  6. Do the preparation work before nominating the selected candidate

The code is located at /pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go

func (pl *DefaultPreemption) preempt(...) (string, error) {
	cs := pl.fh.ClientSet()
	ph := pl.fh.PreemptHandle()
	nodeLister := pl.fh.SnapshotSharedLister().NodeInfos()
	// 1. Get the latest version of the pod, refreshing the lister cache
	pod, err := pl.podLister.Pods(pod.Namespace).Get(pod.Name)
	// 2. Make sure the preemptor is eligible to preempt other Pods
	if !PodEligibleToPreemptOthers(pod, nodeLister, m[pod.Status.NominatedNodeName]) {
	}
	// 3. Find the preemption candidates
	candidates, err := FindCandidates(ctx, cs, state, pod, m, ph, nodeLister, pl.pdbLister)
	// 4. Interact with registered extenders to filter out some candidates if needed
	candidates, err = CallExtenders(ph.Extenders(), pod, nodeLister, candidates)
	// 5. Pick the best candidate
	bestCandidate := SelectCandidate(candidates)
	// 6. Do the preparation work before nominating the selected candidate
	if err := PrepareCandidate(bestCandidate, pl.fh, cs, pod); err != nil {
	}
	return bestCandidate.Name(), nil
}

The details of each of these functions are covered below.

PodEligibleToPreemptOthers

func PodEligibleToPreemptOthers(pod *v1.Pod, nodeInfos framework.NodeInfoLister, nominatedNodeStatus *framework.Status) bool {
	if pod.Spec.PreemptionPolicy != nil && *pod.Spec.PreemptionPolicy == v1.PreemptNever {
		klog.V(5).Infof("Pod %v/%v is not eligible for preemption because it has a preemptionPolicy of %v", pod.Namespace, pod.Name, v1.PreemptNever)
		return false
	}
	nomNodeName := pod.Status.NominatedNodeName
	if len(nomNodeName) > 0 {
		// If the pod's nominated node is considered as UnschedulableAndUnresolvable by the filters,
		// then the pod should be considered for preempting again.
		if nominatedNodeStatus.Code() == framework.UnschedulableAndUnresolvable {
			return true
		}
		if nodeInfo, _ := nodeInfos.Get(nomNodeName); nodeInfo != nil {
			podPriority := podutil.GetPodPriority(pod)
			for _, p := range nodeInfo.Pods {
				if p.Pod.DeletionTimestamp != nil && podutil.GetPodPriority(p.Pod) < podPriority {
					// There is a terminating pod on the nominated node.
					return false
				}
			}
		}
	}
	return true
}

If the pod's preemption policy is set to never preempt, the pod is not eligible for the preemption mechanism and the function returns false right away.
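
A Pod's spec.preemptionPolicy is normally filled in from the PriorityClass it references, so a pod that should never preempt gets that behavior from its class. A minimal illustrative sketch of such a class (the class name and value below are made up, not taken from the article):

package example

import (
	corev1 "k8s.io/api/core/v1"
	schedulingv1 "k8s.io/api/scheduling/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nonPreemptingClass builds a PriorityClass whose pods keep their priority for
// queue ordering but never preempt other pods, so they fail the first check in
// PodEligibleToPreemptOthers. Name and value are illustrative only.
func nonPreemptingClass() *schedulingv1.PriorityClass {
	never := corev1.PreemptNever
	return &schedulingv1.PriorityClass{
		ObjectMeta:       metav1.ObjectMeta{Name: "high-priority-no-preempt"},
		Value:            100000,
		PreemptionPolicy: &never,
	}
}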

If pod.Status.NominatedNodeName is not empty, the pod has already gone through a preemption: when a pod wins a preemption onto some node, pod.Status.NominatedNodeName is filled in with that node's name. If the field is empty, no preemption has happened and the pod may proceed. If the pod has previously preempted onto that node, the function checks whether a lower-priority Pod on it is currently being deleted (p.Pod.DeletionTimestamp != nil). If so, the current Pod is not allowed to preempt again for now: the terminating Pod was quite possibly evicted by this Pod's previous preemption, so the current Pod should wait and see whether it can be scheduled normally once the terminating Pod is gone, avoiding needless preemption.

FindCandidates

The FindCandidates function looks for the full set of preemption candidates. A candidate is a node that could be preempted onto, together with the Pods on that node that would be evicted by the preemption (the victims) and the number of PDB violations among those victims. The related structures are defined as follows:

type candidate struct {
	victims *extenderv1.Victims
	name    string
}

type Victims struct {
	Pods             []*v1.Pod
	NumPDBViolations int64
}

The FindCandidates function is defined as follows:

func FindCandidates(ctx context.Context, cs kubernetes.Interface, state *framework.CycleState, pod *v1.Pod,
	m framework.NodeToStatusMap, ph framework.PreemptHandle, nodeLister framework.NodeInfoLister,
	pdbLister policylisters.PodDisruptionBudgetLister) ([]Candidate, error) {
	allNodes, err := nodeLister.List()
	if err != nil {
		return nil, err
	}
	if len(allNodes) == 0 {
		return nil, core.ErrNoNodesAvailable
	}
	// Keep only the nodes that are not unschedulable-and-unresolvable
	potentialNodes := nodesWherePreemptionMightHelp(allNodes, m)
	if len(potentialNodes) == 0 {
		klog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
		// In this case, we should clean-up any existing nominated node name of the pod.
		if err := util.ClearNominatedNodeName(cs, pod); err != nil {
			klog.Errorf("Cannot clear 'NominatedNodeName' field of pod %v/%v: %v", pod.Namespace, pod.Name, err)
			// We do not return as this error is not critical.
		}
		return nil, nil
	}
	if klog.V(5).Enabled() {
		var sample []string
		for i := 0; i < 10 && i < len(potentialNodes); i++ {
			sample = append(sample, potentialNodes[i].Node().Name)
		}
		klog.Infof("%v potential nodes for preemption, first %v are: %v", len(potentialNodes), len(sample), sample)
	}
	pdbs, err := getPodDisruptionBudgets(pdbLister)
	if err != nil {
		return nil, err
	}
	// The key function
	return dryRunPreemption(ctx, ph, state, pod, potentialNodes, pdbs), nil
}

nodesWherePreemptionMightHelp

The nodesWherePreemptionMightHelp function filters out the UnschedulableAndUnresolvable nodes. Regarding the UnschedulableAndUnresolvable status, the source comment says it is used when a plugin finds the pod unschedulable during filtering and preemption would not change anything; if the pod could become schedulable by preempting, the plugin should return Unschedulable instead, and the accompanying status message should explain why the pod is unschedulable.

func nodesWherePreemptionMightHelp(nodes []*framework.NodeInfo, m framework.NodeToStatusMap) []*framework.NodeInfo {
	var potentialNodes []*framework.NodeInfo
	for _, node := range nodes {
		name := node.Node().Name
		// We reply on the status by each plugin - 'Unschedulable' or 'UnschedulableAndUnresolvable'
		// to determine whether preemption may help or not on the node.
		if m[name].Code() == framework.UnschedulableAndUnresolvable {
			continue
		}
		potentialNodes = append(potentialNodes, node)
	}
	return potentialNodes
}

dryRunPreemption

dryRunPreemption runs the simulated preemption in parallel: for each node returned by nodesWherePreemptionMightHelp it performs a dry-run preemption, and every node that passes the simulation produces a candidate recording the node name, the set of victims, and the number of PDB violations.

func dryRunPreemption(ctx context.Context, fh framework.PreemptHandle, state *framework.CycleState,
	pod *v1.Pod, potentialNodes []*framework.NodeInfo, pdbs []*policy.PodDisruptionBudget) []Candidate {
	var resultLock sync.Mutex
	var candidates []Candidate

	checkNode := func(i int) {
		nodeInfoCopy := potentialNodes[i].Clone()
		stateCopy := state.Clone()
		// Simulate the filter phase to compute the list of victim pods on this node
		// and the number of PDB violations among them
		pods, numPDBViolations, fits := selectVictimsOnNode(ctx, fh, stateCopy, pod, nodeInfoCopy, pdbs)
		if fits {
			resultLock.Lock()
			victims := extenderv1.Victims{
				Pods:             pods,
				NumPDBViolations: int64(numPDBViolations),
			}
			c := candidate{
				victims: &victims,
				name:    nodeInfoCopy.Node().Name,
			}
			candidates = append(candidates, &c)
			resultLock.Unlock()
		}
	}
	parallelize.Until(ctx, len(potentialNodes), checkNode)
	return candidates
}

selectVictimsOnNode

selectVictimsOnNode is the core function of the simulated preemption. The overall idea is:

  1. Find all Pods on the candidate node with lower priority, remove them, and treat them as potential victims
  2. Run the filter phase for the current Pod against the candidate node to check that it now fits
  3. Sort the potential victims by priority and try to filter-schedule them back onto the node one by one; those that cannot come back become the real victims, and the number of PDB violations among them is counted

In Kubernetes, a Pod's default priority is 0.
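
That default follows from how the priority is read. A small self-contained sketch of what podutil.GetPodPriority effectively does (written here for illustration only):

package example

import v1 "k8s.io/api/core/v1"

// effectivePriority mirrors podutil.GetPodPriority: use Spec.Priority when the
// priority admission plugin has populated it, otherwise fall back to 0, which
// is why a Pod without a PriorityClass has priority 0 by default.
func effectivePriority(pod *v1.Pod) int32 {
	if pod.Spec.Priority != nil {
		return *pod.Spec.Priority
	}
	return 0
}

Back in the source, the abridged selectVictimsOnNode looks like this: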

func selectVictimsOnNode(...) ([]*v1.Pod, int, bool) {
	// Simulate removing a pod from the node
	removePod := func(rp *v1.Pod) error {
	}
	// Simulate adding a pod back to the node
	addPod := func(ap *v1.Pod) error {
	}
	// Find all lower-priority Pods, remove them, and collect them as potential victims
	for _, p := range nodeInfo.Pods {
		if podutil.GetPodPriority(p.Pod) < podPriority {
			potentialVictims = append(potentialVictims, p.Pod)
			if err := removePod(p.Pod); err != nil {
				return nil, 0, false
			}
		}
	}
	// With the potential victims removed, run the filter phase to see whether the Pod fits on the node
	if fits, _, err := core.PodPassesFiltersOnNode(ctx, ph, state, pod, nodeInfo); !fits {
	}
	// Sort the potential victims by priority and split them into PDB-violating and non-violating ones
	sort.Slice(potentialVictims, func(i, j int) bool { return util.MoreImportantPod(potentialVictims[i], potentialVictims[j]) })
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
	// With the current Pod placed on the candidate node, try to add each potential victim back and
	// re-run the filter phase; those that no longer fit become the real victims of this node, and
	// the PDB violations among them are counted
	reprievePod := func(p *v1.Pod) (bool, error) {
		if err := addPod(p); err != nil {
			return false, err
		}
		// Run the filter phase
		fits, _, _ := core.PodPassesFiltersOnNode(ctx, ph, state, pod, nodeInfo)
		if !fits {
			if err := removePod(p); err != nil {
				return false, err
			}
			victims = append(victims, p)
			klog.V(5).Infof("Pod %v/%v is a potential preemption victim on node %v.", p.Namespace, p.Name, nodeInfo.Node().Name)
		}
		return fits, nil
	}
	for _, p := range violatingVictims {
		if fits, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		} else if !fits {
			numViolatingVictim++
		}
	}
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		if _, err := reprievePod(p); err != nil {
			klog.Warningf("Failed to reprieve pod %q: %v", p.Name, err)
			return nil, 0, false
		}
	}
}

core.PodPassesFiltersOnNode is the function from the previous article that runs the filter (predicate) plugins. Each time it is called, those filter plugins may run up to twice:

  • The first run first adds the nominated Pods on this node and then checks whether the current Pod can still be scheduled; nominated Pods are those that have won a preemption onto this Node but have not actually been bound to it yet
  • If there are no nominated Pods on the node at all, or the first run already fails, the second run is skipped entirely

Running twice mainly accounts for inter-pod affinity and anti-affinity between the nominated Pods and the current Pod. The code is located at /pkg/scheduler/core/generic_scheduler.go

func PodPassesFiltersOnNode(...) {
	for i := 0; i < 2; i++ {
		stateToUse := state
		nodeInfoToUse := info
		if i == 0 {
			var err error
			podsAdded, stateToUse, nodeInfoToUse, err = addNominatedPods(ctx, ph, pod, state, info)
			if err != nil {
				return false, nil, err
			}
		} else if !podsAdded || !status.IsSuccess() {
			break
		}
		statusMap := ph.RunFilterPlugins(ctx, stateToUse, pod, nodeInfoToUse)
		status = statusMap.Merge()
		if !status.IsSuccess() && !status.IsUnschedulable() {
			return false, status, status.AsError()
		}
	}
}

That concludes the simulated-preemption logic. When it finishes, it yields some number of candidate nodes; if there are none at all, the preemption has failed.

CallExtenders

CallExtenders mainly runs all the candidates through the extenders' preemption filtering. The abbreviated code is as follows:

func CallExtenders(extenders []framework.Extender, pod *v1.Pod, nodeLister framework.NodeInfoLister,
	candidates []Candidate) ([]Candidate, error) {
	victimsMap := candidatesToVictimsMap(candidates)
	for _, extender := range extenders {
		nodeNameToVictims, err := extender.ProcessPreemption(pod, victimsMap, nodeLister)
		victimsMap = nodeNameToVictims
	}
	for nodeName := range victimsMap {
		newCandidates = append(newCandidates, &candidate{
			victims: victimsMap[nodeName],
			name:    nodeName,
		})
	}
}

SelectCandidate

After the extender filtering, SelectCandidate is called to pick the best node to preempt from the remaining candidates.

  • When there is only one candidate, no selection is needed
  • Otherwise a series of selection criteria is applied to work out the best candidate
  • If none can be singled out, the first candidate in the set is used by default

func SelectCandidate(candidates []Candidate) Candidate {
	if len(candidates) == 0 {
		return nil
	}
	if len(candidates) == 1 {
		return candidates[0]
	}
	// Convert the candidates into a map of nodeName -> victims
	victimsMap := candidatesToVictimsMap(candidates)
	// Pick the best candidate according to a series of selection criteria
	candidateNode := pickOneNodeForPreemption(victimsMap)
	// Same as candidatesToVictimsMap, this logic is not applicable for out-of-tree
	// preemption plugins that exercise different candidates on the same nominated node.
	if victims := victimsMap[candidateNode]; victims != nil {
		return &candidate{
			victims: victims,
			name:    candidateNode,
		}
	}
	// We shouldn't reach here.
	klog.Errorf("should not reach here, no candidate selected from %v.", candidates)
	// To not break the whole flow, return the first candidate.
	return candidates[0]
}

The criteria for the best candidate are as follows:

  1. The one with the fewest PDB violations
  2. The one whose highest-priority victim has the lowest priority
  3. The one with the smallest sum of victim priorities
  4. The one with the fewest victims
  5. The one whose highest-priority victims were started most recently

These criteria are applied layer by layer: as soon as a layer leaves only one candidate, that remaining candidate is the best one.
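
The layering can be pictured as repeatedly keeping only the candidates that score lowest on the current criterion and stopping once one remains. A toy, self-contained sketch of that pattern (node names and victim statistics below are made up; the real code inlines each pass instead of using a helper):

package main

import "fmt"

// keepMin keeps only the nodes with the smallest score for one criterion.
func keepMin(nodes []string, score func(string) int64) []string {
	if len(nodes) <= 1 {
		return nodes
	}
	best := score(nodes[0])
	out := []string{nodes[0]}
	for _, n := range nodes[1:] {
		switch s := score(n); {
		case s < best:
			best, out = s, []string{n}
		case s == best:
			out = append(out, n)
		}
	}
	return out
}

func main() {
	// Hypothetical per-node victim statistics.
	pdbViolations := map[string]int64{"node-a": 0, "node-b": 0, "node-c": 2}
	victimCount := map[string]int64{"node-a": 3, "node-b": 1, "node-c": 1}

	nodes := []string{"node-a", "node-b", "node-c"}
	nodes = keepMin(nodes, func(n string) int64 { return pdbViolations[n] }) // criterion 1: fewest PDB violations
	nodes = keepMin(nodes, func(n string) int64 { return victimCount[n] })   // criterion 4: fewest victims (2 and 3 omitted)
	fmt.Println(nodes) // [node-b]
}

The real implementation applies the five criteria in order, each pass narrowing the node list produced by the previous one: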

func pickOneNodeForPreemption(nodesToVictims map[string]*extenderv1.Victims) string {
	// Find the candidates with the fewest PDB violations
	for node, victims := range nodesToVictims {
		numPDBViolatingPods := victims.NumPDBViolations
		if numPDBViolatingPods < minNumPDBViolatingPods {
			minNumPDBViolatingPods = numPDBViolatingPods
			minNodes1 = nil
			lenNodes1 = 0
		}
		if numPDBViolatingPods == minNumPDBViolatingPods {
			minNodes1 = append(minNodes1, node)
			lenNodes1++
		}
	}
	// Among the remaining candidates, find the ones whose highest victim priority is the lowest
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		victims := nodesToVictims[node]
		// highestPodPriority is the highest priority among the victims on this node.
		highestPodPriority := podutil.GetPodPriority(victims.Pods[0])
		if highestPodPriority < minHighestPriority {
			minHighestPriority = highestPodPriority
			lenNodes2 = 0
		}
		if highestPodPriority == minHighestPriority {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	// Find the ones with the smallest sum of victim priorities
	for i := 0; i < lenNodes2; i++ {
		var sumPriorities int64
		node := minNodes2[i]
		for _, pod := range nodesToVictims[node].Pods {
			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
			// needed so that a node with a few pods with negative priority is not
			// picked over a node with a smaller number of pods with the same negative
			// priority (and similar scenarios).
			sumPriorities += int64(podutil.GetPodPriority(pod)) + int64(math.MaxInt32+1)
		}
		if sumPriorities < minSumPriorities {
			minSumPriorities = sumPriorities
			lenNodes1 = 0
		}
		if sumPriorities == minSumPriorities {
			minNodes1[lenNodes1] = node
			lenNodes1++
		}
	}
	// Find the ones with the fewest victims
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		numPods := len(nodesToVictims[node].Pods)
		if numPods < minNumPods {
			minNumPods = numPods
			lenNodes2 = 0
		}
		if numPods == minNumPods {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	// GetEarliestPodStartTime returns the start time of the longest-running highest-priority Pod
	latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])
	if latestStartTime == nil {
		// If the earliest start time of all pods on the 1st node is nil, just return it,
		// which is not expected to happen.
		klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", minNodes2[0])
		return minNodes2[0]
	}
	// Compare GetEarliestPodStartTime across nodes and take the latest one, i.e. prefer to
	// sacrifice the most recently started Pods and leave the long-running ones in place
	for i := 1; i < lenNodes2; i++ {
		node := minNodes2[i]
		// Get earliest start time of all pods on the current node.
		earliestStartTimeOnNode := util.GetEarliestPodStartTime(nodesToVictims[node])
		if earliestStartTimeOnNode == nil {
			klog.Errorf("earliestStartTime is nil for node %s. Should not reach here.", node)
			continue
		}
		if earliestStartTimeOnNode.After(latestStartTime.Time) {
			latestStartTime = earliestStartTimeOnNode
			nodeToReturn = node
		}
	}
}

The previous article mentioned six criteria for picking the best candidate, while pickOneNodeForPreemption covers only five. The sixth is simply that when SelectCandidate still cannot single out a best candidate after calling pickOneNodeForPreemption, it falls back to the first candidate in the set.

PrepareCandidate

Once the best candidate is selected, PrepareCandidate is called to carry out the preparation work, which includes:

  • Evicting the victims (looking at the source, this is actually a delete)
  • Rejecting any victims still sitting in the waitingMap
  • Clearing the other lower-priority Pods nominated onto the target Node (in practice this just updates those Pods' Status.NominatedNodeName field, putting them back in their pre-preemption state)

func PrepareCandidate(c Candidate, fh framework.FrameworkHandle, cs kubernetes.Interface, pod *v1.Pod) error {
	for _, victim := range c.Victims().Pods {
		// Evict (delete) the victim Pod
		if err := util.DeletePod(cs, victim); err != nil {
			klog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
			return err
		}
		// Reject the Pod if it is still waiting at a permit plugin
		if waitingPod := fh.GetWaitingPod(victim.UID); waitingPod != nil {
			waitingPod.Reject("preempted")
		}
	}
	// Clear the other lower-priority Pods nominated onto the target Node
	nominatedPods := getLowerPriorityNominatedPods(fh.PreemptHandle(), pod, c.Name())
	if err := util.ClearNominatedNodeName(cs, nominatedPods...); err != nil {
		klog.Errorf("Cannot clear 'NominatedNodeName' field: %v", err)
		// We do not return as this error is not critical.
	}
	return nil
}

util.DeletePod is defined as follows; the code is located at /pkg/scheduler/util/utils.go

func DeletePod(cs kubernetes.Interface, pod *v1.Pod) error {
	return cs.CoreV1().Pods(pod.Namespace).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
}

Back to scheduleOne

After the preemption logic has run, control returns to the Scheduler.scheduleOne function, which first records the name of the preemption target node and then calls the Scheduler.recordSchedulingFailure method:

if status.IsSuccess() && result != nil {
	nominatedNode = result.NominatedNodeName
}
...
sched.recordSchedulingFailure(prof, podInfo, err, v1.PodReasonUnschedulable, nominatedNode)

Scheduler.recordSchedulingFailure first adds the Pod back to the scheduler's SchedulingQueue and then updates the Pod's Status.NominatedNodeName field. The code is located at /pkg/scheduler/scheduler.go

func (sched *Scheduler) recordSchedulingFailure(prof *profile.Profile, podInfo *framework.QueuedPodInfo, err error, reason string, nominatedNode string) {
	sched.Error(podInfo, err)
	if sched.SchedulingQueue != nil {
		sched.SchedulingQueue.AddNominatedPod(podInfo.Pod, nominatedNode)
	}
	pod := podInfo.Pod
	prof.Recorder.Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", err.Error())
	if err := updatePod(sched.client, pod, &v1.PodCondition{...}, nominatedNode); err != nil {
	}
}

func updatePod(client clientset.Interface, pod *v1.Pod, condition *v1.PodCondition, nominatedNode string) error {
	podCopy := pod.DeepCopy()
	if !podutil.UpdatePodCondition(&podCopy.Status, condition) &&
		(len(nominatedNode) == 0 || pod.Status.NominatedNodeName == nominatedNode) {
		return nil
	}
	if nominatedNode != "" {
		podCopy.Status.NominatedNodeName = nominatedNode
	}
	return util.PatchPod(client, pod, podCopy)
}

As mentioned in the previous article, finishing preemption does not mean the pod can be scheduled onto the node right away. It still has to go back into the Scheduler's queue and wait for the victims on the chosen node to be evicted and their resources freed, in the hope that it can then be scheduled onto that node.

The Scheduler's SchedulingQueue

It was mentioned above that the Pod that preempted successfully is added to the scheduler's SchedulingQueue; this queue is introduced below. Also note that in the code above, the call that actually re-enqueues the pod so it can be scheduled in a later cycle is hidden inside sched.Error, not in sched.SchedulingQueue.AddNominatedPod.

SchedulingQueue is a member of the Scheduler struct; it is an interface defined in /pkg/scheduler/internal/queue/scheduling_queue.go

type SchedulingQueue interface {
	framework.PodNominator
	Add(pod *v1.Pod) error
	AddUnschedulableIfNotPresent(pod *framework.QueuedPodInfo, podSchedulingCycle int64) error
	...
	Run()
}

It embeds the PodNominator interface from /pkg/scheduler/framework/v1alpha1/interface.go

type PodNominator interface {
	AddNominatedPod(pod *v1.Pod, nodeName string)
	DeleteNominatedPodIfExists(pod *v1.Pod)
	UpdateNominatedPod(oldPod, newPod *v1.Pod)
	NominatedPodsForNode(nodeName string) []*v1.Pod
}

The sched.SchedulingQueue.AddNominatedPod call in recordSchedulingFailure is one of these PodNominator methods. Its purpose is to record the Pods nominated onto a node, so that during the filter phase addNominatedPods can temporarily add those nominated Pods to the node and check whether the Pod being scheduled can still fit on it.
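
The default PodNominator implementation is essentially a map from node name to the Pods nominated onto it, guarded by a lock. A simplified, self-contained sketch of that idea (the real nominatedPodMap also keeps a pod-UID-to-node index for updates and deletions):

package example

import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// nominator is a stripped-down stand-in for the scheduler's nominatedPodMap:
// it only remembers which Pods were nominated onto which node.
type nominator struct {
	mu          sync.RWMutex
	podsPerNode map[string][]*v1.Pod
}

func (n *nominator) AddNominatedPod(pod *v1.Pod, nodeName string) {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.podsPerNode == nil {
		n.podsPerNode = map[string][]*v1.Pod{}
	}
	n.podsPerNode[nodeName] = append(n.podsPerNode[nodeName], pod)
}

// NominatedPodsForNode is what addNominatedPods consults during the filter
// phase to temporarily place the nominated Pods onto the node being evaluated.
func (n *nominator) NominatedPodsForNode(nodeName string) []*v1.Pod {
	n.mu.RLock()
	defer n.mu.RUnlock()
	return n.podsPerNode[nodeName]
}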

The SchedulingQueue interface is implemented by the PriorityQueue struct in /pkg/scheduler/internal/queue/scheduling_queue.go, which contains three queues:

type PriorityQueue struct {
	...
	// activeQ is heap structure that scheduler actively looks at to find pods to
	// schedule. Head of heap is the highest priority pod.
	activeQ *heap.Heap
	// podBackoffQ is a heap ordered by backoff expiry. Pods which have completed backoff
	// are popped from this heap before the scheduler looks at activeQ
	podBackoffQ *heap.Heap
	// unschedulableQ holds pods that have been tried and determined unschedulable.
	unschedulableQ *UnschedulablePodsMap
	...
}

activeQ holds the pods that are ready to be scheduled immediately; the Pod popped via sched.NextPod in the previous article's scheduling flow comes from activeQ.

The remaining two queues, podBackoffQ and unschedulableQ, hold Pods that failed to schedule.

Pods land in activeQ from several sources. The external source is the podInformer watching the apiserver for pod creations; the call chain is:

informerFactory.Core().V1().Pods().Informer().AddEventHandler	/pkg/scheduler/eventhandlers.go
|--Scheduler.addPodToSchedulingQueue
|--sched.SchedulingQueue.Add(pod)

The internal sources are transfers from podBackoffQ and unschedulableQ. One place these transfers happen is Scheduler.Run, which calls sched.SchedulingQueue.Run(); that starts two periodically invoked functions, one moving pods from podBackoffQ to activeQ, and the other moving pods from unschedulableQ to activeQ or podBackoffQ.

The code is located at /pkg/scheduler/scheduler.go

func (sched *Scheduler) Run(ctx context.Context) {
	sched.SchedulingQueue.Run()
	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
	sched.SchedulingQueue.Close()
}

The code is located at /pkg/scheduler/internal/queue/scheduling_queue.go

func (p *PriorityQueue) Run() {
	// Keep moving Pods in podBackoffQ whose backoff time has expired into activeQ
	go wait.Until(p.flushBackoffQCompleted, 1.0*time.Second, p.stop)
	// Pick Pods that have been unschedulable for over a minute and, based on their
	// backoff time, move them to activeQ or podBackoffQ
	go wait.Until(p.flushUnschedulableQLeftover, 30*time.Second, p.stop)
}

func (p *PriorityQueue) flushBackoffQCompleted() {
	p.lock.Lock()
	defer p.lock.Unlock()
	for {
		rawPodInfo := p.podBackoffQ.Peek()
		if rawPodInfo == nil {
			return
		}
		pod := rawPodInfo.(*framework.QueuedPodInfo).Pod
		// A pod's backoff time grows exponentially
		boTime := p.getBackoffTime(rawPodInfo.(*framework.QueuedPodInfo))
		if boTime.After(p.clock.Now()) {
			return
		}
		_, err := p.podBackoffQ.Pop()
		if err != nil {
			klog.Errorf("Unable to pop pod %v from backoff queue despite backoff completion.", nsNameForPod(pod))
			return
		}
		p.activeQ.Add(rawPodInfo)
		metrics.SchedulerQueueIncomingPods.WithLabelValues("active", BackoffComplete).Inc()
		defer p.cond.Broadcast()
	}
}

func (p *PriorityQueue) flushUnschedulableQLeftover() {
	p.lock.Lock()
	defer p.lock.Unlock()

	var podsToMove []*framework.QueuedPodInfo
	currentTime := p.clock.Now()
	for _, pInfo := range p.unschedulableQ.podInfoMap {
		lastScheduleTime := pInfo.Timestamp
		if currentTime.Sub(lastScheduleTime) > unschedulableQTimeInterval {
			podsToMove = append(podsToMove, pInfo)
		}
	}
	if len(podsToMove) > 0 {
		p.movePodsToActiveOrBackoffQueue(podsToMove, UnschedulableTimeout)
	}
}

func (p *PriorityQueue) movePodsToActiveOrBackoffQueue(podInfoList []*framework.QueuedPodInfo, event string) {
	for _, pInfo := range podInfoList {
		pod := pInfo.Pod
		// Decide by backoff time: while boTime.After(p.clock.Now()) the pod stays in podBackoffQ;
		// once that is no longer true, flushBackoffQCompleted will move it to activeQ anyway
		if p.isPodBackingoff(pInfo) {
			if err := p.podBackoffQ.Add(pInfo); err != nil {
				klog.Errorf("Error adding pod %v to the backoff queue: %v", pod.Name, err)
			} else {
				metrics.SchedulerQueueIncomingPods.WithLabelValues("backoff", event).Inc()
				p.unschedulableQ.delete(pod)
			}
		} else {
			if err := p.activeQ.Add(pInfo); err != nil {
				klog.Errorf("Error adding pod %v to the scheduling queue: %v", pod.Name, err)
			} else {
				metrics.SchedulerQueueIncomingPods.WithLabelValues("active", event).Inc()
				p.unschedulableQ.delete(pod)
			}
		}
	}
	p.moveRequestCycle = p.schedulingCycle
	p.cond.Broadcast()
}

The only place that adds Pods to podBackoffQ and unschedulableQ is PriorityQueue's AddUnschedulableIfNotPresent method:

func (p *PriorityQueue) AddUnschedulableIfNotPresent(pInfo *framework.QueuedPodInfo, podSchedulingCycle int64) error {
	...
	if p.moveRequestCycle >= podSchedulingCycle {
		if err := p.podBackoffQ.Add(pInfo); err != nil {
			return fmt.Errorf("error adding pod %v to the backoff queue: %v", pod.Name, err)
		}
	} else {
		p.unschedulableQ.addOrUpdate(pInfo)
	}
	p.PodNominator.AddNominatedPod(pod, "")
	return nil
}

And the only caller of this method is sched.Error, invoked when scheduleOne fails to schedule a pod; the call chain is:

sched.recordSchedulingFailure	/pkg/scheduler/scheduler.go
|--sched.Error(podInfo, err)
|==MakeDefaultErrorFunc /pkg/scheduler/factory.go
|--podQueue.AddUnschedulableIfNotPresent

sched.recordSchedulingFailure is called from several places, all inside Scheduler.scheduleOne, anywhere from right after the filter or score phase up to just before binding to a node.

Summary

This article unpacked the scheduler's preemption mechanism. Preemption is triggered after a Pod fails the filter phase: the scheduler tries to pick, among the existing nodes, the ones it could preempt onto and the victim Pods that would need to be evicted; after a series of computations and filtering it selects the best node, evicts the victims on it, and puts the preempting Pod back into the scheduler's queue to wait for its next scheduling attempt. It also covered the scheduler's queue types and the scenarios in which pods enter and leave each queue.

If you are interested, you can read the other articles in my "k8s source code journey" series:

kubelet source code analysis: introduction and startup

kubelet source code analysis: starting a Pod

kubelet source code analysis: shutting down a Pod

kubelet source code analysis: watching Pod changes

scheduler source code analysis: the scheduling flow

scheduler source code analysis: preempt (preemption)

apiserver source code analysis: startup flow

apiserver source code analysis: handling requests

