pod(八):pod的调度——将 Pod 指派给节点

时间:2022-11-06 19:07:36

服务器版本 docker软件版本 Kubernetes(k8s)集群版本 CPU架构
CentOS Linux release 7.4.1708 (Core) Docker version 20.10.12 v1.21.9 x86_64

Kubernetes集群架构:k8scloude1作为master节点,k8scloude2,k8scloude3作为worker节点

服务器 操作系统版本 CPU架构 进程 功能描述
k8scloude1/192.168.110.130 CentOS Linux release 7.4.1708 (Core) x86_64 docker,kube-apiserver,etcd,kube-scheduler,kube-controller-manager,kubelet,kube-proxy,coredns,calico k8s master节点
k8scloude2/192.168.110.129 CentOS Linux release 7.4.1708 (Core) x86_64 docker,kubelet,kube-proxy,calico k8s worker节点
k8scloude3/192.168.110.128 CentOS Linux release 7.4.1708 (Core) x86_64 docker,kubelet,kube-proxy,calico k8s worker节点

二.前言

本文介绍pod的调度,即如何让pod运行在Kubernetes集群的指定节点。

进行pod的调度的前提是已经有一套可以正常运行的Kubernetes集群,关于Kubernetes(k8s)集群的安装部署,可以查看博客《Centos7 安装部署Kubernetes(k8s)集群》https://www.cnblogs.com/renshengdezheli/p/16686769.html

三.pod的调度

3.1 pod的调度概述

你可以约束一个 Pod 以便 限制 其只能在特定的节点上运行, 或优先在特定的节点上运行。 有几种方法可以实现这点,推荐的方法都是用 标签选择算符来进行选择。 通常这样的约束不是必须的,因为调度器将自动进行合理的放置(比如,将 Pod 分散到节点上, 而不是将 Pod 放置在可用资源不足的节点上等等)。但在某些情况下,你可能需要进一步控制 Pod 被部署到哪个节点。例如,确保 Pod 最终落在连接了 SSD 的机器上, 或者将来自两个不同的服务且有大量通信的 Pods 被放置在同一个可用区。

你可以使用下列方法中的任何一种来选择 Kubernetes 对特定 Pod 的调度:

  • 与节点标签匹配的 nodeSelector
  • 亲和性与反亲和性
  • nodeName 字段
  • Pod 拓扑分布约束

3.2 pod自动调度

如果不手动指定pod运行在哪个节点上,k8s会自动调度pod的,k8s自动调度pod在哪个节点上运行考虑的因素有:

  • 待调度的pod列表
  • 可用的node列表
  • 调度算法:主机过滤,主机打分

3.2.1 创建3个主机端口为80的pod

查看hostPort字段的解释,hostPort字段表示把pod的端口映射到节点,即在节点上公开 Pod 的端口。

#主机端口映射:hostPort: 80
[root@k8scloude1 pod]# kubectl explain pods.spec.containers.ports.hostPort
KIND:     Pod
VERSION:  v1

FIELD:    hostPort <integer>

DESCRIPTION:
     Number of port to expose on the host. If specified, this must be a valid
     port number, 0 < x < 65536. If HostNetwork is specified, this must match
     ContainerPort. Most containers do not need this.

创建第一个pod,hostPort: 80表示把容器的80端口映射到节点的80端口

[root@k8scloude1 pod]# vim schedulepod.yaml

#kind: Pod表示资源类型为Pod   labels指定pod标签   metadata下面的name指定pod名字   containers下面全是容器的定义   
#image指定镜像名字  imagePullPolicy指定镜像下载策略   containers下面的name指定容器名
#resources指定容器资源(CPU,内存等)   env指定容器里的环境变量   dnsPolicy指定DNS策略
#restartPolicy容器重启策略    ports指定容器端口  containerPort容器端口  hostPort节点上的端口
[root@k8scloude1 pod]# cat schedulepod.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
  namespace: pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod.yaml 
pod/pod created

[root@k8scloude1 pod]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod    1/1     Running   0          6s

可以看到pod创建成功。

接下来创建第二个pod,hostPort: 80表示把容器的80端口映射到节点的80端口,两个pod只有pod名字不一样。

[root@k8scloude1 pod]# cp schedulepod.yaml schedulepod1.yaml 

[root@k8scloude1 pod]# vim schedulepod1.yaml 

[root@k8scloude1 pod]# cat schedulepod1.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod1.yaml 
pod/pod1 created

[root@k8scloude1 pod]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod    1/1     Running   0          11m
pod1   1/1     Running   0          5s

第二个pod创建成功,现在创建第三个pod。

开篇我们已经介绍过集群架构,Kubernetes集群架构:k8scloude1作为master节点,k8scloude2,k8scloude3作为worker节点,k8s集群只有2个worker节点,master节点默认不运行应用pod,主机端口80已经被占用两台worker节点全部占用,所以pod2无法运行。

[root@k8scloude1 pod]# sed 's/pod1/pod2/' schedulepod1.yaml | kubectl apply -f -
pod/pod2 created

#主机端口80已经被占用两台worker节点全部占用,pod2无法运行
[root@k8scloude1 pod]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod    1/1     Running   0          16m
pod1   1/1     Running   0          5m28s
pod2   0/1     Pending   0          5s

观察pod在k8s集群的分布情况,NODE显示pod运行在哪个节点

[root@k8scloude1 pod]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod    1/1     Running   0          18m
pod1   1/1     Running   0          7m28s

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod    1/1     Running   0          29m   10.244.251.208   k8scloude3   <none>           <none>
pod1   1/1     Running   0          18m   10.244.112.156   k8scloude2   <none>           <none>

删除pod

[root@k8scloude1 pod]# kubectl delete pod pod2 
pod "pod2" deleted

[root@k8scloude1 pod]# kubectl delete pod pod1 pod
pod "pod1" deleted
pod "pod" deleted

上面三个pod都是k8s自动调度的,下面我们手动指定pod运行在哪个节点。

3.3 使用nodeName 字段指定pod运行在哪个节点

使用nodeName 字段指定pod运行在哪个节点,这是一种比较直接的方式,nodeName 是 Pod 规约中的一个字段。如果 nodeName 字段不为空,调度器会忽略该 Pod, 而指定节点上的 kubelet 会尝试将 Pod 放到该节点上。 使用 nodeName 规则的优先级会高于使用 nodeSelector 或亲和性与非亲和性的规则

使用 nodeName 来选择节点的方式有一些局限性:

  • 如果所指代的节点不存在,则 Pod 无法运行,而且在某些情况下可能会被自动删除。
  • 如果所指代的节点无法提供用来运行 Pod 所需的资源,Pod 会失败, 而其失败原因中会给出是否因为内存或 CPU 不足而造成无法运行。
  • 在云环境中的节点名称并不总是可预测的,也不总是稳定的。

创建pod,nodeName: k8scloude3表示pod要运行在名为k8scloude3的节点

[root@k8scloude1 pod]# vim schedulepod2.yaml 

#kind: Pod表示资源类型为Pod   labels指定pod标签   metadata下面的name指定pod名字   containers下面全是容器的定义   
#image指定镜像名字  imagePullPolicy指定镜像下载策略   containers下面的name指定容器名
#resources指定容器资源(CPU,内存等)   env指定容器里的环境变量   dnsPolicy指定DNS策略
#restartPolicy容器重启策略    ports指定容器端口  containerPort容器端口  hostPort节点上的端口
#nodeName: k8scloude3指定pod在k8scloude3上运行
[root@k8scloude1 pod]# cat schedulepod2.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  nodeName: k8scloude3
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod2.yaml 
pod/pod1 created

可以看到pod运行在k8scloude3节点

[root@k8scloude1 pod]# kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          7s    10.244.251.209   k8scloude3   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.

[root@k8scloude1 pod]# kubectl get pods
No resources found in pod namespace.

创建pod,nodeName: k8scloude1让pod运行在k8scloude1节点

[root@k8scloude1 pod]# vim schedulepod3.yaml 

#kind: Pod表示资源类型为Pod   labels指定pod标签   metadata下面的name指定pod名字   containers下面全是容器的定义   
#image指定镜像名字  imagePullPolicy指定镜像下载策略   containers下面的name指定容器名
#resources指定容器资源(CPU,内存等)   env指定容器里的环境变量   dnsPolicy指定DNS策略
#restartPolicy容器重启策略    ports指定容器端口  containerPort容器端口  hostPort节点上的端口
#nodeName: k8scloude1让pod运行在k8scloude1节点
[root@k8scloude1 pod]# cat schedulepod3.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  nodeName: k8scloude1
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod3.yaml 
pod/pod1 created

可以看到pod运行在k8scloude1,注意k8scloude1是master节点,master节点一般不运行应用pod,并且k8scloude1有污点,一般来说,pod是不运行在有污点的主机上的,如果强制调度上去的话,pod的状态应该是pending,但是通过nodeName可以把一个pod调度到有污点的主机上正常运行的,比如nodeName指定pod运行在master上

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP              NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          47s   10.244.158.81   k8scloude1   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

3.4 使用节点标签nodeSelector指定pod运行在哪个节点

与很多其他 Kubernetes 对象类似,节点也有标签。 你可以手动地添加标签。 Kubernetes 也会为集群中所有节点添加一些标准的标签。

通过为节点添加标签,你可以准备让 Pod 调度到特定节点或节点组上。 你可以使用这个功能来确保特定的 Pod 只能运行在具有一定隔离性,安全性或监管属性的节点上。
nodeSelector 是节点选择约束的最简单推荐形式。你可以将 nodeSelector 字段添加到 Pod 的规约中设置你希望目标节点所具有的节点标签。 Kubernetes 只会将 Pod 调度到拥有你所指定的每个标签的节点上。nodeSelector 提供了一种最简单的方法来将 Pod 约束到具有特定标签的节点上。

3.4.1 查看标签

查看节点node的标签,标签的格式:键值对:xxxx/yyyy.aaaa=456123,xxxx1/yyyy1.aaaa=456123,--show-labels参数显示标签

[root@k8scloude1 pod]# kubectl get nodes --show-labels
NAME         STATUS   ROLES                  AGE    VERSION   LABELS
k8scloude1   Ready    control-plane,master   7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
k8scloude2   Ready    <none>                 7d     v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude2,kubernetes.io/os=linux
k8scloude3   Ready    <none>                 7d     v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude3,kubernetes.io/os=linux

查看namespace的标签

[root@k8scloude1 pod]# kubectl get ns --show-labels
NAME              STATUS   AGE    LABELS
default           Active   7d1h   kubernetes.io/metadata.name=default
kube-node-lease   Active   7d1h   kubernetes.io/metadata.name=kube-node-lease
kube-public       Active   7d1h   kubernetes.io/metadata.name=kube-public
kube-system       Active   7d1h   kubernetes.io/metadata.name=kube-system
ns1               Active   6d5h   kubernetes.io/metadata.name=ns1
ns2               Active   6d5h   kubernetes.io/metadata.name=ns2
pod               Active   4d2h   kubernetes.io/metadata.name=pod

查看pod的标签

[root@k8scloude1 pod]# kubectl get pod -A --show-labels 
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE    LABELS
kube-system   calico-kube-controllers-6b9fbfff44-4jzkj   1/1     Running   12         7d     k8s-app=calico-kube-controllers,pod-template-hash=6b9fbfff44
kube-system   calico-node-bdlgm                          1/1     Running   7          7d     controller-revision-hash=6b57d9cd54,k8s-app=calico-node,pod-template-generation=1
kube-system   calico-node-hx8bk                          1/1     Running   7          7d     controller-revision-hash=6b57d9cd54,k8s-app=calico-node,pod-template-generation=1
kube-system   calico-node-nsbfs                          1/1     Running   7          7d     controller-revision-hash=6b57d9cd54,k8s-app=calico-node,pod-template-generation=1
kube-system   coredns-545d6fc579-7wm95                   1/1     Running   7          7d1h   k8s-app=kube-dns,pod-template-hash=545d6fc579
kube-system   coredns-545d6fc579-87q8j                   1/1     Running   7          7d1h   k8s-app=kube-dns,pod-template-hash=545d6fc579
kube-system   etcd-k8scloude1                            1/1     Running   7          7d1h   component=etcd,tier=control-plane
kube-system   kube-apiserver-k8scloude1                  1/1     Running   11         7d1h   component=kube-apiserver,tier=control-plane
kube-system   kube-controller-manager-k8scloude1         1/1     Running   7          7d1h   component=kube-controller-manager,tier=control-plane
kube-system   kube-proxy-599xh                           1/1     Running   7          7d1h   controller-revision-hash=6795549d44,k8s-app=kube-proxy,pod-template-generation=1
kube-system   kube-proxy-lpj8z                           1/1     Running   7          7d1h   controller-revision-hash=6795549d44,k8s-app=kube-proxy,pod-template-generation=1
kube-system   kube-proxy-zxlk9                           1/1     Running   7          7d1h   controller-revision-hash=6795549d44,k8s-app=kube-proxy,pod-template-generation=1
kube-system   kube-scheduler-k8scloude1                  1/1     Running   7          7d1h   component=kube-scheduler,tier=control-plane
kube-system   metrics-server-bcfb98c76-k5dmj             1/1     Running   6          6d5h   k8s-app=metrics-server,pod-template-hash=bcfb98c76

3.4.2 创建标签

以node-role.kubernetes.io/control-plane= 标签为例,键是node-role.kubernetes.io/control-plane,值为空。

创建标签的语法:kubectl label 对象类型 对象名 键=值

给k8scloude2节点设置标签

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude2
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl get nodes --show-labels
NAME         STATUS   ROLES                  AGE    VERSION   LABELS
k8scloude1   Ready    control-plane,master   7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
k8scloude2   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,k8snodename=k8scloude2,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude2,kubernetes.io/os=linux
k8scloude3   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude3,kubernetes.io/os=linux

k8scloude2节点删除标签

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename-
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl get nodes --show-labels
NAME         STATUS   ROLES                  AGE    VERSION   LABELS
k8scloude1   Ready    control-plane,master   7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
k8scloude2   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude2,kubernetes.io/os=linux
k8scloude3   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude3,kubernetes.io/os=linux

列出含有标签k8snodename=k8scloude2的节点

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude2

#列出含有标签k8snodename=k8scloude2的节点
[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude2
NAME         STATUS   ROLES    AGE    VERSION
k8scloude2   Ready    <none>   7d1h   v1.21.0

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename-
node/k8scloude2 labeled

对所有节点设置标签

[root@k8scloude1 pod]# kubectl label nodes --all k8snodename=cloude
node/k8scloude1 labeled
node/k8scloude2 labeled
node/k8scloude3 labeled

列出含有标签k8snodename=cloude的节点

#列出含有标签k8snodename=cloude的节点
[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=cloude
NAME         STATUS   ROLES                  AGE    VERSION
k8scloude1   Ready    control-plane,master   7d1h   v1.21.0
k8scloude2   Ready    <none>                 7d1h   v1.21.0
k8scloude3   Ready    <none>                 7d1h   v1.21.0

#删除标签
[root@k8scloude1 pod]# kubectl label nodes --all k8snodename-
node/k8scloude1 labeled
node/k8scloude2 labeled
node/k8scloude3 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=cloude
No resources found

--overwrite参数,标签的覆盖

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude2
node/k8scloude2 labeled

#标签的覆盖
[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude
error: 'k8snodename' already has a value (k8scloude2), and --overwrite is false

#--overwrite参数,标签的覆盖
[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude --overwrite
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude2
No resources found

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude
NAME         STATUS   ROLES    AGE    VERSION
k8scloude2   Ready    <none>   7d1h   v1.21.0

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename-
node/k8scloude2 labeled

Tips如果不想在k8scloude1的ROLES里看到control-plane,则可以通过取消标签达到目的:kubectl label nodes k8scloude1 node-role.kubernetes.io/control-plane- 进行取消标签

[root@k8scloude1 pod]# kubectl get nodes --show-labels
NAME         STATUS   ROLES                  AGE    VERSION   LABELS
k8scloude1   Ready    control-plane,master   7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude1,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node-role.kubernetes.io/master=,node.kubernetes.io/exclude-from-external-load-balancers=
k8scloude2   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude2,kubernetes.io/os=linux
k8scloude3   Ready    <none>                 7d1h   v1.21.0   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=k8scloude3,kubernetes.io/os=linux

[root@k8scloude1 pod]# kubectl label nodes k8scloude1 node-role.kubernetes.io/control-plane-

3.4.3 通过标签控制pod在哪个节点运行

给k8scloude2节点打上标签k8snodename=k8scloude2

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename=k8scloude2
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude2
NAME         STATUS   ROLES    AGE    VERSION
k8scloude2   Ready    <none>   7d1h   v1.21.0

[root@k8scloude1 pod]# kubectl get pods
No resources found in pod namespace.

创建pod,nodeSelector:k8snodename: k8scloude2 指定pod运行在标签为k8snodename=k8scloude2的节点上

[root@k8scloude1 pod]# vim schedulepod4.yaml

[root@k8scloude1 pod]# cat schedulepod4.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  nodeSelector:
    k8snodename: k8scloude2
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod4.yaml 
pod/pod1 created

可以看到pod运行在k8scloude2节点

[root@k8scloude1 pod]# kubectl get pod -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          21s   10.244.112.158   k8scloude2   <none>           <none>

删除pod,删除标签

[root@k8scloude1 pod]# kubectl get pod --show-labels
NAME   READY   STATUS    RESTARTS   AGE   LABELS
pod1   1/1     Running   0          32m   run=pod1

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

[root@k8scloude1 pod]# kubectl get pod --show-labels
No resources found in pod namespace.

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 k8snodename-
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude2
No resources found

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude
No resources found

注意:如果两台主机的标签是一致的,那么通过在这两台机器上进行打分,哪个机器分高,pod就运行在哪个pod上

给k8s集群的master节点打标签

[root@k8scloude1 pod]# kubectl label nodes k8scloude1 k8snodename=k8scloude1
node/k8scloude1 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude1
NAME         STATUS   ROLES                  AGE    VERSION
k8scloude1   Ready    control-plane,master   7d2h   v1.21.0

创建pod,nodeSelector:k8snodename: k8scloude1 指定pod运行在标签为k8snodename=k8scloude1的节点上

[root@k8scloude1 pod]# vim schedulepod5.yaml 

[root@k8scloude1 pod]# cat schedulepod5.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  nodeSelector:
    k8snodename: k8scloude1
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f schedulepod5.yaml 
pod/pod1 created

因为k8scloude1上有污点,所以pod不能运行在k8scloude1上,pod状态为Pending

[root@k8scloude1 pod]# kubectl get pod
NAME   READY   STATUS    RESTARTS   AGE
pod1   0/1     Pending   0          9s

删除pod,删除标签

[root@k8scloude1 pod]# kubectl delete pod pod1 
pod "pod1" deleted

[root@k8scloude1 pod]# kubectl get pod
No resources found in pod namespace.

[root@k8scloude1 pod]# kubectl label nodes k8scloude1 k8snodename-
node/k8scloude1 labeled

[root@k8scloude1 pod]# kubectl get nodes -l k8snodename=k8scloude1
No resources found

3.5 使用亲和性与反亲和性调度pod

nodeSelector 提供了一种最简单的方法来将 Pod 约束到具有特定标签的节点上。 亲和性和反亲和性扩展了你可以定义的约束类型。使用亲和性与反亲和性的一些好处有:

  • 亲和性、反亲和性语言的表达能力更强。nodeSelector 只能选择拥有所有指定标签的节点。 亲和性、反亲和性为你提供对选择逻辑的更强控制能力。

  • 你可以标明某规则是“软需求”或者“偏好”,这样调度器在无法找到匹配节点时仍然调度该 Pod。

  • 你可以使用节点上(或其他拓扑域中)运行的其他 Pod 的标签来实施调度约束, 而不是只能使用节点本身的标签。这个能力让你能够定义规则允许哪些 Pod 可以被放置在一起。

亲和性功能由两种类型的亲和性组成:

  • 节点亲和性功能类似于 nodeSelector 字段,但它的表达能力更强,并且允许你指定软规则。
  • Pod 间亲和性/反亲和性允许你根据其他 Pod 的标签来约束 Pod。

节点亲和性概念上类似于 nodeSelector, 它使你可以根据节点上的标签来约束 Pod 可以调度到哪些节点上。 节点亲和性有两种:

  • requiredDuringSchedulingIgnoredDuringExecution: 调度器只有在规则被满足的时候才能执行调度。此功能类似于 nodeSelector, 但其语法表达能力更强。
  • preferredDuringSchedulingIgnoredDuringExecution: 调度器会尝试寻找满足对应规则的节点。如果找不到匹配的节点,调度器仍然会调度该 Pod。

在上述类型中,IgnoredDuringExecution 意味着如果节点标签在 Kubernetes 调度 Pod 后发生了变更,Pod 仍将继续运行

你可以使用 Pod 规约中的 .spec.affinity.nodeAffinity 字段来设置节点亲和性。

查看nodeAffinity字段解释

[root@k8scloude1 pod]# kubectl explain pods.spec.affinity.nodeAffinity 
KIND:     Pod
VERSION:  v1

RESOURCE: nodeAffinity <Object>

DESCRIPTION:
     Describes node affinity scheduling rules for the pod.

     Node affinity is a group of node affinity scheduling rules.

FIELDS:
#软策略
   preferredDuringSchedulingIgnoredDuringExecution	<[]Object>
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node that
     violates one or more of the expressions. The node that is most preferred is
     the one with the greatest sum of weights, i.e. for each node that meets all
     of the scheduling requirements (resource request, requiredDuringScheduling
     affinity expressions, etc.), compute a sum by iterating through the
     elements of this field and adding "weight" to the sum if the node matches
     the corresponding matchExpressions; the node(s) with the highest sum are
     the most preferred.

#硬策略
   requiredDuringSchedulingIgnoredDuringExecution	<Object>
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements specified by this field cease to be met at some point
     during pod execution (e.g. due to an update), the system may or may not try
     to eventually evict the pod from its node.

3.5.1 使用硬策略requiredDuringSchedulingIgnoredDuringExecution

创建pod,requiredDuringSchedulingIgnoredDuringExecution参数表示:节点必须包含一个键名为 kubernetes.io/hostname 的标签, 并且该标签的取值必须k8scloude2k8scloude3

你可以使用 operator 字段来为 Kubernetes 设置在解释规则时要使用的逻辑操作符。 你可以使用 In、NotIn、Exists、DoesNotExist、Gt 和 Lt 之一作为操作符。NotIn 和 DoesNotExist 可用来实现节点反亲和性行为。 你也可以使用节点污点 将 Pod 从特定节点上驱逐。

注意:

  • 如果你同时指定了 nodeSelector 和 nodeAffinity,两者 必须都要满足, 才能将 Pod 调度到候选节点上。
  • 如果你指定了多个与 nodeAffinity 类型关联的 nodeSelectorTerms, 只要其中一个 nodeSelectorTerms 满足的话,Pod 就可以被调度到节点上。
  • 如果你指定了多个与同一 nodeSelectorTerms 关联的 matchExpressions, 则只有当所有 matchExpressions 都满足时 Pod 才可以被调度到节点上。
[root@k8scloude1 pod]# vim requiredDuringSchedule.yaml 

 #硬策略
[root@k8scloude1 pod]# cat requiredDuringSchedule.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - k8scloude2
            - k8scloude3
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f requiredDuringSchedule.yaml 
pod/pod1 created

可以看到pod运行在k8scloude3节点

[root@k8scloude1 pod]# kubectl get pods
NAME   READY   STATUS    RESTARTS   AGE
pod1   1/1     Running   0          6s

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          10s   10.244.251.212   k8scloude3   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

创建pod,requiredDuringSchedulingIgnoredDuringExecution参数表示:节点必须包含一个键名为 kubernetes.io/hostname 的标签, 并且该标签的取值必须k8scloude4k8scloude5

[root@k8scloude1 pod]# vim requiredDuringSchedule1.yaml 

[root@k8scloude1 pod]# cat requiredDuringSchedule1.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: In
            values: 
            - k8scloude4
            - k8scloude5
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f requiredDuringSchedule1.yaml 
pod/pod1 created

由于requiredDuringSchedulingIgnoredDuringExecution是硬策略,k8scloude4,k8scloude5不满足条件,所以pod创建失败

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
pod1   0/1     Pending   0          7s    <none>   <none>   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

3.5.2 使用软策略preferredDuringSchedulingIgnoredDuringExecution

给节点打标签

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 xx=72
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl label nodes k8scloude3 xx=59
node/k8scloude3 labeled

创建pod,preferredDuringSchedulingIgnoredDuringExecution参数表示:节点最好具有一个键名为 xx 且取值大于 60 的标签。

[root@k8scloude1 pod]# vim preferredDuringSchedule.yaml 

[root@k8scloude1 pod]# cat preferredDuringSchedule.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 2
        preference:
          matchExpressions:
          - key: xx
            operator: Gt
            values:
            - "60"
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f preferredDuringSchedule.yaml 
pod/pod1 created

可以看到pod运行在k8scloude2,因为k8scloude2标签为 xx=72,72大于60

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          13s   10.244.112.159   k8scloude2   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

创建pod,preferredDuringSchedulingIgnoredDuringExecution参数表示:节点最好具有一个键名为 xx 且取值大于 600 的标签。

[root@k8scloude1 pod]# vim preferredDuringSchedule1.yaml 

[root@k8scloude1 pod]# cat preferredDuringSchedule1.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 2
        preference:
          matchExpressions:
          - key: xx
            operator: Gt
            values:
            - "600"
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f preferredDuringSchedule1.yaml 
pod/pod1 created

因为preferredDuringSchedulingIgnoredDuringExecution是软策略,尽管k8scloude2,k8scloude3都不满足xx>600,但是还是能成功创建pod

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          7s    10.244.251.213   k8scloude3   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

3.5.3 节点亲和性权重

你可以为 preferredDuringSchedulingIgnoredDuringExecution 亲和性类型的每个实例设置 weight 字段,其取值范围是 1 到 100。 当调度器找到能够满足 Pod 的其他调度请求的节点时,调度器会遍历节点满足的所有的偏好性规则, 并将对应表达式的 weight 值加和。最终的加和值会添加到该节点的其他优先级函数的评分之上。 在调度器为 Pod 作出调度决定时,总分最高的节点的优先级也最高。

给节点打标签

[root@k8scloude1 pod]# kubectl label nodes k8scloude2 yy=59
node/k8scloude2 labeled

[root@k8scloude1 pod]# kubectl label nodes k8scloude3 yy=72
node/k8scloude3 labeled

创建pod,preferredDuringSchedulingIgnoredDuringExecution指定了2条软策略,但是权重不一样:weight: 2 和 weight: 10

[root@k8scloude1 pod]# vim preferredDuringSchedule2.yaml 

[root@k8scloude1 pod]# cat preferredDuringSchedule2.yaml 
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod1
  name: pod1
  namespace: pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 2
        preference:
          matchExpressions:
          - key: xx
            operator: Gt
            values:
            - "60"
      - weight: 10
        preference:
          matchExpressions:
          - key: yy
            operator: Gt
            values:
            - "60"
  containers:
  - image: nginx
    imagePullPolicy: IfNotPresent
    name: pod1
    resources: {}
    ports:
    - name: http
      containerPort: 80
      protocol: TCP
      hostPort: 80
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

[root@k8scloude1 pod]# kubectl apply -f preferredDuringSchedule2.yaml 
pod/pod1 created

存在两个候选节点,因为yy>60这条规则的weight权重大,所以pod运行在k8scloude3

[root@k8scloude1 pod]# kubectl get pods -o wide
NAME   READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
pod1   1/1     Running   0          10s   10.244.251.214   k8scloude3   <none>           <none>

[root@k8scloude1 pod]# kubectl delete pod pod1 --force
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod1" force deleted

3.6 Pod 拓扑分布约束

你可以使用 拓扑分布约束(Topology Spread Constraints) 来控制 Pod 在集群内故障域之间的分布, 故障域的示例有区域(Region)、可用区(Zone)、节点和其他用户自定义的拓扑域。 这样做有助于提升性能、实现高可用或提升资源利用率。