Recovering a Master Node
Note: this tutorial applies only to k8s 1.17.x; it does not apply to k8s 1.8.x.
Sometimes a K8S master node has its operating system reinstalled because of a failure (for example, a broken system disk or data disk). The node is then still present in the K8S cluster and in the database, but in an abnormal state. This section describes how to recover such a master node.
Current state
Suppose there are three master nodes: 20, 21, and 22. Host 22's system disk was wiped by an OS reinstall; after the host came back up, the node went into an abnormal state:
$ kubectl get node
NAME             STATUS     ROLES    AGE   VERSION
193.168.180.20   Ready      master   33d   v1.17.3
193.168.180.21   Ready      master   33d   v1.17.3
193.168.180.22   NotReady   master   33d   v1.17.3
Next, we'll walk through how to recover master node 22.
Note: do not delete this node from the management console or from the database.
Step 1: Remove the failed member from the etcd cluster (run on 20 or 21)
On a healthy master node, run the following command to list the etcd cluster's members:
$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member list
270a8f0f1a97d2da, started, 193.168.180.20, https://193.168.180.20:2380, https://193.168.180.20:2379, false
38282a6380e0dc71, started, 193.168.180.21, https://193.168.180.21:2380, https://193.168.180.21:2379, false
4ea4da720a30b3a2, started, 193.168.180.22, https://193.168.180.22:2380, https://193.168.180.22:2379, false
Then run the following command to check the health of each member; we can see that member 193.168.180.22 is unhealthy:
$ etcdctl --endpoints https://193.168.180.20:2379,https://193.168.180.21:2379,https://193.168.180.22:2379 --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt endpoint health
{"level":"warn","ts":"2021-06-21T10:36:11.576+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-057d4226-fd62-4ccc-82e6-c3f5752a6cc7/193.168.180.22:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 193.168.180.22:2379: connect: connection refused\""}
https://193.168.180.20:2379 is healthy: successfully committed proposal: took = 12.016486ms
https://193.168.180.21:2379 is healthy: successfully committed proposal: took = 12.273189ms
https://193.168.180.22:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster
Run the following command to remove the failed member (the member ID can be found in the output of the member list command above):
$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member remove 4ea4da720a30b3a2
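To confirm the removal, you can list the members again with the same command as above; only the two healthy members should remain:
$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member list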
Step 2: Clean up stale data on the node (run on 22)
First, clean up the stale data on the node:
$ sudo rm -rvf /etc/kubernetes/*
In addition, find the failed node's cluster installation directory on the console page; in this example it is /dcos/data/docker.
The etcd data is stored in the etcd subdirectory of that directory, so we also need to delete the etcd data directory:
$ sudo rm -rvf /dcos/data/docker/etcd
Step 3: Install kubeadm (run on 22)
Copy the /usr/bin/kubeadm binary from another healthy master node to the /usr/bin/ directory on this host.
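For example, with scp (a sketch, assuming root SSH access from 22 to a healthy master; adjust the user and source host as needed):
$ sudo scp root@193.168.180.20:/usr/bin/kubeadm /usr/bin/kubeadm
$ sudo chmod +x /usr/bin/kubeadm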
Step 4: Install docker, kubelet, and kubectl (run on 22)
First, copy the /etc/yum.repos.d/ccse.repo file from a healthy master node to the /etc/yum.repos.d/ directory on this host.
Then run the following command to install docker, kubelet, and kubectl:
$ sudo yum -y install docker-ce kubelet kubectl --disablerepo=* --enablerepo=ccse*
Configure docker and kubelet
On 22, create the directory /etc/docker, then copy the file /etc/docker/daemon.json from 20 or 21 into that directory on this host.
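A minimal sketch of this step, again assuming root SSH access to a healthy master:
$ sudo mkdir -p /etc/docker
$ sudo scp root@193.168.180.20:/etc/docker/daemon.json /etc/docker/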
Likewise, create the directory /usr/lib/systemd/system/kubelet.service.d on 22, then copy the file /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf from 20 or 21 into the same directory on host 22. Then find the following setting in the file and change it to host 22's IP:
... --node-ip=193.168.180.22 ...
Also change the value of --root-dir to /dcos/data/docker/kubelet (/dcos/data/docker is node 22's cluster installation directory shown on the console page, as noted above).
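After editing, the relevant kubelet arguments in 10-kubeadm.conf should end up looking roughly like this (a sketch only; keep whatever other flags your file already contains):
Environment="KUBELET_EXTRA_ARGS=--node-ip=193.168.180.22 --root-dir=/dcos/data/docker/kubelet"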
Start docker and kubelet
$ sudo systemctl daemon-reload
$ sudo systemctl start docker kubelet
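Optionally, enable both services so they come back after a reboot, and verify that they started cleanly:
$ sudo systemctl enable docker kubelet
$ sudo systemctl status docker kubelet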
Step 5: Get the kubeadm join command (run on 20 or 21)
Run the following command on 20 or 21; it outputs a certificate key like the string below:
$ sudo kubeadm init phase upload-certs --upload-certs
...
be6993d21541b255a5073fc6c22fa258bb5ebf2a50b7a375e26db1bf77cd9125
Then run the following command on 20 or 21 to get the join command:
$ kubeadm token create --print-join-command
...
kubeadm join 193.168.180.89:6443 --token u7le8e.mmwip3749zr9xp6i --discovery-token-ca-cert-hash sha256:0f4b6057a1fa48bead573f3fe825c6bdeb7c5fc240b9514c759bbd0e20930cf7
Step 6: Install the master (run on 22)
Run the following command on node 22 to install the master:
$ sudo kubeadm join 193.168.180.89:6443 \
--token u7le8e.mmwip3749zr9xp6i \
--discovery-token-ca-cert-hash sha256:0f4b6057a1fa48bead573f3fe825c6bdeb7c5fc240b9514c759bbd0e20930cf7 \
--control-plane \
--certificate-key be6993d21541b255a5073fc6c22fa258bb5ebf2a50b7a375e26db1bf77cd9125 \
--node-name 193.168.180.22 \
--apiserver-advertise-address 193.168.180.22
Then check the node status:
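Once the join completes it may take a minute or two for the node to become Ready; the output should then look roughly like this (the AGE of node 22 will reflect when it rejoined):
$ kubectl get node
NAME             STATUS   ROLES    AGE   VERSION
193.168.180.20   Ready    master   33d   v1.17.3
193.168.180.21   Ready    master   33d   v1.17.3
193.168.180.22   Ready    master   2m    v1.17.3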
Step 7: Install keepalived (run on 22)
Run the following command to install keepalived:
$ sudo yum -y install keepalived --disablerepo=* --enablerepo=ccse*
Create the directory /etc/keepalived, then copy all the files under that directory on 20 or 21 into the same directory on host 22. Typically these are: keepalived.conf, keepalived_checkkubeapiserver.sh, and keepalived-kubernetes-external.conf (or keepalived-kubernetes-internal.conf).
If the keepalived-kubernetes-internal.conf (or keepalived-kubernetes-external.conf) file exists, find the following block in it and change the entries to the IPs of the other two master nodes:
unicast_peer {
    193.168.180.20
    193.168.180.21
}
Then start keepalived:
$ sudo systemctl daemon-reload
$ sudo systemctl start keepalived
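To verify, check the service status; on whichever master currently holds the virtual IP (assuming 193.168.180.89, the apiserver address used in the join command, is the keepalived VIP), the address should also show up in ip addr:
$ sudo systemctl status keepalived
$ ip addr | grep 193.168.180.89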
Note
If the kubeadm join command in Step 6 fails, remember to run sudo kubeadm reset --force first, then redo Step 1 and Step 2 to clean up the old data, and start over.