Handling a Lost k8s Master Node in Production

Date: 2022-12-04 07:18:59

Recovering a Master Node

Note: this guide applies only to k8s 1.17.x; it does not apply to k8s 1.8.x.

Sometimes a K8S master node has its operating system reinstalled because of a fault (for example, a failed system disk or data disk). The node is still present in the K8S cluster and in the database, but it is in an abnormal state. This section describes how to recover such a master node.

Current state

Suppose there are three master nodes: 20, 21, and 22. Host 22 had its operating system reinstalled after its system disk failed; after it came back up, the node is in an abnormal state, as shown below:

$ kubectl get node
NAME             STATUS     ROLES    AGE   VERSION
193.168.180.20   Ready      master   33d   v1.17.3
193.168.180.21   Ready      master   33d   v1.17.3
193.168.180.22   NotReady   master   33d   v1.17.3

Next, we walk through how to recover master node 22.

Note: do not delete this node from the management console or from the database.

1. Remove the failed member from the etcd cluster (run on 20 or 21)

Run the following command on a healthy master node to list the members of the etcd cluster:

$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member list
270a8f0f1a97d2da, started, 193.168.180.20, https://193.168.180.20:2380, https://193.168.180.20:2379, false
38282a6380e0dc71, started, 193.168.180.21, https://193.168.180.21:2380, https://193.168.180.21:2379, false
4ea4da720a30b3a2, started, 193.168.180.22, https://193.168.180.22:2380, https://193.168.180.22:2379, false

Then run the following command to check the health of each member; we can see that the 193.168.180.22 member is already unhealthy:

$ etcdctl --endpoints https://193.168.180.20:2379,https://193.168.180.21:2379,https://193.168.180.22:2379 --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt endpoint health
{"level":"warn","ts":"2021-06-21T10:36:11.576+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-057d4226-fd62-4ccc-82e6-c3f5752a6cc7/193.168.180.22:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 193.168.180.22:2379: connect: connection refused\""}
https://193.168.180.20:2379 is healthy: successfully committed proposal: took = 12.016486ms
https://193.168.180.21:2379 is healthy: successfully committed proposal: took = 12.273189ms
https://193.168.180.22:2379 is unhealthy: failed to commit proposal: context deadline exceeded
Error: unhealthy cluster

Run the following command to remove the failed member (the member ID can be found in the output of the member list command above):

$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member remove 4ea4da720a30b3a2
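
To verify the removal took effect, you can rerun the member list command from above; only the two remaining members should show up, along these lines:

$ etcdctl --key /etc/kubernetes/pki/apiserver-etcd-client.key --cert /etc/kubernetes/pki/apiserver-etcd-client.crt --cacert /etc/kubernetes/pki/etcd/ca.crt member list
270a8f0f1a97d2da, started, 193.168.180.20, https://193.168.180.20:2380, https://193.168.180.20:2379, false
38282a6380e0dc71, started, 193.168.180.21, https://193.168.180.21:2380, https://193.168.180.21:2379, false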

2. Clean up stale data on the node (run on 22)

First, clean up the stale Kubernetes data on the node:

$ sudo rm -rvf /etc/kubernetes/*

We also need to look up the failed node's cluster installation directory in the management console; in this example it is /dcos/data/docker.

etcd stores its data in the etcd subdirectory under that directory, so we also need to delete the etcd data directory:

$ sudo rm -rvf /dcos/data/docker/etcd

3. Install kubeadm (run on 22)

Copy the /usr/bin/kubeadm binary from another healthy master node to the /usr/bin/ directory on this host.
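
For example (assuming root SSH access between the masters; adjust the source host and credentials to your environment):

$ sudo scp root@193.168.180.20:/usr/bin/kubeadm /usr/bin/kubeadm
$ sudo chmod +x /usr/bin/kubeadm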

4. Install docker, kubelet, and kubectl (run on 22)

First, copy the /etc/yum.repos.d/ccse.repo file from a healthy master node to the /etc/yum.repos.d/ directory on this host.
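
For example (again assuming SSH access to a healthy master; ccse* is the repo id already used in this cluster):

$ sudo scp root@193.168.180.20:/etc/yum.repos.d/ccse.repo /etc/yum.repos.d/ccse.repo
$ sudo yum repolist --disablerepo=* --enablerepo=ccse*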

Then run the following command to install docker, kubelet, and kubectl:

$ sudo yum -y install docker-ce kubelet kubectl --disablerepo=* --enablerepo=ccse*

Configure docker and kubelet

On 22, create the directory /etc/docker, then copy the file /etc/docker/daemon.json from 20 or 21 into the /etc/docker directory on this host.

On 22, create the directory /usr/lib/systemd/system/kubelet.service.d, then copy the file /usr/lib/systemd/system/kubelet.service.d/10-kubeadm.conf from 20 or 21 into the corresponding directory on host 22. In that file, find the following content and change it to host 22's IP:

... --node-ip=193.168.180.22 ...

Also change the value of --root-dir to /dcos/data/docker/kubelet (/dcos/data/docker is node 22's cluster installation directory shown in the console, as noted above).
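
The exact layout of 10-kubeadm.conf varies between installations, so the snippet below is only a sketch of where the two flags typically end up (the variable name KUBELET_EXTRA_ARGS and the surrounding flags may differ in your file); keep the rest of the file exactly as copied from 20/21 and only adjust these values:

# illustrative only -- the variable name and other flags in your 10-kubeadm.conf may differ
Environment="KUBELET_EXTRA_ARGS=--node-ip=193.168.180.22 --root-dir=/dcos/data/docker/kubelet"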

Start docker and kubelet
$ sudo systemctl daemon-reload
$ sudo systemctl start docker kubelet
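
It is also worth enabling both services so that they come back up automatically after a reboot:

$ sudo systemctl enable docker kubelet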

5. Obtain the kubeadm join command (run on 20 or 21)

Run the following command on 20 or 21; it prints a certificate key like the string below:

$ sudo kubeadm init phase upload-certs --upload-certs
...
be6993d21541b255a5073fc6c22fa258bb5ebf2a50b7a375e26db1bf77cd9125

Run the following command on 20 or 21 to get the join command:

$ kubeadm token create --print-join-command
...
kubeadm join 193.168.180.89:6443 --token u7le8e.mmwip3749zr9xp6i --discovery-token-ca-cert-hash sha256:0f4b6057a1fa48bead573f3fe825c6bdeb7c5fc240b9514c759bbd0e20930cf7

6. Install the master (on node 22)

Run the following command on node 22 to install the master:

$ sudo kubeadm join 193.168.180.89:6443 \
--token u7le8e.mmwip3749zr9xp6i \
--discovery-token-ca-cert-hash sha256:0f4b6057a1fa48bead573f3fe825c6bdeb7c5fc240b9514c759bbd0e20930cf7 \
--control-plane \
--certificate-key be6993d21541b255a5073fc6c22fa258bb5ebf2a50b7a375e26db1bf77cd9125 \
--node-name 193.168.180.22 \
--apiserver-advertise-address 193.168.180.22

Then check the node status:

$ kubectl get node
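
Once kubelet on 22 has registered again (this can take a few minutes), the node should return to Ready state, roughly like this (the exact AGE shown for 22 may differ):

NAME             STATUS   ROLES    AGE   VERSION
193.168.180.20   Ready    master   33d   v1.17.3
193.168.180.21   Ready    master   33d   v1.17.3
193.168.180.22   Ready    master   33d   v1.17.3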

7. Install keepalived (on node 22)

Run the following command to install keepalived:

$ sudo yum -y install keepalived --disablerepo=* --enablerepo=ccse*

Create the directory /etc/keepalived, then copy all files under that directory from 20 or 21 to the same directory on host 22. Typically this includes the following files: keepalived.conf, keepalived_checkkubeapiserver.sh, and keepalived-kubernetes-external.conf (or keepalived-kubernetes-internal.conf).

If a keepalived-kubernetes-internal.conf (or keepalived-kubernetes-external.conf) file exists, find the following block in it and change it to the IPs of the other two master nodes:

unicast_peer {
    193.168.180.20
    193.168.180.21
}

Then start keepalived:

$ sudo systemctl daemon-reload
$ sudo systemctl start keepalived
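
To confirm that keepalived started cleanly, and to see whether this master currently holds the virtual IP (193.168.180.89 in the join command above), you can check, for example:

$ sudo systemctl status keepalived
$ ip addr | grep 193.168.180.89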

Notes

If the kubeadm join command in step 6 fails, remember to run sudo kubeadm reset --force first, then repeat steps 1 and 2 to clean up the data, and start over.