Troubleshooting ETCD Failures in a Kubernetes Cluster

Date: 2022-12-24 19:59:33

Preface

A record of troubleshooting a Kubernetes cluster running on local virtual machines.

Two problems were fixed along the way: 1. expired Kubernetes certificates; 2. corrupted ETCD data.

Discovering the Problem

[root@xuegod63 ~]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?
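When kubectl falls back to localhost:8080 it usually means no usable kubeconfig was found at all; before suspecting the control plane, it is worth confirming what kubectl is reading (a quick sanity check, not from the original session):

echo $KUBECONFIG
ls -l ~/.kube/config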

Troubleshooting

1. Check the status of the docker and kubelet services

[root@xuegod63 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-12-23 04:32:33 EST; 23h ago
Docs: https://docs.docker.com
Main PID: 1574 (dockerd)
Tasks: 12
Memory: 98.0M
CGroup: /system.slice/docker.service
└─1574 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock

[root@xuegod63 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Sat 2022-12-24 03:51:24 EST; 5s ago
Docs: https://kubernetes.io/docs/
Process: 15700 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=255)
Main PID: 15700 (code=exited, status=255)

2. kubelet is clearly abnormal; after a failed restart attempt, inspect the detailed logs
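The restart attempt itself is not shown in the capture; it was just the standard systemd call:

systemctl restart kubelet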

[root@xuegod63 ~]# journalctl -xeu kubelet
Dec 24 03:52:36 xuegod63 kubelet[15795]: E1224 03:52:36.043715 15795 bootstrap.go:265] part of the existing bootstrap client certificate is expired: 2022-05-01 08:24:14 +0000 UTC
Dec 24 03:52:36 xuegod63 kubelet[15795]: F1224 03:52:36.043767 15795 server.go:269] failed to run Kubelet: unable to load bootstrap kubeconfig: stat /etc/kubernetes/bootstrap-kubelet.conf: no such file or
Dec 24 03:52:36 xuegod63 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Dec 24 03:52:36 xuegod63 kubelet[15795]: goroutine 1 [running]:
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc0002ce000, 0xb0, 0xb3)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70e94e0, 0xc000000003, 0x0, 0x0, 0xc0005f0380, 0x6f53cea, 0x9, 0x10d, 0x411b00)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70e94e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc000bd5940, 0x1, 0x1)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
Dec 24 03:52:36 xuegod63 kubelet[15795]: /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
Dec 24 03:52:36 xuegod63 kubelet[15795]: k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)

3. The logs point to an expired certificate; check the certificate expiration dates

kubeadm certs check-expiration
# or:
for item in $(find /etc/kubernetes/ -maxdepth 2 -name "*.crt"); do
    echo "==========$item=========="
    openssl x509 -in "$item" -text -noout | grep Not
done
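The loop above only covers the standalone *.crt files; the client certificates embedded in the kubeconfig files (admin.conf and friends) can be checked too. A sketch, assuming the standard kubeadm layout:

grep 'client-certificate-data' /etc/kubernetes/admin.conf | awk '{print $2}' \
    | base64 -d | openssl x509 -noout -enddate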

4. Renew the Kubernetes certificates

# The script below is from https://github.com/yuyicai/update-kube-cert
# Download the script
git clone https://github.com/yuyicai/update-kube-cert.git
cd update-kube-cert
chmod 755 update-kubeadm-cert.sh
./update-kubeadm-cert.sh all

# Certificate rollback
The script automatically backs up /etc/kubernetes to /etc/kubernetes.old-$(date +%Y%m%d) (example backup directory name: kubernetes.old-20200325).
If the renewal fails and you need to roll back, manually copy the /etc/kubernetes.old-$(date +%Y%m%d) backup over the /etc/kubernetes directory.
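A sketch of the rollback itself (the date suffix here is hypothetical; use the directory the script actually created):

cp -rf /etc/kubernetes.old-20221224/* /etc/kubernetes/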

5. Verify that the certificate expiration dates were updated

[root@xuegod63 ~]# kubeadm certs check-expiration
CERTIFICATE EXPIRES RESIDUAL TIME CERTIFICATE AUTHORITY EXTERNALLY MANAGED
admin.conf Dec 24, 2023 09:50 UTC 364d no
apiserver Dec 21, 2032 09:47 UTC 9y ca no
apiserver-etcd-client Dec 21, 2032 09:47 UTC 9y etcd-ca no
apiserver-kubelet-client Dec 21, 2032 09:47 UTC 9y ca no
controller-manager.conf Dec 24, 2023 09:50 UTC 364d no
etcd-healthcheck-client Dec 21, 2032 09:47 UTC 9y etcd-ca no
etcd-peer Dec 21, 2032 09:47 UTC 9y etcd-ca no
etcd-server Dec 21, 2032 09:47 UTC 9y etcd-ca no
front-proxy-client Dec 21, 2032 09:47 UTC 9y front-proxy-ca no
scheduler.conf Dec 24, 2023 09:50 UTC 364d no

CERTIFICATE AUTHORITY EXPIRES RESIDUAL TIME EXTERNALLY MANAGED
ca Apr 29, 2031 08:24 UTC 8y no
etcd-ca Apr 29, 2031 08:24 UTC 8y no
front-proxy-ca Apr 29, 2031 08:24 UTC 8y no

6. Regenerate the kubeconfig files

# Back up the old kubeconfig files
[root@xuegod63 ~]# mkdir /tmp/k8s
[root@xuegod63 ~]# mv /etc/kubernetes/*.conf /tmp/k8s/
# Regenerate all kubeconfig files
[root@xuegod63 ~]# kubeadm init phase kubeconfig all
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file

7. Restart the services and verify

# Restart docker and kubelet
[root@xuegod63 ~]# systemctl restart docker && systemctl restart kubelet
# The service starts successfully
[root@xuegod63 ~]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sat 2022-12-24 04:54:30 EST; 1min 34s ago
Docs: https://kubernetes.io/docs/
Main PID: 25514 (kubelet)
Tasks: 22
Memory: 56.4M
CGroup: /system.slice/kubelet.service
├─25514 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-inf...
└─27574 /opt/cni/bin/calico

Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.929825 25514 scope.go:111] [topologymanager] RemoveContainer - Container ID: 80edca1e3891b0c988ffe4812df55371313ce2872b1eb8c9b131c014ec666d72
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.973837 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.992489 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.992531 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: E1224 04:56:03.997385 25514 remote_runtime.go:332] ContainerStatus "714c60ef254de88c7970df58e3b7320c86a7b1f13ef43251166bdc206d24d623" from runtime serv...6bdc206d24d623
Dec 24 04:56:03 xuegod63 kubelet[25514]: E1224 04:56:03.997434 25514 kuberuntime_manager.go:980] getPodContainerStatuses for pod "etcd-xuegod63_kube-system(ef09464435a105e3b9a55b43476955ae)" f...6bdc206d24d623
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.997499 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:03 xuegod63 kubelet[25514]: I1224 04:56:03.997508 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:04 xuegod63 kubelet[25514]: I1224 04:56:04.270725 25514 kubelet.go:449] kubelet nodes not sync
Dec 24 04:56:04 xuegod63 kubelet[25514]: I1224 04:56:04.558293 25514 kubelet.go:449] kubelet nodes not sync

# The cluster still has a problem, though
[root@xuegod63 ~]# kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?

8. Continue troubleshooting: this time the culprit is corrupted ETCD data

# The kubelet log keeps printing "kubelet nodes not sync"
# Check the kube-apiserver logs
docker ps -a |grep -i api
docker logs -f 0a88e3a4226a
Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
I1224 10:14:17.637002 1 server.go:632] external host was not specified, using 192.168.124.63
I1224 10:14:17.637561 1 server.go:182] Version: v1.20.4
I1224 10:14:18.013836 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1224 10:14:18.013863 1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I1224 10:14:18.014008 1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I1224 10:14:18.014537 1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I1224 10:14:18.014572 1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I1224 10:14:18.016042 1 client.go:360] parsed scheme: "endpoint"
I1224 10:14:18.016082 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]
W1224 10:14:18.016589 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
I1224 10:14:19.015536 1 client.go:360] parsed scheme: "endpoint"
I1224 10:14:19.015645 1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379 <nil> 0 <nil>}]
W1224 10:14:19.016554 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:19.017501 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:20.018145 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
W1224 10:14:20.553762 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
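The reconnect loop above already shows that port 2379 refuses connections; a port check confirms nothing is even listening, i.e. etcd itself is down rather than merely unreachable (a sketch; ss ships with iproute2):

ss -lntp | grep 2379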


# The logs show the apiserver cannot reach ETCD on port 2379
# Checking the ETCD logs pinpoints corrupted ETCD data
[root@xuegod63 ~]# docker logs -f c6ee77a0ff89
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-12-24 10:16:19.354078 I | etcdmain: etcd Version: 3.4.13
2022-12-24 10:16:19.354121 I | etcdmain: Git SHA: ae9734ed2
2022-12-24 10:16:19.354125 I | etcdmain: Go Version: go1.12.17
2022-12-24 10:16:19.354127 I | etcdmain: Go OS/Arch: linux/amd64
2022-12-24 10:16:19.354130 I | etcdmain: setting maximum number of CPUs to 3, total number of available CPUs is 3
2022-12-24 10:16:19.354180 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2022-12-24 10:16:19.354239 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2022-12-24 10:16:19.355042 I | embed: name = xuegod63
2022-12-24 10:16:19.355065 I | embed: data dir = /var/lib/etcd
2022-12-24 10:16:19.355070 I | embed: member dir = /var/lib/etcd/member
2022-12-24 10:16:19.355072 I | embed: heartbeat = 100ms
2022-12-24 10:16:19.355074 I | embed: election = 1000ms
2022-12-24 10:16:19.355077 I | embed: snapshot count = 10000
2022-12-24 10:16:19.355083 I | embed: advertise client URLs = https://192.168.124.63:2379
2022-12-24 10:16:19.355086 I | embed: initial advertise peer URLs = https://192.168.124.63:2380
2022-12-24 10:16:19.355090 I | embed: initial cluster =
2022-12-24 10:16:19.822156 I | etcdserver: recovered store from snapshot at index 370037
2022-12-24 10:16:19.823857 C | etcdserver: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: recovering backend from snapshot error: failed to find database snapshot file (snap: snapshot file doesn't exist)
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xc0587e]

goroutine 1 [running]:
go.etcd.io/etcd/etcdserver.NewServer.func1(0xc000298e30, 0xc000296d80)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:334 +0x3e
panic(0xee6840, 0xc00006f480)
/usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc000195e20, 0x10c13b2, 0x2a, 0xc000296e50, 0x1, 0x1)
/home/ANT.AMAZON.COM/leegyuho/go/pkg/mod/github.com/coreos/pkg@v0.0.0-20160727233714-3ac0863d7acf/capnslog/pkg_logger.go:75 +0x135
go.etcd.io/etcd/etcdserver.NewServer(0x7ffc1b6d6e49, 0x8, 0x0, 0x0, 0x0, 0x0, 0xc00011af80, 0x1, 0x1, 0xc00011b100, ...)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdserver/server.go:464 +0x433c
go.etcd.io/etcd/embed.StartEtcd(0xc000260000, 0xc0000fa580, 0x0, 0x0)
/tmp/etcd-release-3.4.13/etcd/release/etcd/embed/etcd.go:214 +0x988
go.etcd.io/etcd/etcdmain.startEtcd(0xc000260000, 0x10963d6, 0x6, 0x1, 0xc0001f1140)
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:302 +0x40
go.etcd.io/etcd/etcdmain.startEtcdOrProxyV2()
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/etcd.go:144 +0x2ef9
go.etcd.io/etcd/etcdmain.Main()
/tmp/etcd-release-3.4.13/etcd/release/etcd/etcdmain/main.go:46 +0x38
main.main()
/tmp/etcd-release-3.4.13/etcd/release/etcd/main.go:28 +0x20
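The panic means the raft state recovered from the WAL references a backend snapshot file that is missing from the data directory. Listing the member directory makes the damage visible (a sketch, assuming the default kubeadm data dir):

ls -l /var/lib/etcd/member/snap/ /var/lib/etcd/member/wal/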

9. Repair ETCD

# Because there was no ETCD backup, the only way to bring the service back up is to delete the data files. Note: this wipes everything stored in etcd, so every pod and other cluster resource is lost and must be re-created.
rm -rf /var/lib/etcd/*
# Refresh the kubectl admin config as well (this is also what fixes the localhost:8080 error)
cp /etc/kubernetes/admin.conf ~/.kube/config
# Restart docker and kubelet; the problem is fixed
[root@xuegod63 ~]# systemctl restart docker && systemctl restart kubelet
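With an empty data directory, etcd re-initializes as a fresh single-node cluster and the API server comes back. A final check (expected to show a Ready node, since only the workloads were lost):

kubectl get nodes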

10. Appendix: ETCD backup and restore

# Backup (binary-deployed etcd environment)
# export ETCDCTL_API=3
# etcdctl --cert=/etc/etcd/ssl/etcd.pem --key=/etc/etcd/ssl/etcd-key.pem --cacert=/etc/kubernetes/ssl/ca.pem --endpoints=https://192.168.4.181:2379,https://192.168.4.182:2379,https://192.168.4.183:2379 snapshot save /mnt/aps/backup/etcd/snapshot$(date +%Y%m%d).db
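It is worth validating a snapshot right after taking it; snapshot status works on the local file and prints its hash, revision, key count, and size:

# etcdctl snapshot status /mnt/aps/backup/etcd/snapshot$(date +%Y%m%d).db --write-out=table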
# Restore
1. Stop the etcd service on all nodes
2. Back up the current data directory
3. Restore the data (adjust the flags to the actual environment) and run the restore on every etcd node, each with its own --name and --initial-advertise-peer-urls
# export ETCDCTL_API=3
# (snapshot restore operates on the local snapshot file, so --endpoints is not needed)
# etcdctl --cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem \
--cacert=/etc/kubernetes/ssl/ca.pem \
snapshot restore snapshot20200630.db \
--name=etcd1 \
--initial-cluster=etcd1=https://192.168.4.181:2380,etcd2=https://192.168.4.182:2380,etcd3=https://192.168.4.183:2380 \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://192.168.4.181:2380 \
--data-dir=/var/lib/etcddata
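For a kubeadm cluster like the one in this post (stacked etcd, certificates under /etc/kubernetes/pki/etcd), the equivalent backup would look like the following sketch; the target path is an arbitrary example:

# export ETCDCTL_API=3
# etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--endpoints=https://127.0.0.1:2379 \
snapshot save /var/backups/etcd-snapshot-$(date +%Y%m%d).db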