Removing and re-adding a failed OSD in Ceph (ceph-deploy)

Date: 2022-06-01 19:45:35

Today osd.5 would not come up; it turned out to be a failed disk.

[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id weight type name up/down reweight
-1 11.63 root default
-2 5.26 host ceph-osd-1
0 5.26 osd.0 up 1
-3 2.73 host ceph-osd-2
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
-4 3.64 host ceph-osd-3
4 1.82 osd.4 up 1
5 0.91 osd.5 down 1
6 0.91 osd.6 up 1
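
Before replacing anything, it is worth confirming that the problem really is the disk. A quick check on the OSD host (ceph-osd-3 here; /dev/sdX below is a placeholder for the actual data disk) might look like:

[root@ceph-osd-3 ~]# tail -n 50 /var/log/ceph/ceph-osd.5.log      # OSD daemon log, default path
[root@ceph-osd-3 ~]# dmesg | grep -iE 'error|sd' | tail -n 20     # kernel messages about the disk
[root@ceph-osd-3 ~]# smartctl -H /dev/sdX                         # SMART health, if smartmontools is installed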

After replacing the disk, delete osd.5 and then re-add it with ceph-deploy:

[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5

The activate step fails with the following error:

[10.10.200.165][WARNIN] Error EINVAL: entity osd.5 exists but key does not match
[10.10.200.165][WARNIN] Traceback (most recent call last):
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 2591, in <module>
[10.10.200.165][WARNIN] main()
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 2569, in main
[10.10.200.165][WARNIN] args.func(args)
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 1929, in main_activate
[10.10.200.165][WARNIN] init=args.mark_init,
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 1761, in activate_dir
[10.10.200.165][WARNIN] (osd_id, cluster) = activate(path, activate_key_template, init)
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 1897, in activate
[10.10.200.165][WARNIN] keyring=keyring,
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 1520, in auth_key
[10.10.200.165][WARNIN] 'mon', 'allow profile osd',
[10.10.200.165][WARNIN] File "/usr/sbin/ceph-disk", line 304, in command_check_call
[10.10.200.165][WARNIN] return subprocess.check_call(arguments)
[10.10.200.165][WARNIN] File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
[10.10.200.165][WARNIN] raise CalledProcessError(retcode, cmd)
[10.10.200.165][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.5', '-i', '/osd5/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
[10.10.200.165][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /osd5
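
The command that actually fails is the `ceph auth add osd.5` at the bottom of the traceback: the monitors still hold the old key for osd.5, and it no longer matches the keyring that prepare just generated in /osd5. This can be confirmed by comparing the two (assuming the admin keyring is usable on the admin node):

[root@ceph-osd-1 ceph-cluster]# ceph auth get osd.5                   # stale key still stored by the monitors
[root@ceph-osd-1 ceph-cluster]# ssh 10.10.200.165 cat /osd5/keyring   # freshly generated key on the OSD host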

Solution:

When removing the failed OSD, delete its auth entry as well:

[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.5
updated
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
removed osd.5
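
`ceph osd rm` only removes the OSD from the osdmap; its auth entry has to be deleted separately, which is exactly what was missed the first time. A fuller removal sequence for a failed OSD (sketched here assuming the osd.5 daemon is already stopped) would be:

[root@ceph-osd-1 ceph-cluster]# ceph osd out 5               # let data migrate off (no-op if already out)
[root@ceph-osd-1 ceph-cluster]# ceph osd crush remove osd.5  # drop it from the CRUSH map
[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.5          # delete the old key -- the missing step
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5                # finally remove it from the osdmap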

Then run prepare and activate again:

[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5
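
Activate marks the OSD with --mark-init sysvinit (as seen in the error above), so the daemon should be started by the sysvinit script on the OSD host. If osd.5 still shows down afterwards, check there first:

[root@ceph-osd-3 ~]# service ceph status osd.5               # daemon status on the OSD host
[root@ceph-osd-1 ceph-cluster]# ceph osd stat                # quick up/in summary from the admin node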

Check the OSD status:

[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id weight type name up/down reweight
-1 11.63 root default
-2 5.26 host ceph-osd-1
0 5.26 osd.0 up 1
-3 2.73 host ceph-osd-2
1 0.91 osd.1 up 1
2 0.91 osd.2 up 1
3 0.91 osd.3 up 1
-4 3.64 host ceph-osd-3
4 1.82 osd.4 up 1
5 0.91 osd.5 up 1
6 0.91 osd.6 up 1

Check the cluster status; recovery has already started:

[root@ceph-osd-1 ceph-cluster]# ceph -s
cluster 374b8b6b-8b47-4d14-af47-f383d42af2ba
health HEALTH_WARN 7 pgs backfill; 8 pgs backfilling; 1 pgs degraded; 6 pgs recovering; 34 pgs recovery_wait; 55 pgs stuck unclean; recovery 15753/191906 objects degraded (8.209%)
monmap e1: 1 mons at {ceph-osd-1=10.10.200.163:6789/0}, election epoch 1, quorum 0 ceph-osd-1
osdmap e371: 7 osds: 7 up, 7 in
pgmap v52585: 704 pgs, 7 pools, 251 GB data, 61957 objects
800 GB used, 11105 GB / 11905 GB avail
15753/191906 objects degraded (8.209%)
34 active+recovery_wait
649 active+clean
7 active+remapped+wait_backfill
1 active+degraded+remapped+backfilling
6 active+recovering
7 active+remapped+backfilling
recovery io 8150 kB/s, 1 objects/s
client io 2037 B/s wr, 0 op/s
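
Recovery and backfill will keep running until every PG is back to active+clean. To follow the progress without re-running ceph -s by hand:

[root@ceph-osd-1 ceph-cluster]# ceph -w                      # stream the cluster log, including recovery progress
[root@ceph-osd-1 ceph-cluster]# watch -n 2 ceph -s           # or just poll the status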