Removing and Re-adding a Failed OSD in Ceph (ceph-deploy)

Date: 2021-10-26 03:28:48

Today osd.5 would not start, and the cause turned out to be a failed disk.

[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id    weight  type name       up/down reweight
-1      11.63   root default
-2      5.26            host ceph-osd-1
0       5.26                    osd.0   up      1
-3      2.73            host ceph-osd-2
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
3       0.91                    osd.3   up      1
-4      3.64            host ceph-osd-3
4       1.82                    osd.4   up      1
5       0.91                    osd.5   down    1
6       0.91                    osd.6   up      1
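On a larger cluster it can be tedious to spot the down OSD by eye. The following is a minimal sketch that filters the tree output for OSDs marked down. The function name list_down_osds is hypothetical, and the field positions assume the column layout shown above (id, weight, name, up/down, reweight); newer Ceph releases print different columns, so the field numbers would need adjusting.

```shell
#!/bin/sh
# list_down_osds: read `ceph osd tree` text on stdin and print the
# names of OSDs whose status column is "down".
# Assumption: rows look like "5  0.91  osd.5  down  1", as in the
# output above. Bucket rows (root/host) have negative ids and are skipped.
list_down_osds() {
    awk '$1 ~ /^[0-9]+$/ && $4 == "down" { print $3 }'
}

# Usage on a live cluster:
#   ceph osd tree | list_down_osds
```

Reading from stdin keeps the filter separate from the ceph call, so it can also be run against saved output.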

After replacing the disk, I removed osd.5 and then tried to re-add it with ceph-deploy:

[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5

The activate step failed with the following error:

[10.10.200.165][WARNIN] Error EINVAL: entity osd.5 exists but key does not match
[10.10.200.165][WARNIN] Traceback (most recent call last):
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 2591, in <module>
[10.10.200.165][WARNIN]     main()
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 2569, in main
[10.10.200.165][WARNIN]     args.func(args)
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1929, in main_activate
[10.10.200.165][WARNIN]     init=args.mark_init,
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1761, in activate_dir
[10.10.200.165][WARNIN]     (osd_id, cluster) = activate(path, activate_key_template, init)
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1897, in activate
[10.10.200.165][WARNIN]     keyring=keyring,
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1520, in auth_key
[10.10.200.165][WARNIN]     'mon', 'allow profile osd',
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 304, in command_check_call
[10.10.200.165][WARNIN]     return subprocess.check_call(arguments)
[10.10.200.165][WARNIN]   File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
[10.10.200.165][WARNIN]     raise CalledProcessError(retcode, cmd)
[10.10.200.165][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.5', '-i', '/osd5/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
[10.10.200.165][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /osd5
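The first line of the trace, "entity osd.5 exists but key does not match", suggests that the monitors still hold an auth entry for osd.5. One way to confirm this is to ask the cluster for that entry. The sketch below wraps ceph auth get in a small check. The function name osd_key_exists and the CEPH override variable are hypothetical helpers for illustration only.

```shell
#!/bin/sh
# osd_key_exists ID: succeed if the monitors still hold a cephx key
# for osd.ID, fail otherwise.
# CEPH is a hypothetical override (defaults to the real ceph binary)
# so the function can be exercised without a cluster.
osd_key_exists() {
    "${CEPH:-ceph}" auth get "osd.$1" >/dev/null 2>&1
}

# Usage on a live cluster:
#   if osd_key_exists 5; then echo "stale key for osd.5 is still registered"; fi
```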

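Solution:

The error occurs because ceph osd rm removes only the OSD id; the cephx key registered for osd.5 stays on the monitors. The rebuilt OSD reuses id 5 but generates a new key, so the ceph auth add call is rejected with a key mismatch. When removing the failed OSD, delete its auth entry as well: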

[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.5
updated
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
removed osd.5
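The two removal commands above can be wrapped so that the auth entry is never forgotten. This is a minimal sketch, not a full decommissioning procedure: it covers only the two steps used in this post. The function name remove_failed_osd and the CEPH override variable are hypothetical.

```shell
#!/bin/sh
# remove_failed_osd ID: delete the cephx key for osd.ID, then remove
# the OSD id itself. Stops at the first command that fails.
# CEPH is a hypothetical override (defaults to the real ceph binary)
# that allows a dry run, e.g. CEPH=echo.
remove_failed_osd() {
    id=$1
    # Accept only a bare numeric id such as 5, not "osd.5".
    case $id in
        ''|*[!0-9]*)
            echo "usage: remove_failed_osd <numeric-osd-id>" >&2
            return 2
            ;;
    esac
    "${CEPH:-ceph}" auth del "osd.$id" &&
    "${CEPH:-ceph}" osd rm "$id"
}

# Dry run (prints the commands instead of running them):
#   CEPH=echo; remove_failed_osd 5
```

Validating the id up front avoids passing a malformed name to ceph auth del.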

Then run prepare and activate again:

[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5

Check the OSD status:

[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id    weight  type name       up/down reweight
-1      11.63   root default
-2      5.26            host ceph-osd-1
0       5.26                    osd.0   up      1
-3      2.73            host ceph-osd-2
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
3       0.91                    osd.3   up      1
-4      3.64            host ceph-osd-3
4       1.82                    osd.4   up      1
5       0.91                    osd.5   up      1
6       0.91                    osd.6   up      1

Check the cluster status; recovery has already started:

[root@ceph-osd-1 ceph-cluster]# ceph -s
    cluster 374b8b6b-8b47-4d14-af47-f383d42af2ba
     health HEALTH_WARN 7 pgs backfill; 8 pgs backfilling; 1 pgs degraded; 6 pgs recovering; 34 pgs recovery_wait; 55 pgs stuck unclean; recovery 15753/191906 objects degraded (8.209%)
     monmap e1: 1 mons at {ceph-osd-1=10.10.200.163:6789/0}, election epoch 1, quorum 0 ceph-osd-1
     osdmap e371: 7 osds: 7 up, 7 in
      pgmap v52585: 704 pgs, 7 pools, 251 GB data, 61957 objects
            800 GB used, 11105 GB / 11905 GB avail
            15753/191906 objects degraded (8.209%)
                  34 active+recovery_wait
                 649 active+clean
                   7 active+remapped+wait_backfill
                   1 active+degraded+remapped+backfilling
                   6 active+recovering
                   7 active+remapped+backfilling
recovery io 8150 kB/s, 1 objects/s
  client io 2037 B/s wr, 0 op/s
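To follow recovery without rereading the full status each time, the degraded percentage can be pulled out of the ceph -s text. This is a rough sketch that assumes the "objects degraded (N%)" wording shown above; other Ceph releases phrase this line differently. The function name degraded_percent is hypothetical, and grep -o is assumed to be available (GNU and BSD grep both support it).

```shell
#!/bin/sh
# degraded_percent: read `ceph -s` text on stdin and print the first
# degraded-object percentage found, without the percent sign.
# Prints nothing if the status contains no such line.
degraded_percent() {
    grep -o 'objects degraded ([0-9.]*%)' |
        head -n 1 |
        sed 's/.*(\(.*\)%)/\1/'
}

# Usage on a live cluster:
#   ceph -s | degraded_percent
```

The figure appears twice in the status output (once in the health line and once in the pgmap summary), so only the first match is kept.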