Today osd.5 would not come back up; it turned out to be a failed disk.
[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id    weight  type name       up/down reweight
-1      11.63   root default
-2      5.26            host ceph-osd-1
0       5.26                    osd.0   up      1
-3      2.73            host ceph-osd-2
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
3       0.91                    osd.3   up      1
-4      3.64            host ceph-osd-3
4       1.82                    osd.4   up      1
5       0.91                    osd.5   down    1
6       0.91                    osd.6   up      1
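To confirm it really was the disk and not just a crashed daemon, the OSD log and the kernel log are usually enough. A quick check along these lines (the log path is the default one; /dev/sdc is only a placeholder for whatever device backs osd.5):

tail -n 100 /var/log/ceph/ceph-osd.5.log   # why the OSD process stopped
dmesg | grep -i "i/o error"                # kernel-level disk errors
smartctl -H /dev/sdc                       # SMART health summary (needs smartmontools)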
After replacing the disk, I removed osd.5 and re-added it with ceph-deploy:
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5
The activate step failed with the following error:
[10.10.200.165][WARNIN] Error EINVAL: entity osd.5 exists but key does not match
[10.10.200.165][WARNIN] Traceback (most recent call last):
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 2591, in <module>
[10.10.200.165][WARNIN]     main()
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 2569, in main
[10.10.200.165][WARNIN]     args.func(args)
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1929, in main_activate
[10.10.200.165][WARNIN]     init=args.mark_init,
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1761, in activate_dir
[10.10.200.165][WARNIN]     (osd_id, cluster) = activate(path, activate_key_template, init)
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1897, in activate
[10.10.200.165][WARNIN]     keyring=keyring,
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 1520, in auth_key
[10.10.200.165][WARNIN]     'mon', 'allow profile osd',
[10.10.200.165][WARNIN]   File "/usr/sbin/ceph-disk", line 304, in command_check_call
[10.10.200.165][WARNIN]     return subprocess.check_call(arguments)
[10.10.200.165][WARNIN]   File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
[10.10.200.165][WARNIN]     raise CalledProcessError(retcode, cmd)
[10.10.200.165][WARNIN] subprocess.CalledProcessError: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.5', '-i', '/osd5/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
[10.10.200.165][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: ceph-disk -v activate --mark-init sysvinit --mount /osd5
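The first WARNIN line is the real problem: "entity osd.5 exists but key does not match". ceph osd rm 5 only removes the OSD from the OSD map; the old osd.5 entry is still in the monitors' auth database, so ceph-disk cannot register the freshly generated key for the new OSD. The stale entry (with its old key) can be seen with:

ceph auth get osd.5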
Solution:
When removing the failed OSD, delete its cephx key as well as the OSD itself:
[root@ceph-osd-1 ceph-cluster]# ceph auth del osd.5
updated
[root@ceph-osd-1 ceph-cluster]# ceph osd rm 5
removed osd.5
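For completeness, a sketch of the commonly recommended sequence for retiring a dead OSD (osd.5 here); all of these are standard ceph subcommands:

ceph osd out 5                # stop mapping new data to it
ceph osd crush remove osd.5   # drop it from the CRUSH map
ceph auth del osd.5           # delete its cephx key (the step I had missed)
ceph osd rm 5                 # remove it from the OSD map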
Then rerun prepare and activate:
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd prepare 10.10.200.165:/osd5
[root@ceph-osd-1 ceph-cluster]# ceph-deploy osd activate 10.10.200.165:/osd5
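One step not shown above: since the physical disk was swapped, the new disk has to be formatted and mounted at /osd5 on 10.10.200.165 before prepare is run against that path. A minimal sketch, assuming the replacement disk showed up as /dev/sdb:

mkfs.xfs -f /dev/sdb    # XFS is a common choice for OSD data
mkdir -p /osd5
mount /dev/sdb /osd5    # add an /etc/fstab entry as well so it survives a reboot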
Check the OSD status:
[root@ceph-osd-1 ceph-cluster]# ceph osd tree
# id    weight  type name       up/down reweight
-1      11.63   root default
-2      5.26            host ceph-osd-1
0       5.26                    osd.0   up      1
-3      2.73            host ceph-osd-2
1       0.91                    osd.1   up      1
2       0.91                    osd.2   up      1
3       0.91                    osd.3   up      1
-4      3.64            host ceph-osd-3
4       1.82                    osd.4   up      1
5       0.91                    osd.5   up      1
6       0.91                    osd.6   up      1
Check the cluster status; recovery has already started:
[root@ceph-osd-1 ceph-cluster]# ceph -s
    cluster 374b8b6b-8b47-4d14-af47-f383d42af2ba
     health HEALTH_WARN 7 pgs backfill; 8 pgs backfilling; 1 pgs degraded; 6 pgs recovering;
            34 pgs recovery_wait; 55 pgs stuck unclean; recovery 15753/191906 objects degraded (8.209%)
     monmap e1: 1 mons at {ceph-osd-1=10.10.200.163:6789/0}, election epoch 1, quorum 0 ceph-osd-1
     osdmap e371: 7 osds: 7 up, 7 in
      pgmap v52585: 704 pgs, 7 pools, 251 GB data, 61957 objects
            800 GB used, 11105 GB / 11905 GB avail
            15753/191906 objects degraded (8.209%)
                  34 active+recovery_wait
                 649 active+clean
                   7 active+remapped+wait_backfill
                   1 active+degraded+remapped+backfilling
                   6 active+recovering
                   7 active+remapped+backfilling
recovery io 8150 kB/s, 1 objects/s
  client io 2037 B/s wr, 0 op/s
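Until the cluster returns to HEALTH_OK, recovery progress can be followed with the standard commands:

ceph -w                       # live stream of cluster and recovery events
ceph health detail            # which PGs are degraded or backfilling and why
ceph pg dump_stuck unclean    # list the PGs currently counted as stuck unclean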