Ceph OSD (Hammer) fails to start automatically at boot

Time: 2022-07-24 12:37:48

Analysis and solution

1. Initial diagnosis: the partition type codes on the disks of several manually added OSDs had never been changed to the Ceph type codes.

  • GPT type codes of the two kinds of Ceph partitions:
Partition        Type code
journal          45b0969e-9b03-4f30-b4c6-b4b80ceff106
osd data         4fbd7e29-9d25-41b8-afd0-062c0ceff05d
  • Change the partition type code and the partition name:
root@host1:~# sgdisk -t 1:45b0969e-9b03-4f30-b4c6-b4b80ceff106 -c 1:"ceph journal" /dev/nvme0n1
The operation has completed successfully.

root@host1:~# sgdisk -t 3:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -c 3:"ceph data" /dev/nvme0n1
The operation has completed successfully.
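To confirm which partitions already carry the Ceph type codes, the type GUID can be read back with sgdisk -i. As far as we understand, ceph-disk's udev rules and ceph-disk activate-all use this GUID to find the partitions to activate, which is why a wrong type code keeps an OSD from starting at boot. The following is a minimal POSIX sh sketch, assuming sgdisk -i prints a line of the form "Partition GUID code: <GUID> (<description>)"; ceph_part_role and part_type_guid are hypothetical helper names, and only the two GUIDs from the table are recognised.

```shell
# Hypothetical helper: map a GPT partition type GUID to the Ceph role it denotes.
# Only the journal and data GUIDs from the table are recognised; anything else is "other".
ceph_part_role() {
    # sgdisk prints GUIDs in upper case, so compare in lower case
    guid=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    case "$guid" in
        45b0969e-9b03-4f30-b4c6-b4b80ceff106) echo journal ;;
        4fbd7e29-9d25-41b8-afd0-062c0ceff05d) echo data ;;
        *) echo other ;;
    esac
}

# Hypothetical helper: read "sgdisk -i N DEV" output on stdin and print the type GUID.
# Assumes the line looks like "Partition GUID code: <GUID> (<description>)".
part_type_guid() {
    awk -F': ' '/^Partition GUID code:/ { split($2, f, " "); print f[1] }'
}
```

For example, ceph_part_role "$(sgdisk -i 3 /dev/nvme0n1 | part_type_guid)" should print data once the type code of partition 3 has been changed.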

After this first fix the OSDs still did not start at boot, so the analysis continued:

2. Tried to start all OSDs manually with ceph-disk activate-all, but it failed with the following errors:

Error EINVAL: entity osd.27 exists but cap mon does not match
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.27', '-i', '/var/lib/ceph/tmp/mnt.pHvUBm/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
Error EINVAL: entity osd.30 exists but cap mon does not match
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.30', '-i', '/var/lib/ceph/tmp/mnt.KJtXc9/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
Error EINVAL: entity osd.24 exists but cap mon does not match
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.24', '-i', '/var/lib/ceph/tmp/mnt.HraMX9/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
Error EINVAL: entity osd.29 exists but cap mon does not match
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.29', '-i', '/var/lib/ceph/tmp/mnt.tQTBrQ/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
Error EINVAL: entity osd.28 exists but cap mon does not match
ERROR:ceph-disk:Failed to activate
ceph-disk: Command '['/usr/bin/ceph', '--cluster', 'ceph', '--name', 'client.bootstrap-osd', '--keyring', '/var/lib/ceph/bootstrap-osd/ceph.keyring', 'auth', 'add', 'osd.28', '-i', '/var/lib/ceph/tmp/mnt.vVEkBg/keyring', 'osd', 'allow *', 'mon', 'allow profile osd']' returned non-zero exit status 22
ceph-disk: Error: One or more partitions failed to activate

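Judging from the log, the problem lies in the caps: during activation ceph-disk runs ceph auth add for osd.N with the caps osd 'allow *' and mon 'allow profile osd', and the monitor rejects the command with EINVAL (exit status 22) because the entity already exists with a different mon cap.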

Note: the same errors are also logged at system boot, in ceph-osd-all-starter.log under /var/log/upstart.
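When many OSDs are affected, it helps to extract the failing entities from the ceph-disk output or the upstart log instead of reading them off by eye. A minimal sketch, assuming the error lines keep exactly the wording in the log above ("Error EINVAL: entity osd.N exists but cap mon does not match"); failed_cap_osds is a hypothetical helper name.

```shell
# Hypothetical helper: read ceph-disk output (or ceph-osd-all-starter.log) on stdin
# and print each OSD entity whose activation failed because its mon cap differs.
failed_cap_osds() {
    # Keep only the "entity osd.N exists but cap mon does not match" lines,
    # printing the osd.N part of each one
    sed -n 's/.*entity \(osd\.[0-9][0-9]*\) exists but cap mon does not match.*/\1/p'
}
```

For example, failed_cap_osds < /var/log/upstart/ceph-osd-all-starter.log | sort -u lists each affected OSD once, even if it failed on several boots.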

  • Inspect the ceph auth entries:
root@node-16:~# ceph auth list
installed auth entries:

osd.0
key: AQBiCpZXqoEFJxAAJSJNB6ssR6Llfem6yYapiQ==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.1
key: AQBkCpZXLu8wBRAAB/93w9IREudzSZqFCe8BPw==
caps: [mon] allow profile osd
caps: [osd] allow *
osd.24
key: AQBORxBYNsfGORAAon01sB3Bc3smHw4aH37hqg==
caps: [mon] allow rwx
caps: [osd] allow *
osd.25
key: AQD6RxBYx37NIxAA1ruW4XnGuHgqlGdzxVlXPA==
caps: [mon] allow rwx
caps: [osd] allow *

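Comparing the OSDs that start normally with the ones that fail shows:

The mon cap of the OSDs that start normally is "allow profile osd".

The mon cap of the failing OSDs is "allow rwx".

We concluded that this mismatch is almost certainly the cause.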
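On a cluster with many OSDs, this comparison can be automated by scanning the ceph auth list output for OSD entries whose mon cap is not "allow profile osd". This is a minimal sketch, assuming the output layout in the listing above (an entity line such as osd.24 followed by key: and caps: lines, possibly tab-indented); osds_with_wrong_mon_cap is a hypothetical helper name, and entities other than osd.N are ignored.

```shell
# Hypothetical helper: read "ceph auth list" output on stdin and print every
# osd.N entity whose [mon] cap is not exactly "allow profile osd".
osds_with_wrong_mon_cap() {
    awk '
        # Normalise: drop leading indentation and trailing blanks or carriage returns
        { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }
        # A [mon] cap line: compare its value with the expected profile
        /^caps: \[mon\] / {
            cap = substr($0, length("caps: [mon] ") + 1)
            if (entity != "" && cap != "allow profile osd") print entity
            next
        }
        # Any other attribute line (key:, caps: [osd], ...) is skipped
        /^[a-z]+:/ { next }
        # Any other non-empty line starts a new entry; remember it only if it is an OSD
        NF { entity = ($0 ~ /^osd\.[0-9]+$/) ? $0 : "" }
    '
}
```

Piping ceph auth list into osds_with_wrong_mon_cap should print osd.24 and osd.25 for the listing shown above.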

  • Change the caps of the affected OSDs:
root@node-16:/var/lib/ceph# ceph auth caps osd.24 mon 'allow profile osd' osd 'allow *'
updated caps for osd.24
root@node-16:/var/lib/ceph# ceph auth caps osd.25 mon 'allow profile osd' osd 'allow *'
updated caps for osd.25
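If more OSDs are affected, the same ceph auth caps command can be applied in a loop. A minimal sketch: reset_osd_caps is a hypothetical helper that reads osd.N entity names on stdin and applies the caps used by the healthy OSDs; setting DRY_RUN=1 only prints the commands so the list can be reviewed before anything is changed.

```shell
# Hypothetical helper: read one entity name per line (e.g. osd.24) on stdin and reset
# its caps to mon 'allow profile osd', osd 'allow *'. With DRY_RUN=1 it only prints.
reset_osd_caps() {
    while read -r entity; do
        case "$entity" in
            osd.[0-9]*) ;;      # an OSD entity: handle it
            *) continue ;;      # skip blank or unexpected lines
        esac
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%s\n' "ceph auth caps $entity mon 'allow profile osd' osd 'allow *'"
        else
            # ceph auth caps replaces all existing caps, so pass both mon and osd caps
            ceph auth caps "$entity" mon 'allow profile osd' osd 'allow *'
        fi
    done
}
```

For example, printf 'osd.24\nosd.25\n' | reset_osd_caps repeats the two commands above. Afterwards, rerunning ceph-disk activate-all is a simple way to check whether activation now succeeds.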