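The remote management console showed that several hard disks had failed.
Replacing a disk in a RAID 10 setup is straightforward: the drives are hot-swappable, so you can pull the bad one and put in a new one. The steps are below.
Identify the failed disk by its serial number (SN), which the remote management console shows: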
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | grep -B 25 3SL1KEF2
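The grep above depends on the serial number appearing within 25 lines after the enclosure and slot lines. As a rough alternative, the sketch below pairs them explicitly with awk. The function name `find_slot_by_sn` is hypothetical, and it assumes the `-PDList` output carries the serial number on an `Inquiry Data:` line and that field names start at column 0.

```shell
# Hypothetical helper: read `MegaCli64 -PDList -aAll -NoLog` output on stdin
# and print "enclosure:slot" for the drive whose Inquiry Data contains the SN.
find_slot_by_sn() {
    awk -v sn="$1" '
        /^Enclosure Device ID:/ { encl = $NF }   # remember the latest enclosure ID
        /^Slot Number:/         { slot = $NF }   # remember the latest slot number
        /^Inquiry Data:/ && index($0, sn) { print encl ":" slot }
    '
}

# Usage (path as used in this article):
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | find_slot_by_sn 3SL1KEF2
```

The printed pair is what goes into `-PhysDrv[E:S]` in the commands that follow.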
Take the failed disk offline
/opt/MegaRAID/MegaCli/MegaCli64 -PDOffline -PhysDrv[32:7] -a0
How 32, 7, and -a0 in the command above map to the -PDList output:
Adapter #0
Enclosure Device ID: 32
Slot Number: 7
Light up the target disk (locate it by making its LED blink)
/opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -start -physdrv[32:7] -a0
Note: after the disk has been replaced, turn the locate LED off
/opt/MegaRAID/MegaCli/MegaCli64 -PdLocate -stop -physdrv[32:7] -a0
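Replace the failed disk
At this point the failed disk is OFFLINE. On site, the failed disk blinks amber while healthy disks show green. Pull the failed disk and insert the good one; its LED blinks green and the drive spins up, which indicates that it is rebuilding. Check the state as follows: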
$ MegaCli -PDList -aAll -NoLog
...
Enclosure Device ID: 32
Slot Number: 7
...
Firmware state: Rebuild
Check rebuild progress
# /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg -PhysDrv[32:7] -aAll
Rebuild Progress on Device at Enclosure 32, Slot 3 Completed 16% in 94 Minutes.
Or show it as a live text progress bar
# /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ProgDsply -PhysDrv[32:7] -a0
Rebuild progress of physical drives...
Enclosure:Slot Percent Complete Time Elps
032 :07 #######****************15 %*********************** 00:24:37
Press <ESC> key to quit...
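For scripting, the percentage can be pulled out of the one-line `-ShowProg` report. This is a minimal sketch that assumes the line format shown above; the function name `rebuild_percent` is hypothetical.

```shell
# Hypothetical helper: extract the integer percentage from a
# "Rebuild Progress on Device ... Completed NN% in MM Minutes." line on stdin.
# Prints nothing if no such line is present.
rebuild_percent() {
    sed -n 's/.*Completed \([0-9][0-9]*\)% in .*/\1/p'
}

# Usage (path and slot as used in this article):
# /opt/MegaRAID/MegaCli/MegaCli64 -PDRbld -ShowProg '-PhysDrv[32:7]' -a0 | rebuild_percent
```

Quoting `-PhysDrv[32:7]` in the usage line keeps the shell from treating the brackets as a glob pattern.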
Replacement complete
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | grep 'Firmware state'
Firmware state: Copyback
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Hotspare, Spun Up
Firmware state: Online, Spun Up
Firmware state: Online, Spun Up
Firmware state: Offline
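With many drives, a per-state count is easier to scan than the raw list. The sketch below is one way to get it, assuming the `Firmware state:` line format shown above; `pd_state_summary` is a hypothetical name.

```shell
# Hypothetical helper: count drives per firmware state from
# `MegaCli64 -PDList -aAll -NoLog` output on stdin, most common state first.
pd_state_summary() {
    awk -F': *' '/^Firmware state:/ { n[$2]++ }
                 END { for (s in n) print n[s], s }' | sort -rn
}

# Usage:
# /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | pd_state_summary
```

Any state other than Online or Hotspare in the summary is worth a closer look with the full `-PDList` output.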
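Set up a hot spare
To guard against further disk failures, give the RAID a hot spare:
# /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -Dedicated -Array1 -physdrv[32:9] -a0 # add a dedicated hot spare; Array1 refers to RAID group 1 (Target Id: 1)
After adding it, check where the hot spare sits: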
# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :Virtual Disk 0
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 223.0 GB
Sector Size : 512
Mirror Data : 223.0 GB
State : Optimal
Strip Size : 64 KB
Number Of Drives : 2
Span Depth : 1
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: No
LD has drives that support T10 power conditions: No
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Virtual Drive: 1 (Target Id: 1)
Name :
RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
Size : 1.635 TB
Sector Size : 512
Mirror Data : 1.635 TB
State : Degraded
Strip Size : 64 KB
Number Of Drives per span:2
Span Depth : 3
Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Default Power Savings Policy: Controller Defined
Current Power Savings Policy: None
Can spin up in 1 minute: Yes
LD has drives that support T10 power conditions: Yes
LD's IO profile supports MAX power savings with cached writes: No
Bad Blocks Exist: No
Is VD Cached: No
Number of Dedicated Hot Spares: 1
0 : EnclId - 32 SlotId - 9
Exit Code: 0x00
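In output this long the State lines are easy to miss (here Virtual Drive 1 is Degraded). The following sketch lists only the virtual drives that are not Optimal. It assumes the `Virtual Drive:` and `State :` line formats shown above, and `ld_not_optimal` is a hypothetical name.

```shell
# Hypothetical helper: print "<virtual drive number> <state>" for every
# virtual drive in `-LDInfo -Lall -aALL` output (stdin) that is not Optimal.
ld_not_optimal() {
    awk '
        /^Virtual Drive:/ { vd = $3 }   # third field is the drive number
        /^State[ \t]*:/   {
            sub(/^State[ \t]*:[ \t]*/, "")   # keep only the state text
            if ($0 != "Optimal") print vd, $0
        }
    '
}

# Usage:
# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | ld_not_optimal
```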
# Show detailed logical drive information
sudo /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog
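When the RAID has a hot spare, the replacement disk shows Firmware state: Copyback instead: the spare has already taken over for the failed disk, and its data is copied back to the new disk.
Copyback progress can be read straight from the controller log:
# watch -n 30 'MegaCli -FwTermLog -Dsply -aALL | tail -f'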
Every 30.0s: MegaCli -FwTermLog -Dsply -aALL | tail -f
07/29/19 13:16:36: Load Balance Statistics Path0PDs d Path1PDs 0
07/29/19 13:16:36: EVT#25896-07/29/19 13:16:36: 91=Inserted: PD 00(e0x20/s0)
07/29/19 13:16:36: EVT#25897-07/29/19 13:16:36: 247=Inserted: PD 00(e0x20/s0) Info: enclPd=20, scsiType=0, portMap=00, sasAddr=5000c500720794fd,0000000000000000
07/29/19 13:16:37: request temp sensor i2c failed
07/29/19 13:16:37: PD_InsertionPostProcess: Setting foreign DDF type on pd=0
07/29/19 13:16:37: EVT#25898-07/29/19 13:16:37: 114=State change on PD 00(e0x20/s0) from UNCONFIGURED_BAD(1) to UNCONFIGURED_GOOD(0)
07/29/19 13:16:37: pdHspHistCheckInsertedPdCallback: Start copy back from sparePd=03 to pd=0, changing entryType to ok
07/29/19 13:16:37: ArDiskTypeMisMatch : NO_MIXING_VIOLATION array=1 destPD=0
07/29/19 13:16:37: EVT#25899-07/29/19 13:16:37: 281=CopyBack automatically started on PD 00(e0x20/s0) from PD 03(e0x20/s3)
07/29/19 13:16:37: EVT#25900-07/29/19 13:16:37: 114=State change on PD 00(e0x20/s0) from UNCONFIGURED_GOOD(0) to COPYBACK(20)
07/29/19 13:18:18: EVT#25901-07/29/19 13:18:18: 279=CopyBack progress on PD 00(e0x20/s0) is 0.99%(99s)
07/29/19 13:19:57: EVT#25902-07/29/19 13:19:57: 279=CopyBack progress on PD 00(e0x20/s0) is 1.99%(197s)
07/29/19 13:21:37: EVT#25903-07/29/19 13:21:37: 279=CopyBack progress on PD 00(e0x20/s0) is 2.99%(297s)
07/29/19 13:23:17: EVT#25904-07/29/19 13:23:17: 279=CopyBack progress on PD 00(e0x20/s0) is 3.99%(397s)
07/29/19 13:24:57: EVT#25905-07/29/19 13:24:57: 279=CopyBack progress on PD 00(e0x20/s0) is 4.99%(497s)
07/29/19 13:26:39: EVT#25906-07/29/19 13:26:39: 279=CopyBack progress on PD 00(e0x20/s0) is 5.99%(598s)
Exit Code: 0x00
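To get just the latest figure instead of watching the whole log, the progress events can be filtered. This is a sketch that assumes the `CopyBack progress on PD ... is N%` event format shown above; `copyback_percent` is a hypothetical name.

```shell
# Hypothetical helper: print the most recent "CopyBack progress" percentage
# found in `-FwTermLog -Dsply` output on stdin (prints nothing if none is present).
copyback_percent() {
    sed -n 's/.*CopyBack progress on PD .* is \([0-9.]*\)%.*/\1/p' | tail -n 1
}

# Usage:
# /opt/MegaRAID/MegaCli/MegaCli64 -FwTermLog -Dsply -aALL | copyback_percent
```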
Basic MegaCli usage
# Show the RAID level
$ megacli -LDInfo -Lall -aALL
# Show detailed logical drive information
$ /opt/MegaRAID/MegaCli/MegaCli64 -LdPdInfo -aAll -NoLog
# Show RAID controller information
$ megacli -AdpAllInfo -aALL
# Show physical disk information
$ /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL
# Show battery (BBU) information
$ megacli -AdpBbuCmd -aAll
# Show the RAID controller log
$ /opt/MegaRAID/MegaCli/MegaCli64 -FwTermLog -Dsply -aALL
# Show the number of adapters
$ megacli -adpCount
# Show the adapter time
$ megacli -AdpGetTime -aALL
# Show all adapter information
$ megacli -AdpAllInfo -aAll
# Show all logical drive information
$ megacli -LDInfo -LALL -aAll
# Show all physical disk information
$ megacli -PDList -aAll
# Show the charger status
$ megacli -AdpBbuCmd -GetBbuStatus -aALL |grep 'Charger Status'
# Show BBU status
$ megacli -AdpBbuCmd -GetBbuStatus -aALL
# Show BBU capacity information
$ megacli -AdpBbuCmd -GetBbuCapacityInfo -aALL
# Show BBU design parameters
$ megacli -AdpBbuCmd -GetBbuDesignInfo -aALL
# Show current BBU properties
$ megacli -AdpBbuCmd -GetBbuProperties -aALL
# Show the controller model, RAID settings, and disk information
$ megacli -cfgdsply -aALL
## How disk states change from pulling the old disk to inserting the new one
Device         | Normal  | Damage                | Rebuild  | Normal
Virtual Drive  | Optimal | Degraded              | Degraded | Optimal
Physical Drive | Online  | Failed / Unconfigured | Rebuild  | Online
# Check a physical disk's state (E = Enclosure Device ID, S = Slot Number):
$ megacli -PDInfo -PhysDrv[E:S] -a0
## While a disk is rebuilding, its state shows: "Firmware state: Rebuild"
# Query rebuild progress:
$ megacli -pdrbld -showprog -physdrv[E:S] -aALL
## The output looks like this:
Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 77% in 101 Minutes.
# Show rebuild progress as a text progress bar:
$ megacli -pdrbld -progdsply -physdrv[E:S] -aALL
## The screen shows something like this:
Rebuild progress of physical drives...
Enclosure:Slot Percent Complete Time Elps
032 :05 #######################87 %################******* 01:59:07
Press <ESC> key to quit...
# Show the controller's rebuild settings:
$ megacli -AdpAllinfo -aALL | grep -i rebuild
## The output looks like this:
Rebuild Rate : 30%
Auto Rebuild : Enabled
Rebuild Rate : Yes
Force Rebuild : Yes
# Set the controller's rebuild rate to 60% (speeds up the rebuild):
$ /opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp RebuildRate -60 -a0
## On success it returns:
Adapter 0: Set rebuild rate to 60% success.
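A higher rebuild rate gives the rebuild a larger share of controller resources at the expense of host I/O, so it is worth checking the value before applying it. The sketch below validates the rate and prints the command rather than running it. The name `rebuild_rate_cmd` is hypothetical, and the 0 to 100 range is an assumption based on the setting being a percentage.

```shell
# Hypothetical helper: validate a rebuild rate and print the matching command
# (same syntax as used above) instead of running it, so it can be reviewed first.
rebuild_rate_cmd() {
    rate=$1
    case $rate in
        # Reject empty input, anything non-numeric, and 4+ digit values.
        ''|*[!0-9]*|????*)
            echo "rate must be an integer from 0 to 100" >&2
            return 1 ;;
    esac
    if [ "$rate" -gt 100 ]; then
        echo "rate must be an integer from 0 to 100" >&2
        return 1
    fi
    echo "/opt/MegaRAID/MegaCli/MegaCli64 -AdpSetProp RebuildRate -$rate -a0"
}

# Usage: review the printed command, then run it by hand.
# rebuild_rate_cmd 60
```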
# Set a hot spare (options in square brackets are optional)
/opt/MegaRAID/MegaCli/MegaCli64 -pdhsp -set [-Dedicated [-Array2]] [-EnclAffinity] [-nonRevertible] -PhysDrv[4:11] -a0
/opt/MegaRAID/MegaCli/MegaCli64 -pdhsp -set [-EnclAffinity] [-nonRevertible] -PhysDrv[32:1] -a0
MegaCli -PDHSP -Set -Dedicated -Array0 -physdrv[E:S] -a0 # add a dedicated hot spare; Array0 refers to RAID group 0 (Target Id: 0)
Example: sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -Dedicated -Array1 -physdrv[32:9] -a0 # add a dedicated hot spare; Array1 refers to RAID group 1 (Target Id: 1)
MegaCli -pdhsp -set -physdrv[E:S] -a0 # add a global hot spare
MegaCli -pdhsp -rmv -physdrv[E:S] -a0 # remove a hot spare (global or dedicated)
Example: sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -rmv -physdrv[32:9] -a0
# Delete an array
/opt/MegaRAID/MegaCli/MegaCli64 -cfglddel -L2 -Force -a0 # force-delete the RAID group with Target Id 2; get the ID from "Show detailed logical drive information" above. Without -Force the command sometimes fails with: Virtual Disk is associate with Cache Cade. Please Use force option to delete
/opt/MegaRAID/MegaCli/MegaCli64 -cfgclr -a0 # clear the configuration of all RAID groups
# Clear foreign configurations
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -clear -a0
# Rescan the number of foreign configurations
/opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -scan -a0
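Common problems:
1. Firmware state: Unconfigured(good), Spun Up (iDRAC reports an error: after logging in to iDRAC, the disk shows an exclamation mark and its state is Foreign)
Fix: /opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Import -aall
After the import another problem showed up: the disk now belonged to a RAID group containing only that one disk, which conflicts with my goal of making it a hot spare.
So we delete the newly created RAID group:
/opt/MegaRAID/MegaCli/MegaCli64 -cfglddel -L2 -Force -a0 # force-delete the RAID group with Target Id 2; get the ID from "Show detailed logical drive information" above. Without -Force the command sometimes fails with: Virtual Disk is associate with Cache Cade. Please Use force option to delete
Finally, set the drive as a hot spare:
sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -Dedicated -Array1 -physdrv[32:9] -a0
2. Firmware state: Unconfigured(bad): how to fix it when I have a new disk that I want to use as the disk group's hot spare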
Enclosure Device ID: 32
Slot Number: 9
Enclosure position: 1
Device Id: 9
Firmware state: Unconfigured(bad)
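A disk can show Unconfigured Bad because the drive hit an error. Proceed as follows:
1. Mark the drive as good from the command line.
sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDMakeGood -physdrv[32:9] -a0
2. Check that 32:9 is now in the Unconfigured(good) state.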
Enclosure Device ID: 32
Slot Number: 9
Enclosure position: 1
Device Id: 9
Firmware state: Unconfigured(good), Spun Up
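3. Then deal with the foreign config. (Painful lesson: clearing the foreign config took the whole server down. Do not clear it; otherwise the only way to recover is to import the foreign config from the BIOS.)
### sudo /opt/MegaRAID/MegaCli/MegaCli64 -cfgforeign -clear -a0
/opt/MegaRAID/MegaCli/MegaCli64 -CfgForeign -Import -aall # use with caution
Reference: http://www.51niux.com/?id=77 (using the MegaCLI tool)
4. Finally, with the old foreign configuration out of the way, set the drive as a hot spare.
sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDHSP -Set -Dedicated -Array1 -physdrv[32:9] -a0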
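The state checks in the steps above (for example confirming that 32:9 is Unconfigured(good) after `-PDMakeGood`) can be narrowed to a single drive. This sketch assumes the `-PDList` field names shown earlier; `pd_state` is a hypothetical name.

```shell
# Hypothetical helper: print the firmware state of one drive, given its
# enclosure ID and slot number, from `-PDList -aAll -NoLog` output on stdin.
pd_state() {
    awk -v e="$1" -v s="$2" '
        /^Enclosure Device ID:/ { encl = $NF }
        /^Slot Number:/         { slot = $NF }
        /^Firmware state:/ && encl == e && slot == s {
            sub(/^Firmware state: */, "")   # keep only the state text
            print
        }
    '
}

# Usage:
# sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aAll -NoLog | pd_state 32 9
```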