Hardware fault detection tool: mcelog
Introduction to mcelog
Source code: /andikleen/mcelog
mcelog logs and accounts machine checks (in particular memory, IO, and CPU hardware errors) on modern x86 Linux systems.
mcelog is required by both 32bit x86 Linux kernels (since 2.6.30) and 64bit Linux kernels (since early 2.6 kernel releases) to log machine checks and should run on all Linux systems that need error handling.
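Put simply, mcelog detects memory, I/O, and CPU hardware faults on x86 Linux systems. Because it targets x86 only, it cannot be used for hardware fault detection on ARM.
Installing mcelog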
yum install gcc.x86_64 gcc-c++.x86_64 flex.x86_64 dialog.x86_64 ras-utils.x86_64 git.x86_64
After installation, any hardware fault shows up in /var/log/messages or in /var/log/mcelog. The exact log location depends on the OS version and on mcelog's default configuration.
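Since the log location varies, one quick way to find out where events are landing is to check both candidate files. The following is a minimal sketch, not part of mcelog: find_mcelog_logs is a hypothetical helper, and it assumes that each decoded event carries a "Hardware event" line, as in mcelog's sample output.

```shell
#!/bin/sh
# List which candidate log files contain decoded machine-check events.
# find_mcelog_logs is a hypothetical helper; its arguments are log file paths.
find_mcelog_logs() {
    for f in "$@"; do
        # Skip files that are missing or unreadable (reading them may need root).
        [ -r "$f" ] || continue
        # Assumption: each decoded event carries a "Hardware event" line.
        n=$(grep -c 'Hardware event' "$f" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s: %s hardware event(s)\n' "$f" "$n"
        fi
    done
}

# The two locations named in the text; prints nothing if neither holds events.
find_mcelog_logs /var/log/messages /var/log/mcelog
```

If nothing is printed, either no event has been logged yet or the logs live somewhere else on this system.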
We can run journalctl | grep -i "mcelog"
to search the system log as a whole, rather than checking messages and the mcelog file separately. The output looks like this:
Jul 13 18:23:30 instance-vwviu68u mcelog[13877]: Location: SOCKET:0 CHANNEL:0 DIMM:? []
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Running trigger `dimm-error-trigger'
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Hardware event. This is not a software error.
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCE 0
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: CPU 0 BANK 2 TSC 14072acda54512
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: RIP !INEXACT! 73:1eadbabe
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MISC 8c ADDR 1000
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: TIME 1563013434 Sat Jul 13 18:23:54 2019
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCG status:RIPV MCIP
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCi status:
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Uncorrected error
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Error enabled
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCi_MISC register valid
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCi_ADDR register valid
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: SRAO
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Transaction: Memory scrubbing error
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: STATUS bd000000000000c0 MCGSTATUS 5
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCGCAP 100010a APICID 0 SOCKETID 0
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MICROCODE 1
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: CPUID Vendor Intel Family 6 Model 85
Jul 13 18:23:54 instance-vwviu68u mcelog[13932]: Uncorrected DIMM memory error count exceeded threshold: 2 in 24h
Jul 13 18:23:54 instance-vwviu68u mcelog[13933]: Location: SOCKET:0 CHANNEL:0 DIMM:? []
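A log like the one above gets long quickly, so it can help to condense it. The sketch below is only an illustration: summarize_mcelog is a hypothetical helper that assumes the syslog-style "mcelog[PID]: ..." line format shown above, and it counts decoded events, uncorrected errors, and each MCA type.

```shell
#!/bin/sh
# Summarize mcelog lines read from stdin (hypothetical helper).
# Assumes the "mcelog[PID]: message" format seen in the sample log.
summarize_mcelog() {
    awk '
        /mcelog\[[0-9]+\]: Hardware event/    { events++ }
        /mcelog\[[0-9]+\]: Uncorrected error/ { uncorrected++ }
        # Keep only the text after "MCA: " and count each distinct type.
        /mcelog\[[0-9]+\]: MCA: /             { sub(/^.*MCA: /, ""); mca[$0]++ }
        END {
            printf "hardware events: %d\n", events
            printf "uncorrected errors: %d\n", uncorrected
            for (m in mca) printf "MCA %s: %d\n", m, mca[m]
        }
    '
}

# Demo on three lines taken from the sample log above.
summarize_mcelog <<'EOF'
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Hardware event. This is not a software error.
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: Uncorrected error
Jul 13 18:23:54 instance-vwviu68u mcelog[15876]: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
EOF
```

On a live system the same function could be fed from the journal, for example journalctl | grep -i "mcelog" | summarize_mcelog.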
Hardware fault injection tool: mce-test
Introduction to mce-test
Source code: /andikleen/mce-test
The MCE test suite is a collection of tools and test scripts for
testing the Linux RAS related features, including CPU/Memory error
containment and recovery, ACPI/APEI support etc.
mce-test in practice
Create a file named test containing the following script, which describes a corrected CPU error:
CPU 0 BANK 5
STATUS corrected
Load the fault injection kernel module with modprobe mce-inject
Then run mce-inject test
to trigger the simulated fault. Afterwards, journalctl | grep -i "mcelog"
shows the resulting MCE error information.
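The steps above can be collected into one small script. This is a sketch under assumptions rather than an official mce-test workflow: write_case, inject_case, and the RUN_INJECT environment variable are hypothetical names introduced here, and the injection path needs root plus the mce-inject module and tool. Nothing touches the machine unless RUN_INJECT=1 is set.

```shell
#!/bin/sh
# Write the corrected-error case from the text to the given path.
# (write_case is a hypothetical helper name.)
write_case() {
    printf 'CPU 0 BANK 5\nSTATUS corrected\n' > "$1"
}

# Load the injector module, inject the case, then show recent mcelog lines.
# Needs root, the mce-inject kernel module, and the mce-inject tool.
inject_case() {
    modprobe mce-inject &&
    mce-inject "$1" &&
    sleep 1 &&                      # give mcelog a moment to log the event
    journalctl --no-pager | grep -i "mcelog" | tail -n 40
}

# Hypothetical opt-in switch so the script is harmless by default.
if [ "${RUN_INJECT:-0}" = 1 ]; then
    write_case test
    inject_case test
fi
```

Running it as RUN_INJECT=1 sh followed by the script name, as root, performs the same three steps as the manual procedure.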
For scripts that simulate other fault types, see: /andikleen/mce-test/tree/master/cases/coverage/soft-inj
The script contents are in the data directory of each case, for example /andikleen/mce-test/tree/master/cases/coverage/soft-inj/panic_noser/data