centos7 (3.10.0-123.el7.x86_64) reboot issue | http://aperise.iteye.com/blog/2326082 |
centos7 (3.10.0-327.el7.x86_64) reboot issue | http://aperise.iteye.com/blog/2425717 |
centos7 (3.10.0-327.el7.x86_64) reboot issue
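1. Problem
The server (2U, 2 CPUs with 6 cores each, 8 x 16 GB RAM, 5 x 2 TB disks) was installed with centos7 (3.10.0-123.el7.x86_64) and had previously hit the kernel BUG "kernel BUG at mm/page_alloc.c:3765!". Red Hat's advice was to upgrade, so it was upgraded to centos7 (3.10.0-327.el7.x86_64). Recently, however, one application server still rebooted on its own after running under load for some time, this time with the error "kernel BUG at mm/page_alloc.c:1389!".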
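2. Approach
Since I had already dealt with a similar automatic-reboot problem, the main steps here are:
(1) Enable the KDUMP service on the affected machine so that a crash log is captured when the server goes down and reboots;
(2) Analyze the crash log, identify the problem from it, and fix it.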
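3. Installing the KDUMP service
My company has purchased Red Hat support, so I obtained a shell script for installing KDUMP from our system operations engineer. Running it takes care of installing and enabling KDUMP. The script mainly does the following:
(1)#kexec-tools checking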
(2)#add crash kernel https://access.redhat.com/site/solutions/916043
(3)#backup kdump.conf
(4)#Check if the dump directory be mounted
(5)#enable kdump service
(6)#kernel parameter change
(7)#server hang
(8)#softlockup
(9)#oom
The complete script is in the attachment kdumpconfig.zip; its content is as follows:
#!/bin/sh

echo Kdump Helper is starting to configure kdump service

#kexec-tools checking
if ! rpm -q kexec-tools > /dev/null
then
    echo "kexec-tools not found, please run command yum install kexec-tools to install it"
    exit 1
fi

mem_total=`free -g |awk 'NR==2 {print $2 }'`
echo Your total memory is $mem_total G

#add crash kernel
#https://access.redhat.com/site/solutions/916043
grub_conf=/boot/grub2/grub.cfg
grub_conf_kdumphelper=/boot/grub2/grub.cfg.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $grub_conf to $grub_conf_kdumphelper
cp $grub_conf $grub_conf_kdumphelper

compute_rhel7_crash_kernel ()
{
    mem_size=$1
    if [ $mem_size -le 2 ]
    then
        reserved_memory="128M"
    else
        reserved_memory="auto"
    fi
    echo "$reserved_memory"
}

crashkernel_para=`compute_rhel7_crash_kernel $mem_total `
echo crashkernel=$crashkernel_para is set in $grub_conf
# remove any existing crashkernel= setting, then append the new one
sed -i '/^\tlinux/ s/crashkernel=\(auto\|[[:digit:]]*[mM]@[[:digit:]]*[mM]\|[[:digit:]]*[mM]\)//g' $grub_conf
sed -i ' /^\tlinux/ s/$/ crashkernel='$crashkernel_para'/g' $grub_conf

#backup kdump.conf
kdump_conf=/etc/kdump.conf
kdump_conf_kdumphelper=/etc/kdump.conf.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $kdump_conf to $kdump_conf_kdumphelper
cp $kdump_conf $kdump_conf_kdumphelper

dump_path=/var/crash
echo path $dump_path > $kdump_conf
dump_level=1
echo core_collector makedumpfile -c --message-level 1 -d $dump_level >> $kdump_conf
echo 'default reboot' >> $kdump_conf

#Check if the dump directory be mounted
dump_dev_name=$(mount | grep $dump_path | awk '{print $1}')
dump_dev_uuid=$(blkid `mount | grep $dump_path | awk '{print $1}'`| awk '{print $2}')
dump_fs_type=$(mount | grep $dump_path | awk '{print $5}')
mount | grep $dump_path > /dev/null
if [ $? -ne 0 ]; then
    echo "==== The dump directory is not mounted to a separate device. Your vmcore will be saved in the root filesystem ===="
else
    echo "==== The dump directory is mounted to a separate device. Your vmcore will be dumped to that device ===="
    echo "$dump_fs_type $dump_dev_uuid" >> $kdump_conf
    cat /etc/fstab | awk '{print $1}' | grep -E "^${dump_dev_name}|^${dump_dev_uuid}" >> /dev/null
    if [ $? -ne 0 ]; then
        echo "==== You need to add an entry in the /etc/fstab to make sure the dump directory is auto-mounted after system reboot. ===="
        echo "==== Read more in https://access.redhat.com/solutions/1197493 ===="
    fi
fi

#enable kdump service
echo enable kdump service...
systemctl enable kdump.service
systemctl -a|grep kdump
systemctl restart kdump.service

#kernel parameter change
echo Starting to Configure extra diagnostic options
sysctl_conf=/etc/sysctl.conf
sysctl_conf_kdumphelper=/etc/sysctl.conf.kdumphelper.$(date +%y-%m-%d-%H:%M:%S)
echo backup $sysctl_conf to $sysctl_conf_kdumphelper
cp $sysctl_conf $sysctl_conf_kdumphelper

#server hang
sed -i '/^kernel.sysrq/ s/kernel/#kernel/g ' $sysctl_conf
echo >> $sysctl_conf
echo '#Panic on sysrq and nmi button, magic button alt+printscreen+c or nmi button could be pressed to collect a vmcore' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/2023' >> $sysctl_conf
echo 'kernel.sysrq=1' >> $sysctl_conf
echo 'kernel.sysrq=1 set in /etc/sysctl.conf'
echo '#https://access.redhat.com/site/solutions/125103' >> $sysctl_conf
echo 'kernel.unknown_nmi_panic=1' >> $sysctl_conf
echo 'kernel.unknown_nmi_panic=1 set in /etc/sysctl.conf'

#softlockup
sed -i '/^kernel.softlockup_panic/ s/kernel/#kernel/g ' $sysctl_conf
echo >> $sysctl_conf
echo '#Panic on soft lockups.' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/19541' >> $sysctl_conf
echo 'kernel.softlockup_panic=1' >> $sysctl_conf
echo 'kernel.softlockup_panic=1 set in /etc/sysctl.conf'

#oom
# the sysctl key is vm.panic_on_oom, so comment out an existing vm.* entry
sed -i '/^vm.panic_on_oom/ s/vm/#vm/g ' $sysctl_conf
echo >> $sysctl_conf
echo '#Panic on out of memory.' >> $sysctl_conf
echo '#Added by kdumphelper, more information about it can be found in solution below' >> $sysctl_conf
echo '#https://access.redhat.com/site/solutions/20985' >> $sysctl_conf
echo 'vm.panic_on_oom=1' >> $sysctl_conf
echo 'vm.panic_on_oom=1 set in /etc/sysctl.conf'
Run the script above directly on the server to install and enable KDUMP. The next time the server crashes, KDUMP records the crash log and stores it under /var/crash/.
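Because the script edits /boot/grub2/grub.cfg, the crashkernel= memory reservation only takes effect after a reboot, so it is worth checking that KDUMP is really armed before waiting for the next crash. The following is a minimal sketch, not part of the original script. The function names are made up for illustration, and the file arguments exist only so the checks can be tried against sample files; the defaults are the standard /proc/cmdline and /sys/kernel/kexec_crash_loaded.

```sh
# Return 0 if the kernel command line in file $1 (default /proc/cmdline)
# contains a crashkernel= parameter.
has_crashkernel() {
    grep -q 'crashkernel=' "${1:-/proc/cmdline}"
}

# Return 0 if the crash kernel is loaded according to file $1
# (default /sys/kernel/kexec_crash_loaded, which reads 1 when loaded).
crash_kernel_loaded() {
    [ "$(cat "${1:-/sys/kernel/kexec_crash_loaded}" 2>/dev/null)" = "1" ]
}

# Print a one-line summary of the two checks above.
# $1 and $2 are optional overrides for the two files (hypothetical helper).
kdump_summary() {
    if has_crashkernel "$1" && crash_kernel_loaded "$2"; then
        echo "kdump ready"
    else
        echo "kdump not ready"
    fi
}
```

On the server itself, calling kdump_summary with no arguments after the reboot should be enough; "kdump not ready" usually means the machine has not been rebooted yet or kdump.service failed to start.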
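4. Analyzing the log
With KDUMP enabled, all that is left is to wait for the next crash and then analyze the log. Among the crash log files generated on my server is vmcore-dmesg.txt; opening it shows the following: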
[466312.238996] ------------[ cut here ]------------
[466312.239025] kernel BUG at mm/page_alloc.c:1389!
[466312.239043] invalid opcode: 0000 [#1] SMP
Note the line "kernel BUG at mm/page_alloc.c:1389!" above: it indicates a kernel-level BUG. The next step, then, is to look up on the Red Hat website how to resolve it.
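Picking the BUG line out of the log can be scripted. The sketch below is only an illustration (find_kernel_bugs is a made-up name): it prints each distinct "kernel BUG at ...!" location found in a dmesg dump, which is the string to search for on the Red Hat site.

```sh
# Print the distinct "kernel BUG at <file>:<line>!" locations in a dmesg dump.
# $1: path to a vmcore-dmesg.txt file
find_kernel_bugs() {
    grep -o 'kernel BUG at [^!]*!' "$1" | sort -u
}
```

Assuming KDUMP wrote its dump into a per-crash subdirectory of /var/crash/, the call would look like find_kernel_bugs /var/crash/<crash-dir>/vmcore-dmesg.txt, where <crash-dir> stands for whatever directory KDUMP created.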
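5. Solving the problem
Searching the Red Hat website for "kernel BUG at mm/page_alloc.c:1389!" leads to the solution at https://access.redhat.com/solutions/3208581. Note that only registered users can view the full solution on the Red Hat site, and registration requires a company or individual that has purchased Red Hat services, which is rather annoying. I asked our company's system operations engineer to download the page with his account; see the attachment "RHEL 7.4 server panics with message _kernel BUG at mm_page_alloc.c_1389!_.rar".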
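From that page:
(1) Any centos7 system at the RHEL 7.4 level is affected by this problem;
(2) The fix was last confirmed and verified on May 30, 2018 at 21:55;
(3) The affected major release is Red Hat Enterprise Linux 7.4; within it, kernel-3.10.0-693.el7.x86_64 and later versions are all affected;
(4) Vendors and server models on which the problem has been observed so far: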
- LENOVO System x3650 M5
- LENOVO System x3550 M5
- IBM Flex System x240 M5
- FUJITSU PRIMERGY BX2560 M1
- FUJITSU PRIMERGY RX2530 M4
- Cisco UCS B200 M4
(5) Red Hat's recommended resolution
- Red Hat Enterprise Linux 7: upgrade to kernel-3.10.0-862.el7 from Errata RHSA-2018:1062 or later
- Red Hat Enterprise Linux 7.4 (EUS): upgrade to kernel-3.10.0-693.33.1.el7 from Errata RHSA-2018:1738 or later
(6) Root cause given by Red Hat
When a memory page is reclaimed from a freelist a whole block is considered. When the beginning of a range is not aligned with the block, kernel crashes due to uninitialized page metadata at the beginning of the block.
(7) Red Hat also provides diagnostic steps; for details see https://access.redhat.com/solutions/3208581
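To see whether a machine already runs a kernel at or above one of the fixed versions quoted in (5), the release strings can be compared. The following is a rough sketch under stated assumptions: the helper names are hypothetical, the ".elN" distribution suffix is stripped so that only the numeric part is compared, and GNU sort -V is used for the ordering. It is not a replacement for checking the errata with yum or rpm.

```sh
# Strip the ".elN..." distribution suffix,
# e.g. 3.10.0-327.el7.x86_64 -> 3.10.0-327
strip_dist() {
    printf '%s\n' "${1%%.el*}"
}

# Return 0 if kernel release $1 is at least as new as release $2.
# Only the numeric part before ".el" is compared, using GNU sort -V.
kernel_at_least() {
    have=$(strip_dist "$1")
    want=$(strip_dist "$2")
    oldest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$oldest" = "$want" ]
}
```

For example, kernel_at_least "$(uname -r)" 3.10.0-862.el7 succeeds only when the running kernel is 3.10.0-862 or newer; on the 3.10.0-327.el7.x86_64 kernel from this post it fails.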
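6. Summary
When dealing with a server crash, the first priority is to get the crash log, for example by enabling the KDUMP service. Once you have the log, the problem is no longer much of a problem.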