线上centos6出现软死锁 kernel:BUG: soft lockup

时间:2021-06-21 00:40:45

线上centos6出现软死锁 kernel:BUG: soft lockup

今天线上一台centos6机器用xshell一直连接不上,然后在xshell上显示

Message from syslogd@GZxxx at Mar 29 14:13:14 ...
kernel:BUG: soft lockup - CPU#1 stuck for 68s! [events/1:36]

过了10分钟,终于可以连上了,看一下开机日志

dmesg |grep stuck
BUG: soft lockup - CPU#2 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#5 stuck for 67s! [bdi-default:49]
BUG: soft lockup - CPU#3 stuck for 67s! [irqbalance:1351]
BUG: soft lockup - CPU#4 stuck for 67s! [swapper:0]
BUG: soft lockup - CPU#6 stuck for 67s! [watchdog/6:30]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#0 stuck for 67s! [events/0:35]
BUG: soft lockup - CPU#7 stuck for 67s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 67s! [mpt_poll_0:376]
BUG: soft lockup - CPU#4 stuck for 67s! [ksoftirqd/4:21]
BUG: soft lockup - CPU#1 stuck for 67s! [events/1:36]
BUG: soft lockup - CPU#3 stuck for 62s! [rsyslogd:1325]
BUG: soft lockup - CPU#4 stuck for 72s! [events/4:39]
BUG: soft lockup - CPU#1 stuck for 70s! [automount:4252]
BUG: soft lockup - CPU#2 stuck for 73s! [hald:1685]
BUG: soft lockup - CPU#0 stuck for 61s! [automount:1776]
BUG: soft lockup - CPU#6 stuck for 67s! [events/6:41]
BUG: soft lockup - CPU#5 stuck for 67s! [vmmemctl:894]
BUG: soft lockup - CPU#7 stuck for 65s! [lldpad:1459]
BUG: soft lockup - CPU#3 stuck for 68s! [swapper:0]
BUG: soft lockup - CPU#2 stuck for 68s! [events/2:37]
BUG: soft lockup - CPU#0 stuck for 67s! [crond:1815]
BUG: soft lockup - CPU#7 stuck for 67s! [watchdog/7:34]
BUG: soft lockup - CPU#1 stuck for 68s! [events/1:36]
BUG: soft lockup - CPU#4 stuck for 67s! [watchdog/4:22]
BUG: soft lockup - CPU#5 stuck for 68s! [watchdog/5:26]
BUG: soft lockup - CPU#3 stuck for 66s! [swapper:0]
BUG: soft lockup - CPU#2 stuck for 66s! [ksoftirqd/2:13]
BUG: soft lockup - CPU#0 stuck for 67s! [watchdog/0:6]
BUG: soft lockup - CPU#5 stuck for 67s! [watchdog/5:26]
BUG: soft lockup - CPU#6 stuck for 62s! [fcoemon:1509]
BUG: soft lockup - CPU#4 stuck for 70s! [lldpad:1459]
BUG: soft lockup - CPU#7 stuck for 63s! [watchdog/7:34]
BUG: soft lockup - CPU#1 stuck for 63s! [sync_supers:48]
BUG: soft lockup - CPU#3 stuck for 63s! [irqbalance:1351]
BUG: soft lockup - CPU#2 stuck for 62s! [events/2:37]
BUG: soft lockup - CPU#0 stuck for 68s! [events/0:35]
BUG: soft lockup - CPU#2 stuck for 68s! [sa1:4687]
BUG: soft lockup - CPU#3 stuck for 78s! [flush-8:0:4618]
BUG: soft lockup - CPU#1 stuck for 78s! [events/1:36]
BUG: soft lockup - CPU#4 stuck for 63s! [lldpad:1459]
BUG: soft lockup - CPU#6 stuck for 64s! [fcoemon:1509]
BUG: soft lockup - CPU#5 stuck for 64s! [NetworkManager:1531]
BUG: soft lockup - CPU#0 stuck for 62s! [watchdog/0:6]
BUG: soft lockup - CPU#7 stuck for 68s! [watchdog/7:34]
BUG: soft lockup - CPU#4 stuck for 63s! [lldpad:1459]
BUG: soft lockup - CPU#1 stuck for 162s! [irqbalance:1351]
BUG: soft lockup - CPU#6 stuck for 128s! [hald:1685]
BUG: soft lockup - CPU#2 stuck for 130s! [sshd:4688]
BUG: soft lockup - CPU#5 stuck for 147s! [rsyslogd:1325]
BUG: soft lockup - CPU#3 stuck for 71s! [flush-8:0:4618]
BUG: soft lockup - CPU#6 stuck for 68s! [events/6:41]
BUG: soft lockup - CPU#2 stuck for 68s! [irqbalance:1351]
BUG: soft lockup - CPU#1 stuck for 68s! [su:4783]
BUG: soft lockup - CPU#7 stuck for 67s! [crond:1815]
BUG: soft lockup - CPU#5 stuck for 67s! [events/5:40]
BUG: soft lockup - CPU#0 stuck for 66s! [lldpad:1459]
BUG: soft lockup - CPU#4 stuck for 65s! [automount:4785]

全部都是这种错误:BUG: soft lockup - CPU#x  stuck for xs

这个错误是什么鬼?

查了一下百度,发现这是一个软死锁

内核软死锁(soft lockup)bug

Soft lockup名称解释:所谓,soft lockup就是说,这个bug没有让系统彻底死机,但是若干个进程(或者kernel thread)被锁死在了某个状态(一般在内核区域),很多情况下这个是由于内核锁的使用的问题。

Linux内核对于每一个cpu都有一个监控进程,在技术界这个叫做watchdog(看门狗)。通过ps -eo ppid,pid,user,args |grep watchdog能够看见,进程名称大概是watchdog/X(数字:cpu逻辑编号1/2/3/4之类的)。这个进程或者线程每一秒钟运行一次,否则会睡眠和待机。这个进程运行会收集每一个cpu运行时使用数据的时间并且存放到属于每个cpu自己的内核数据结构。在内核中有很多特定的中断函数。这些中断函数会调用soft lockup计数,他会使用当前的时间戳与特定(对应的)cpu的内核数据结构中保存的时间对比,如果发现当前的时间戳比对应cpu保存的时间大于设定的阀值,他就假设监测进程或看门狗线程在一个相当可观的时间还没有执。Cpu软锁为什么会产生,是怎么产生的?如果linux内核是经过精心设计安排的CPU调度访问,那么怎么会产生cpu软死锁?那么只能说由于用户开发的或者第三方软件引入,看我们服务器内核panic的原因就是qmgr进程引起。因为每一个无限的循环都会一直有一个cpu的执行流程(qmgr进程示一个后台邮件的消息队列服务进程),并且拥有一定的优先级。Cpu调度器调度一个驱动程序来运行,如果这个驱动程序有问题并且没有被检测到,那么这个驱动程序将会暂用cpu的很长时间。根据前面的描述,看门狗进程会抓住(catch)这一点并且抛出一个软死锁(soft lockup)错误。软死锁会挂起cpu使你的系统不可用。

如果是用户空间的进程或线程引起的问题backtrace是不会有内容的,如果内核线程那么在soft lockup消息中会显示出backtrace信息。

简单来说: 由于系统的某个驱动程序有问题导致watchdog无法收集每一个逻辑cpu运行时使用数据并抛出一个软死锁(soft lockup)错误

线上服务器有8个逻辑cpu所以有8只狗

cat /proc/cpuinfo |grep processor
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7

ps -eo ppid,pid,user,args |grep watchdog
2 6 root      [watchdog/0]
2 10 root [watchdog/1]
2 14 root [watchdog/2]
2 18 root [watchdog/3]
2 22 root [watchdog/4]
2 26 root [watchdog/5]
2 30 root [watchdog/6]
2 34 root [watchdog/7]
4852 4883 root grep watchdog

线上centos6出现软死锁 kernel:BUG: soft lockup

在/var/log/messages里找到关键信息,由于用的是vmware esxi平台,估计vmware esxi的某个硬件驱动有问题,正准备联系vmware那边的工程师解决

less /var/log/messages
Mar 28 18:34:55 xxx kernel: UNSUPPORTED HARDWARE DEVICE: CPU family 6 model > 59
Mar 28 18:34:55 xxx kernel: ------------[ cut here ]------------
Mar 28 18:34:55 xxx kernel: WARNING: at kernel/rh_taint.c:13 mark_hardware_unsupported+0x39/0x40() (Not tainted)
Mar 28 18:34:55 xxx kernel: Hardware name: VMware Virtual Platform
Mar 28 18:34:55 xxx kernel: Your hardware is unsupported. Please do not report bugs, panics, oopses, etc., on this hardware.
Mar 28 18:34:55 xxx kernel: Modules linked in:
Mar 28 18:34:55 xxx kernel: Pid: 0, comm: swapper Not tainted 2.6.32-279.el6.x86_64 #1
Mar 28 18:34:55 xxx kernel: Call Trace:
Mar 28 18:34:55 xxx kernel: [<ffffffff8106b747>] ? warn_slowpath_common+0x87/0xc0
Mar 28 18:34:55 xxx kernel: [<ffffffff8106b7df>] ? warn_slowpath_fmt_taint+0x3f/0x50
Mar 28 18:34:55 xxx kernel: [<ffffffff8109a869>] ? mark_hardware_unsupported+0x39/0x40
Mar 28 18:34:55 xxx kernel: [<ffffffff81c27b5d>] ? setup_arch+0xb1f/0xb42
Mar 28 18:34:55 xxx kernel: [<ffffffff814fd223>] ? printk+0x41/0x46
Mar 28 18:34:55 xxx kernel: [<ffffffff81c21c33>] ? start_kernel+0xdc/0x430
Mar 28 18:34:55 xxx kernel: [<ffffffff81c2133a>] ? x86_64_start_reservations+0x125/0x129
Mar 28 18:34:55 xxx kernel: [<ffffffff81c21438>] ? x86_64_start_kernel+0xfa/0x109
Mar 28 18:34:55 xxx kernel: ---[ end trace a7919e7f17c0a725 ]---
Mar 28 18:34:55 xxx kernel: NR_CPUS:4096 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
Mar 28 18:34:55 xxx kernel: PERCPU: Embedded 31 pages/cpu @ffff880028200000 s94424 r8192 d24360 u262144
Mar 28 18:34:55 xxx kernel: pcpu-alloc: s94424 r8192 d24360 u262144 alloc=1*2097152
Mar 28 18:34:55 xxx kernel: pcpu-alloc: [0] 0 1 2 3 4 5 6 7
Mar 28 18:34:55 xxx kernel: Built 1 zonelists in Zone order, mobility grouping on. Total pages: 2064657
Mar 28 18:34:55 xxx kernel: Policy zone: Normal
Mar 28 18:34:55 xxx kernel: Kernel command line: ro root=UUID=12b1eb92-e0a3-441c-98e0-6d75d9e510c2 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=128M KEYBOAR
DTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet

参考文章
http://blog.jobbole.com/110581/
http://www.cnblogs.com/brucewoo/archive/2012/12/16/3226861.html

如有不对的地方,欢迎大家拍砖o(∩_∩)o 

本文版权归作者所有,未经作者同意不得转载。

线上centos6出现软死锁 kernel:BUG: soft lockup的更多相关文章

  1. kernel&colon;NMI watchdog&colon; BUG&colon; soft lockup - CPU&num;6 stuck for 28s&excl; CentOS7linux中内核被锁死

    环境说明:虚拟机 CentOS7中解压一个8G的包时,内核报错 Message from syslogd@cosmo-01 at Apr 25 11:05:59 ... kernel:NMI watc ...

  2. 报错kernel&colon;NMI watchdog&colon; BUG&colon; soft lockup - CPU&num;0 stuck for 26s

    近期在服务器跑大量高负载程序,造成cpu soft lockup.如果确认不是软件的问题. 解决办法: #追加到配置文件中 echo 30 > /proc/sys/kernel/watchdog ...

  3. CentOS7运行报错kernel&colon;NMI watchdog&colon; BUG&colon; soft lockup - CPU&num;0 stuck for 26s

    CentOS内核,对应的文件是/proc/sys/kernel/watchdog_thresh.CentOS内核和标准内核还有一个地方不一样,就是处理CPU占用时间过长的函数,CentOS下是watc ...

  4. NMI watchdog&colon; BUG&colon; soft lockup - CPU&num;0 stuck for 22s&excl;

    今天测试环境一虚拟机运行中突然报错,,, 没见过的内核报错,于是google一番. 系统日志: Nov :: dev- kernel: NMI watchdog: BUG: soft lockup - ...

  5. 内核报错kernel&colon;NMI watchdog&colon; BUG&colon; soft lockup - CPU&num;1

    1.现象描述 系统管理员电话通知,描述为一台服务器突然无法ssh连接,登录服务器带外IP地址并进入远程控制台界面后,提示Authentication error,重启后即可正常进入系统,进入后过20分 ...

  6. 安装ubuntu出现BUG soft lockup的解决方法&lpar;16&period;04 14&period;04&rpar;

    对于16.04而言,当时用的是UtrISO 安装的,导致安装过程用会出现 “not a com32r image” 的错误,解决方法见上文的: boot: live 华硕Z9主板安装16.04以上系统 ...

  7. 关于线上的bug什么时候修复的思考

    这里系统专门指的是那种用户量大的系统,比如有几百万或者上千万的注册会员.因为小系统因为用户量少,不存在这种思考,考虑有时候是多余的.另外还有内部系统,给自己公司内部人员使用的,即便是出现了问题,也不会 ...

  8. Android线上Bug热修复分析

    针对app线上修复技术,目前有好几种解决方案,开源界往往一个方案会有好几种实现.重复的实现会有造*之嫌,但分析解决方案在技术上的探索和衍变,这*还是值得去推动的 关于Hot Fix技术 Hot F ...

  9. MySQL死锁系列-线上死锁问题排查思路

    前言 MySQL 死锁异常是我们经常会遇到的线上异常类别,一旦线上业务日间复杂,各种业务操作之间往往会产生锁冲突,有些会导致死锁异常.这种死锁异常一般要在特定时间特定数据和特定业务操作才会复现,并且分 ...

随机推荐

  1. 2012 Multi-University Training Contest 9 &sol; hdu4389

    2012 Multi-University Training Contest 9 / hdu4389 打巨表,实为数位dp 还不太懂 先这样放着.. 对于打表,当然我们不能直接打,这里有技巧.我们可以 ...

  2. S3C6410开发板开发环境的搭建

    本节主要介绍了S3C6410开发板及OK6410开发板.OK6410开发板是基于ARM11处理器的S3C6410,采用“核心版+底板”结构 主要步骤如下:. OK6410开发板自带一个串口,PC也需要 ...

  3. 纪念逝去的岁月——C&sol;C&plus;&plus;交换排序

    交换排序 代码 #include <stdio.h> void printList(int iList[], int iLen) { ; ; i < iLen; i++) { pri ...

  4. Java接入图灵机器人,实现与机器人聊天

    很多人都玩过微信,其中就有与机器人聊天的功能:

  5. expdp ORA-39213

    [oracle@BI2 dir_dp]$ impdp ruijie_kettle/ruijie_kettle schemas=ruijie_kettle directory=dir_dp dumpfi ...

  6. &lbrack;css3动画&rsqb;渐隐渐现

    测试 <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8& ...

  7. RandomAccessFile乱码问题

    转自:http://www.cnblogs.com/xudong-bupt/archive/2013/04/20/3028980.html     Thanks Java对文件的读.写随机访问,Ran ...

  8. win32多线程编程

    关于多线程多进程的学习,有没有好的书籍我接触的书里头关于多线程多进程部分,一是<操作系统原理>里面讲的相关概念   一个是<linux基础教程>里面讲的很简单的多线程多进程编程 ...

  9. &bsol;Process&lpar;sqlservr&rpar;&bsol;&percnt; Processor Time 计数器飙高

    计数器" \Process(sqlservr)\% Processor Time",是经常监测,看看SQL Server如何消耗CPU资源.sqlserver是如何利用现有的资源; ...

  10. Flutter 输入控件TextField设置内容并保持光标&lpar;cursor&rpar;在末尾

    TextField( controller: TextEditingController.fromValue(TextEditingValue( // 设置内容 text: inputText, // ...