Troubleshooting the Cause of Automatic Server Restarts
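Two Windows Server 2003 R2 servers I look after have recently kept restarting for no apparent reason. Even after the application on the standby machine was stopped, monitoring showed another restart in the early hours of September 1:
[Device metric: [Zabbix agent] alert, status: Zabbix agent unreachable] (2018-09-01 04:38:37) [Device alert, status: operating system restarted] (2018-09-01 04:45:28)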
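1. Check the operating system logs
To view the operating system logs, go to Control Panel -> Administrative Tools -> Event Viewer -> System and look at the events recorded around the time of the restart. The event IDs usually worth checking are 1076, 1074, 6013, 6008 and 1001. Related entries were found in the log around the restart time.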
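When the System log is long, filtering an exported copy can be quicker than scrolling through Event Viewer. The following is a minimal sketch, assuming the log has been exported to CSV with `Time`, `EventID` and `Source` columns; those column names, the 30-minute window, and the sample rows are made up for illustration and are not taken from the server's actual log. The event descriptions in the comments are the commonly documented meanings of these IDs.

```python
import csv
import io
from datetime import datetime, timedelta

# Event IDs named in the article, with their commonly documented meanings.
RESTART_EVENT_IDS = {
    1074: "shutdown/restart initiated by a process or user",
    1076: "reason recorded for an unexpected shutdown",
    6008: "previous system shutdown was unexpected",
    6013: "system uptime",
    1001: "bugcheck recorded, dump saved",
}


def filter_restart_events(csv_text, reboot_time, window_minutes=30):
    """Return (time, event_id, meaning) tuples for interesting events near reboot_time."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        event_id = int(row["EventID"])
        when = datetime.strptime(row["Time"], "%Y-%m-%d %H:%M:%S")
        if event_id in RESTART_EVENT_IDS and abs(when - reboot_time) <= window:
            hits.append((when, event_id, RESTART_EVENT_IDS[event_id]))
    return sorted(hits)


# Hypothetical sample rows, for illustration only.
SAMPLE = """Time,EventID,Source
2018-09-01 04:46:10,6008,EventLog
2018-09-01 04:46:30,1001,Save Dump
2018-09-01 04:47:00,7036,Service Control Manager
2018-08-31 12:00:00,6013,EventLog
"""

# Restart time taken from the monitoring alert quoted in the article.
REBOOT_TIME = datetime(2018, 9, 1, 4, 45, 28)

for when, event_id, meaning in filter_restart_events(SAMPLE, REBOOT_TIME):
    print(when, event_id, meaning)
```

Only events that are both on the watch list and inside the time window are kept, which is the same narrowing done by hand in Event Viewer.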
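2. Analyze the memory.dmp file
memory.dmp is written when the operating system hits a fatal error: the system dumps the contents of memory at that moment (including virtual memory) straight to a file so that an engineer can examine it later.
To analyze memory.dmp, I installed Debugging Tools for Windows and opened WinDbg. To configure symbols, choose File -> Symbol File Path and paste the following into the text box (D:\Symbol serves as the local working directory where downloaded symbols are cached):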
srv*D:\Symbol*http://msdl.microsoft.com/download/symbols;D:\Symbol
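The symbol path above has three parts: the `srv*` prefix, a local cache directory, and the symbol server URL, followed by a plain local directory after the semicolon. As a small sketch of how the string is put together (the helper name `build_symbol_path` is hypothetical, introduced here only for illustration), another cache directory can be substituted like this:

```python
MS_SYMBOL_SERVER = "http://msdl.microsoft.com/download/symbols"


def build_symbol_path(cache_dir, server=MS_SYMBOL_SERVER):
    """Build a WinDbg symbol path: symbol server with a local cache, then the cache itself."""
    # srv*<cache>*<server> tells the debugger to download symbols into <cache>;
    # the entry after ';' lets it also search that directory directly.
    return f"srv*{cache_dir}*{server};{cache_dir}"


print(build_symbol_path("D:\\Symbol"))
```

With `D:\Symbol` as the cache directory, the helper produces the same string used in this article.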
Choose File -> Open Crash Dump and select the memory.dmp file. Once the window shows "Use !analyze -v to get detailed debugging information.", type !analyze -v at the kd> prompt to start the analysis:
22: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: 92f91000, memory referenced
Arg2: d0000002, IRQL
Arg3: 00000000, value 0 = read operation, 1 = write operation
Arg4: f7ad7e34, address which referenced memory
Debugging Details:
------------------
Page 21a05a not present in the dump file. Type ".hh dbgerr004" for details
Page 21a07b not present in the dump file. Type ".hh dbgerr004" for details
READ_ADDRESS: 92f91000
CURRENT_IRQL: 2
FAULTING_IP:
Ntfs!NtfsAllocateRestartTableIndex+68
f7ad7e34 8b02 mov eax,dword ptr [edx]
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0xD1
PROCESS_NAME: helpsvc.exe
TRAP_FRAME: b7da95fc -- (.trap 0xffffffffb7da95fc)
ErrCode = 00000000
eax=08000000 ebx=80a61480 ecx=00000000 edx=92f91000 esi=8af91000 edi=8e88c2c8
eip=f7ad7e34 esp=b7da9670 ebp=b7da9688 iopl=0 nv up ei ng nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
Ntfs!NtfsAllocateRestartTableIndex+0x68:
f7ad7e34 8b02 mov eax,dword ptr [edx] ds:0023:92f91000=????????
Resetting default scope
LAST_CONTROL_TRANSFER: from f7ad7e34 to 8088e730
STACK_TEXT:
b7da95fc f7ad7e34 badb0d00 92f91000 80a63456 nt!KiTrap0E+0x18c
b7da9688 f7b11779 08000000 00000001 c4ac0000 Ntfs!NtfsAllocateRestartTableIndex+0x68
b7da97f0 f7b1b6aa 8b245ef8 e16330d0 8cfc7d68 Ntfs!NtfsWriteLog+0x22a
b7da9984 f7b1b786 8b245ef8 e16330d0 e19629c8 Ntfs!NtfsUpdateFileNameInIndex+0x128
b7da9a80 f7b09791 8b245ef8 e1962728 e1962978 Ntfs!NtfsUpdateDuplicateInfo+0x2b0
b7da9ae0 f7b09cac 8b245ef8 e1962728 e250b750 Ntfs!NtfsUpdateFileDupInfo+0xf0
b7da9b74 f7b0e377 8b245ef8 8cf393e8 8cf348c0 Ntfs!NtfsSetBasicInfo+0x3b5
b7da9be0 f7ad9fd8 8b245ef8 8cf348c0 8e88c330 Ntfs!NtfsCommonSetInformation+0x40a
b7da9c48 8081e185 8e18e020 8cf348c0 8e847f38 Ntfs!NtfsFsdSetInformation+0xa3
b7da9c5c f7875c59 8e847f38 b857dfc8 000001c8 nt!IofCallDriver+0x45
b7da9c84 8081e185 8e88c330 8cf348c0 8cf34a50 fltmgr!FltpDispatch+0x6f
b7da9c98 b85438f5 8cf34a50 8cdbf288 8cf348c0 nt!IofCallDriver+0x45
WARNING: Stack unwind information not available. Following frames may be wrong.
b7da9cac 8081e185 8dfb2020 8cf348c0 8cec2e58 eamon+0x58f5
b7da9cc0 808f34dd b7da9d64 00cbf3a8 808f2f3e nt!IofCallDriver+0x45
b7da9d48 8088b658 000002a0 00cbf3d0 00cbf3a8 nt!NtSetInformationFile+0x59f
b7da9d48 7c9583ac 000002a0 00cbf3d0 00cbf3a8 nt!KiSystemServicePostCall
00cbf3d8 00000000 00000000 00000000 00000000 0x7c9583ac
STACK_COMMAND: kb
FOLLOWUP_IP:
eamon+58f5
b85438f5 83c9ff or ecx,0FFFFFFFFh
SYMBOL_STACK_INDEX: c
SYMBOL_NAME: eamon+58f5
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: eamon
IMAGE_NAME: eamon.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 53f5d906
FAILURE_BUCKET_ID: 0xD1_eamon+58f5
BUCKET_ID: 0xD1_eamon+58f5
Followup: MachineOwner
---------
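The output above is long, but only a handful of fields matter for a first reading. The following is a rough sketch of pulling them out of saved `!analyze -v` text; the function name `extract_fields` and the choice of fields are assumptions made for illustration, and the sample lines are copied from the output above.

```python
import re

# Matches lines of the form "FIELD_NAME: value" as printed by !analyze -v.
FIELD_RE = re.compile(r"^([A-Z_]+):\s+(\S.*)$")

KEY_FIELDS = (
    "BUGCHECK_STR",
    "PROCESS_NAME",
    "MODULE_NAME",
    "IMAGE_NAME",
    "FAILURE_BUCKET_ID",
)


def extract_fields(analyze_output, wanted=KEY_FIELDS):
    """Return a dict of the wanted fields, keeping the first occurrence of each."""
    fields = {}
    for line in analyze_output.splitlines():
        match = FIELD_RE.match(line.strip())
        if match and match.group(1) in wanted:
            fields.setdefault(match.group(1), match.group(2).strip())
    return fields


# Lines copied from the analysis output in this article.
EXCERPT = """
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0xD1
PROCESS_NAME: helpsvc.exe
FOLLOWUP_IP:
eamon+58f5
MODULE_NAME: eamon
IMAGE_NAME: eamon.sys
FAILURE_BUCKET_ID: 0xD1_eamon+58f5
"""

for name, value in extract_fields(EXCERPT).items():
    print(f"{name:18} {value}")
```

Reading the fields side by side makes it easier to separate the process that happened to be running (PROCESS_NAME) from the module the debugger blames (MODULE_NAME and IMAGE_NAME).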
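3. Fix
Cause: DRIVER_IRQL_NOT_LESS_OR_EQUAL means that a kernel-mode driver tried to access a pageable or invalid memory address while running at too high an interrupt request level (IRQL).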
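PROCESS_NAME: helpsvc.exe. This process carries a risk of exhausting the machine's resources. Go to Start -> Run -> services.msc, stop the "Help and Support" service, and change its startup type to Manual. Note that PROCESS_NAME only identifies the process that was running when the crash occurred; the analysis itself attributes the fault to eamon.sys (MODULE_NAME: eamon), so that driver is also worth checking.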
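4. The second automatic restart
The next morning the server restarted on its own again. Event Viewer showed stop code 0x0000007e, and the dump file gave the error name SYSTEM_THREAD_EXCEPTION_NOT_HANDLED, which means a system thread raised an exception that the error handler did not catch. This kind of error is harder to pin down. One option is to disable automatic restart on blue screen, so that the next time it happens the information shown on the blue screen can be examined for further analysis.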
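The automatic-restart option is normally changed in System Properties -> Advanced -> Startup and Recovery by clearing "Automatically restart". The same setting is stored in the `AutoReboot` value under the `CrashControl` registry key, so it can also be set from the command line with `reg add`. The sketch below only builds and prints that command; the helper names are hypothetical, and actually running the command is left as a deliberate, separate step that needs administrator rights on the target server.

```python
import subprocess
import sys

CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"


def disable_auto_reboot_command():
    """Return the reg.exe arguments that set AutoReboot to 0 (no restart after a bugcheck)."""
    return [
        "reg", "add", CRASH_CONTROL_KEY,
        "/v", "AutoReboot",
        "/t", "REG_DWORD",
        "/d", "0",
        "/f",  # overwrite the existing value without prompting
    ]


def apply_command(command):
    """Run the command; only meaningful on Windows with administrator rights."""
    if sys.platform != "win32":
        raise RuntimeError("this registry change only applies to Windows")
    subprocess.run(command, check=True)


# Print the command for review instead of executing it.
print(" ".join(disable_auto_reboot_command()))
```

Printing first and applying later keeps the change reviewable, which matters on a production server that is already unstable.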
This time the dump file pointed to Ntfs.sys, so I rescanned the disk; it was in good condition.
5. Final solution
Installed the operating system patches. The problem has not recurred since the update.
2018-11-09 11:26:42