oom_kill_process造成数据库挂起并出现found dead shared server

时间:2022-01-21 23:50:07

这篇博客是上一篇博客Oracle shutdown immediate遭遇ORA-24324 ORA-24323 ORA-01089的延伸(数据库挂起hang时,才去重启的),其实这是我们海外一工厂的遇到的案例,把内容拆开是因为这个case分开讲述显得主题明确一些。正式进入主题:

 

服务器数据库版本Oracle Database 10g Release 10.2.0.4.0,操作系统为Red Hat Enterprise Linux Server release 5.7,虚拟机。当时告警日志里面出现大量的found dead shared server这里信息。数据库也出现连接不上的情况

 

found dead shared server 'S016', pid = (35, 23)
found dead shared server 'S023', pid = (42, 1)
Fri Aug  5 10:28:48 2016
found dead shared server 'S013', pid = (32, 110)
found dead shared server 'S021', pid = (40, 1)
Fri Aug  5 10:33:53 2016
found dead shared server 'S012', pid = (31, 132)
found dead shared server 'S023', pid = (38, 3)
Fri Aug  5 10:38:55 2016
found dead shared server 'S013', pid = (32, 111)
found dead shared server 'S022', pid = (42, 3)
Fri Aug  5 10:40:53 2016
found dead shared server 'S020', pid = (39, 4)
found dead shared server 'S021', pid = (40, 3)
failed to start shared server, oer=0

oom_kill_process造成数据库挂起并出现found dead shared server

 

通过检查发现系统内存耗尽,出现了oom_kill 。OOMkiller,即out of memory killer,是linux下面当内存耗尽时的的一种处理机制。当内存较少时,OOM会遍历整个进程链表,然后根据进程的内存使用情况以及它的oom score值最终找到得分较高的进程,然后发送kill信号将其杀掉。

 

oom_kill_process造成数据库挂起并出现found dead shared server

# grep -i kill /var/log/messages | more
Aug  5 10:12:10 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:12:10 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug  5 10:12:11 xxxxx kernel: Killed process 21687 (oracle)
Aug  5 10:12:11 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:12:11 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug  5 10:12:11 xxxxx kernel: Killed process 21668 (oracle)
Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Killed process 21756 (oracle)
Aug  5 10:23:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:23:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug  5 10:23:09 xxxxx kernel: Killed process 21732 (oracle)
Aug  5 10:28:08 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:28:08 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug  5 10:28:09 xxxxx kernel: Killed process 21752 (oracle)
Aug  5 10:28:09 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:28:09 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug  5 10:28:09 xxxxx kernel: Killed process 21722 (oracle)
Aug  5 10:32:24 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug  5 10:32:24 xxxxx kernel: Killed process 21718 (oracle)
Aug  5 10:32:24 xxxxx kernel: gdm-rh-security invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:32:24 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug  5 10:32:24 xxxxx kernel: Killed process 22053 (oracle)
Aug  5 10:37:54 xxxxx kernel: beremote invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug  5 10:37:54 xxxxx kernel: Killed process 22238 (oracle)
Aug  5 10:37:54 xxxxx kernel: oracle invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
Aug  5 10:37:54 xxxxx kernel:  [<ffffffff810d9ae6>] oom_kill_process+0x85/0x25b
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
--More--

oom_kill_process造成数据库挂起并出现found dead shared server

 

从上面可以看到大量的ORACLE的进程被kill掉,从而导致ORACLE出现"found dead shared server 'S016', pid = (35, 23)"这类错误,在官方文档Found Dead Shared Server Messages Reported In Alert.Log (文档 ID 760872.1) 有如下介绍(这个文档较老旧,不过原理依然适用于此处环境):

 

SYMPTOMS


The following is being reported in the alert.log file

Mon Dec 22 16:48:31 2008
found dead shared server 'S004', pid = (13, 1)
found dead shared server 'S001', pid = (10, 1)

No further errors accompany those messages in the alert.log file.

Listener.log shows no errors.

CAUSE

When a session is killed, in some circumstances, the shared server could die as well leading to those messages being reported in the alert.log file.
This has been verified by << Bug 4270723 >> FOUND DEAD SHARED SERVER

 

也就是说当一个会话被杀,在某些情况下, shared sever进程会死掉,导致上面“found dead shared server”出现在告警日志中。 这个案例中,由于系统大量kill掉会话进程,导致shared server进程死掉。所以Oracle数据库出现无法访问的情况。突然挂起。

 

$grep "Out of memory" messages
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21687 (oracle) score 2296119 or a child
Aug  5 10:12:11 xxxxx kernel: Out of memory: kill process 21668 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21756 (oracle) score 2144517 or a child
Aug  5 10:23:09 xxxxx kernel: Out of memory: kill process 21732 (oracle) score 2138384 or a child
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21752 (oracle) score 2144521 or a child
Aug  5 10:28:09 xxxxx kernel: Out of memory: kill process 21722 (oracle) score 2138377 or a child
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 21718 (oracle) score 2135307 or a child
Aug  5 10:32:24 xxxxx kernel: Out of memory: kill process 22053 (oracle) score 2135300 or a child
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22238 (oracle) score 2134274 or a child
Aug  5 10:37:54 xxxxx kernel: Out of memory: kill process 22128 (oracle) score 2133001 or a child
Aug  5 10:38:46 xxxxx kernel: Out of memory: kill process 22227 (oracle) score 2132996 or a child
Aug  5 10:39:14 xxxxx kernel: Out of memory: kill process 22229 (oracle) score 2134280 or a child
Aug  5 10:40:57 xxxxx kernel: Out of memory: kill process 22286 (oracle) score 2135299 or a child
Aug  5 10:41:24 xxxxx kernel: Out of memory: kill process 22245 (oracle) score 2135302 or a child
Aug  5 10:41:25 xxxxx kernel: Out of memory: kill process 22485 (oracle) score 2133009 or a child
Aug  5 10:41:56 xxxxx kernel: Out of memory: kill process 21779 (oracle) score 2132880 or a child
Aug  5 10:42:08 xxxxx kernel: Out of memory: kill process 22068 (oracle) score 2132873 or a child
Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22249 (oracle) score 2132873 or a child
Aug  5 10:42:26 xxxxx kernel: Out of memory: kill process 22278 (oracle) score 2132873 or a child
Aug  5 10:42:31 xxxxx kernel: Out of memory: kill process 21662 (oracle) score 2132872 or a child
Aug  5 10:42:47 xxxxx kernel: Out of memory: kill process 22045 (oracle) score 2132872 or a child
Aug  5 10:42:57 xxxxx kernel: Out of memory: kill process 22314 (oracle) score 2132872 or a child
Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22336 (oracle) score 2132872 or a child
Aug  5 10:43:35 xxxxx kernel: Out of memory: kill process 22435 (oracle) score 2132870 or a child
Aug  5 10:43:55 xxxxx kernel: Out of memory: kill process 21666 (oracle) score 2132869 or a child
Aug  5 10:44:02 xxxxx kernel: Out of memory: kill process 22263 (oracle) score 2132869 or a child
Aug  5 10:44:19 xxxxx kernel: Out of memory: kill process 22405 (oracle) score 2132866 or a child
Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22438 (oracle) score 2132866 or a child
Aug  5 10:44:20 xxxxx kernel: Out of memory: kill process 22453 (oracle) score 2132865 or a child
Aug  5 10:44:23 xxxxx kernel: Out of memory: kill process 22466 (oracle) score 2132737 or a child
Aug  5 10:44:26 xxxxx kernel: Out of memory: kill process 22499 (oracle) score 2132607 or a child
Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21716 (oracle) score 1078417 or a child
Aug  5 10:44:27 xxxxx kernel: Out of memory: kill process 21670 (oracle) score 1066455 or a child
Aug  5 10:48:02 xxxxx kernel: Out of memory: kill process 22829 (oracle) score 2134273 or a child
Aug  5 10:49:47 xxxxx kernel: Out of memory: kill process 22900 (oracle) score 2133007 or a child
Aug  5 10:50:36 xxxxx kernel: Out of memory: kill process 22842 (oracle) score 2133095 or a child
Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 22990 (oracle) score 2134285 or a child
Aug  5 10:51:25 xxxxx kernel: Out of memory: kill process 23054 (oracle) score 2132994 or a child
Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 22933 (oracle) score 2134277 or a child
Aug  5 10:51:49 xxxxx kernel: Out of memory: kill process 23103 (oracle) score 2132996 or a child
Aug  5 10:52:52 xxxxx kernel: Out of memory: kill process 23211 (oracle) score 2134267 or a child

 

在官方文档Oracle VM Server hangs after Invoking the OOM Killer and having hundreds of kpartx processes spawned and "state S non-preferred supports toluSnA" reported on the FC LUNs (文档 ID 2123877.1)介绍了这么一个案例

 

APPLIES TO:

Oracle VM - Version 3.3.2 and later
Linux x86-64

SYMPTOMS

On Oracle VM Server, during normal server operation, the server suddenly hangs, with the following kind of messages being logged about invoking the Out Of Memory Killer :

$ grep "Out of memory" messages
Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:51 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:56 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:57 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:17:58 ovs01 kernel: Out of memory: Kill process 7425 (xend) score 2 or sacrifice child
Mar 19 09:18:38 ovs01 kernel: Out of memory: Kill process 7557 (xend) score 2 or sacrifice child
Mar 19 09:19:21 ovs01 kernel: Out of memory: Kill process 17926 (mounted.ocfs2) score 1 or sacrifice child

Stack traces like this one are also being printed on the console :

Mar 19 09:17:50 ovs01 kernel: Out of memory: Kill process 7382 (xend) score 2 or sacrifice child
Mar 19 09:17:50 ovs01 kernel: Killed process 7422 (sh) total-vm:5512kB, anon-rss:40kB, file-rss:108kB
Mar 19 09:17:51 ovs01 kernel: snmpd invoked oom-killer: gfp_mask=0x2800d0, order=0, oom_score_adj=0
Mar 19 09:17:51 ovs01 kernel: snmpd cpuset=/ mems_allowed=0
Mar 19 09:17:51 ovs01 kernel: Pid: 6602, comm: snmpd Tainted: G W 3.8.13-55.1.8.el6uek.x86_64 #2
Mar 19 09:17:51 ovs01 kernel: Call Trace:
Mar 19 09:17:51 ovs01 kernel: [<ffffffff810d6bba>] ? cpuset_print_task_mems_allowed+0xba/0xe0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff81133c43>] dump_header+0x93/0x1f0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8128d330>] ? ___ratelimit+0xa0/0x120
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8159a43a>] ? error_exit+0x2a/0x60
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811343da>] oom_kill_process+0x20a/0x350
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8122b735>] ? security_capable_noaudit+0x15/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bcc5>] ? has_ns_capability_noaudit+0x15/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8106bce7>] ? has_capability_noaudit+0x17/0x20
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811347a6>] out_of_memory+0x286/0x2d0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a1a2>] __alloc_pages_slowpath+0x742/0x7c0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8113a51b>] __alloc_pages_nodemask+0x2fb/0x320
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117ec27>] kmem_getpages+0x67/0x1c0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117fa99>] fallback_alloc+0x189/0x250
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8117f85a>] ____cache_alloc_node+0x9a/0x150
Mar 19 09:17:51 ovs01 kernel: [<ffffffff8118102f>] kmem_cache_alloc+0x15f/0x2f0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fa370>] proc_alloc_inode+0x20/0xa0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adec7>] alloc_inode+0x27/0xa0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adf5b>] new_inode_pseudo+0x1b/0x70
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811adfcd>] new_inode+0x1d/0x40
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc648>] proc_pid_make_inode+0x28/0x100
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdbbb>] proc_pid_instantiate+0x1b/0xd0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fc2d6>] proc_fill_cache+0x126/0x150
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fdba0>] ? pid_revalidate+0x120/0x120
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fe25e>] proc_pid_readdir+0xfe/0x170
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811fae3a>] proc_root_readdir+0x4a/0x60
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a58f0>] ? sys_ioctl+0xb0/0xb0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5ab8>] vfs_readdir+0xb8/0xd0
Mar 19 09:17:51 ovs01 kernel: [<ffffffff811a5c65>] sys_getdents+0x95/0x110
Mar 19 09:17:51 ovs01 kernel: [<ffffffff815a2599>] system_call_fastpath+0x16/0x1b

In the processes dump before the Out Of Memory Killer starts killing processes, there are hundreds of kpartx instances :

$ grep -c kpartx messages
124438

Error messages like the following one are also being reported :

Mar 13 03:59:20 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 03:59:40 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 03:59:55 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:13 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:24 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:00:58 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:09 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:34 ovs01 kernel: sd 11:0:3:98: alua: port group 01 state S non-preferred supports toluSnA
Mar 13 04:01:46 ovs01 kernel: sd 12:0:2:98: alua: port group 01 state S non-preferred supports toluSnA

 

CAUSE

Issue(s) on the Fiber Channel Storage Server which is hosting the LUNs used by the Oracle VM Server.
 

SOLUTION

To resolve this issue, please resolve the issue on the Fiber Channel Storage Array - The issue on the VM Server will then resolve as the issues on the Storage is fixed.

 

 

跟这个案例非常类似,猜测就有可能就是这个原因造成的,需要海外那边的系统管理员去检查、处理。哎,但是一直没给我反馈。有时候要帮他们看着这些属于他们职责范围内的事情,还得不到有效支持,真心觉得累!