一、故障描述
某次,用户反馈HIS数据库在完成容灾切换之后,在RACDG端只能正常打开一个节点实例,此时如果手动启动另一个节点实例,该实例能够正常打开,但是会导致之前打开的实例命中ORA-7445,进而引起实例异常宕机。由于RACDG端只能读写打开一个实例,业务虽然可以正常接管,但考虑到业务的稳定性、连续性和性能,建议用户尽快回切。随后,HIS数据库完成回切。随后,针对RACDG端两个实例读写打开为什么会命中ORA-7445,进行远程根因排查,详细分析过程如下。
二、问题时间点关键描述
(1)25号17点:容灾切换后,RACDG端两节点实例读写打开后,其中一节点命中ORA-7445进而导致实例异常宕机。
(2)25号18点:HIS数据库完成回切。
三、问题详细诊断过程
(1)2节点orcl2实例告警日志分析
Fri Mar 25 19:46:01 2022 Starting ORACLE instance (normal) ************************ Large Pages Information ******************* Per process system memlock (soft) limit = UNLIMITED Total Shared Global Region in Large Pages = 133 GB (100%) Large Pages used by this instance: 68097 (133 GB) Large Pages unused system wide = 28347 (55 GB) Large Pages configured system wide = 96444 (188 GB) Large Page size = 2048 KB ******************************************************************** LICENSE_MAX_SESSION = 0 LICENSE_SESSIONS_WARNING = 0 Initial number of CPU is 64 Number of processor cores in the system is 64 Number of processor sockets in the system is 4 Private Interface 'eth1:1' configured from GPnP for use as a private interconnect. [name='eth1:1', type=1, ip=169.254.24.180, mac=00-50-56-aa-ce-92, net=169.254.0.0/16, mask=255.255.0.0, use=haip:cluster_interconnect/62] Public Interface 'eth0' configured from GPnP for use as a public interface. [name='eth0', type=1, ip=192.200.54.32, mac=00-50-56-aa-fa-17, net=192.200.54.0/24, mask=255.255.255.0, use=public/1] Picked latch-free SCN scheme 3 Autotune of undo retention is turned on. LICENSE_MAX_USERS = 0 SYS auditing is disabled NUMA system with 2 nodes detected Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production With the Partitioning, Real Application Clusters, OLAP, Data Mining and Real Application Testing options. ... Deprecated system parameters with specified values: standby_archive_dest End of deprecated system parameter listing Cluster communication is configured to use the following interface(s) for this instance 169.254.24.180 cluster interconnect IPC version:Oracle UDP/IP (generic) IPC Vendor 1 proto 2 ... starting up 1 dispatcher(s) for network address '(ADDRESS=(PARTIAL=YES)(PROTOCOL=TCP))'... starting up 1 shared server(s) ... NOTE: initiating MARK startup Starting background process MARK Fri Mar 25 19:46:11 2022 MARK started with pid=37, OS id=58633 NOTE: MARK has subscribed lmon registered with NM - instance number 2 (internal mem no 1) Reconfiguration started (old inc 0, new inc 16) List of instances: 1 2 (myinst: 2) Global Resource Directory frozen * allocate domain 0, invalid = TRUE Communication channels reestablished * domain 0 valid = 1 according to instance 1 Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Submitted all GCS remote-cache requests Fix write in gcs resources Reconfiguration complete Fri Mar 25 19:46:13 2022 LCK0 started with pid=39, OS id=58653 ... Fri Mar 25 19:46:21 2022 SMON: enabling tx recovery Database Characterset is ZHS16GBK Starting background process GTX0 Fri Mar 25 19:46:21 2022 GTX0 started with pid=49, OS id=58745 Starting background process RCBG Fri Mar 25 19:46:21 2022 RCBG started with pid=50, OS id=58747 replication_dependency_tracking turned off (no async multimaster replication found) Starting background process QMNC Fri Mar 25 19:46:22 2022 QMNC started with pid=51, OS id=58751 Completed: ALTER DATABASE OPEN /* db agent *//* {0:7:9} */ Fri Mar 25 19:46:36 2022 Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121]. Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. Fri Mar 25 19:46:38 2022 Reconfiguration started (old inc 16, new inc 18) List of instances: 2 (myinst: 2) Global Resource Directory frozen * dead instance detected - domain 0 invalid = TRUE Communication channels reestablished Master broadcasted resource hash value bitmaps Non-local Process blocks cleaned out Fri Mar 25 19:46:38 2022 Fri Mar 25 19:46:38 2022 Fri Mar 25 19:46:38 2022 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Fri Mar 25 19:46:38 2022 LMS 3: 1 GCS shadows cancelled, 0 closed, 0 Xw survived LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived Set master node info Submitted all remote-enqueue requests Dwn-cvts replayed, VALBLKs dubious All grantable enqueues granted Post SMON to start 1st pass IR Submitted all GCS remote-cache requests Post SMON to start 1st pass IR Fix write in gcs resources Reconfiguration complete
在后续验证中,从以上alert_orcl2.log日志不难发现,在1节点实例正常运行的前提下,在19点46分01秒,手动启动2节点实例,在启动过程中会第一次触发重新配置,将2节点实例加入集群。在19点46分22秒,2节点实例正常打开,但随后不久在19点46分36秒,提示1节点实例异常宕机,并转储DIAG故障诊断TRACE信息。之后,在19点46分38秒,第二次触发重新配置,将1节点移出集群。
(2)1节点orcl1实例告警日志分析
Fri Mar 25 19:46:23 2022 Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Errors in file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc (incident=1648121): ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] Incident details in: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc Use ADRCI or Support Workbench to package the incident. See Note 411.1 at My Oracle Support for error and packaging details. Fri Mar 25 19:46:24 2022 Dumping diagnostic data in directory=[cdmp_20220325194624], requested by (instance=1, osid=40273 (LMS3)), summary=[incident=1648121]. Fri Mar 25 19:46:24 2022 Sweep [inc][1648121]: completed Sweep [inc2][1648121]: completed Fri Mar 25 19:46:25 2022 PMON (ospid: 40198): terminating the instance due to error 484 System state dump requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. System State dumped to trace file /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_diag_40249_20220325194625.trc Dumping diagnostic data in directory=[cdmp_20220325194625], requested by (instance=1, osid=40198 (PMON)), summary=[abnormal instance termination]. Instance terminated by PMON, pid = 40198
从以上alert_orcl1.log日志不难发现,在19点46分23秒,1节点实例很快就命中ORA-7445报错;在19点46分25秒,1节点实例被PMON进程杀掉造成实例宕机。通过以上3.1日志分析,我们知道2节点实例在19点46分22秒完成正常打开,但紧接着1节点就立马宕机。结合该现象与处理ORA-7445的经验,怀疑很可能是命中Oracle软件BUG。
(3)1节点orcl1实例lms3进程trace日志分析
*** 2022-03-25 19:46:23.680 SKGXP:[7f248be356c8.0]{0}: SKGXP Assertion msize <= SKGXP_ATTR_DGRAM_MTU_MAX(ctx) Failed at location skgxpseginit2 line_num: 31934 *** 2022-03-25 19:46:23.681 Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Incident 1648121 created, dump file: /u01/app/oracle/diag/rdbms/orcl/orcl1/incident/incdir_1648121/orcl1_lms3_40273_i1648121.trc ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] ssexhd: crashing the process... Background_Core_Dump = partial ksdbgcra: writing core file to directory '/u01/app/oracle/diag/rdbms/orcl/orcl1/cdump'
从以上orcl1_lms3_40273.trc日志不难发现,lms3进程在运行过程中命中ORA-7445报错,其作用是用于集群之间资源管理和控制。该进程在命中ORA-7445报错之前,已经创建相应的事件(1648121)跟踪文件。因此,需要进一步分析orcl1_lms3_40273_i1648121.trc文件。
(4)1节点orcl1实例lms3进程incident日志分析
*** 2022-03-25 19:46:23.695 *** SESSION ID:(3527.1) 2022-03-25 19:46:23.695 *** CLIENT ID:() 2022-03-25 19:46:23.695 *** SERVICE NAME:(SYS$BACKGROUND) 2022-03-25 19:46:23.695 *** MODULE NAME:() 2022-03-25 19:46:23.695 *** ACTION NAME:() 2022-03-25 19:46:23.695 Dump continued from file: /u01/app/oracle/diag/rdbms/orcl/orcl1/trace/orcl1_lms3_40273.trc ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] [ADDR:0x1F600009D51] [PC:0x39E5432495] [unknown code] [] ========= Dump for incident 1648121 (ORA 7445 [gsignal()+53]) ======== ----- Beginning of Customized Incident Dump(s) ----- Exception [type: SIGIOT, unknown code] [ADDR:0x1F600009D51] [PC:0x39E5432495, gsignal()+53] [exception issued by pid: 40273, uid: 502] [flags: 0x0, count: 1] Registers: %rax: 0x0000000000000000 %rbx: 0x00007f2487d41000 %rcx: 0xffffffffffffffff %rdx: 0x0000000000000006 %rdi: 0x0000000000009d51 %rsi: 0x0000000000009d51 %rsp: 0x00007ffdcda48b78 %rbp: 0x00007f248d9a5bf0 %r8: 0x00000000ffffffff %r9: 0xfeff092d63646b68 %r10: 0x0000000000000008 %r11: 0x0000000000000202 %r12: 0x00007f248d9a74ac %r13: 0x00007f248d9a5be0 %r14: 0x00007f248d9ba68c %r15: 0x0000000000007cbe %rip: 0x00000039e5432495 %efl: 0x0000000000000202 gsignal()+40 (0x39e5432488) movsxdq %esi,%rsi gsignal()+43 (0x39e543248b) movsxdq %eax,%rdi gsignal()+46 (0x39e543248e) mov $0xea,%eax gsignal()+51 (0x39e5432493) syscall > gsignal()+53 (0x39e5432495) cmp $0xfffff000,%rax gsignal()+59 (0x39e543249b) ja 0x39e54324af gsignal()+61 (0x39e543249d) (repe) ret gsignal()+63 (0x39e543249f) nop gsignal()+64 (0x39e54324a0) test %eax,%eax
从以上orcl1_lms3_40273_i1648121.trc日志不难发现,有很多ORA-7445报错相关的转储信息,考虑到7445是Oracle内部的错误代码,这类报错命中BUG的概率较大,因此在Oracle官方MOS平台上进行匹配,发现Bug 25722720 - RAC LMS getting ORA-07445: exception encountered: core dump [gsignal()+53] [SIGIOT] (Doc ID 25722720.8),与我们的报错现象高度类似,具体匹配情况如下:
从目前日志分析结果来看,RACDG两节点实例打开后,其中一个实例立马命中ORA-7445导致实例宕机,虽然怀疑跟Bug 25722720相关,但是官方针对该BUG只发布了12.2.0.1版本的补丁,因此需要进一步定位。
(5)备节点实例模拟测试复现ORA-7445
由于切换问题,此时备节点已经强制打开,后续测试均在备节点上进行。
测试步骤如下:
1、启动备库一节点,此时数据库正常
2、启动备库二节点,此时情况复现,实例节点宕机,且后台日志出现相同ORA-7445报错
3、查询切换过程中出现的等待事件,结果如下:
SQL> select instance_number,event 2 from DBA_HIST_ACTIVE_SESS_HISTORY 3 where to_char(sample_time, 'yyyy-mm-dd hh24:mi') >= '2022-03-28 22:00' and 4 to_char(sample_time, 'yyyy-mm-dd hh24:mi') <= '2022-03-28 23:00' 5 and event in('enq: US - contention','row cache lock');
发现存在大量 row cache lock、enq: US – contention相关等待,此时想到之前曾经在HIS库上出现过由于回滚段的争用导致业务卡顿问题,曾对数据库做过隐含参数层面调整。
--高并发环境下,设置以下参数尽可能多的保留online回滚段数量 ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*'; --关闭undo retention自动调整功能 ALTER SYSTEM SET "_undo_autotune" = false sid='*'; --设置_highthreshold_undoretention控制undo retention alter system set "_highthreshold_undoretention"=21600 sid='*';
排查主备库两端参数文件设置有差异,发现备库端未设置相关隐含参数。一般我们建议主备两端参数需保持一致,故思路为备库端设置隐藏参数排查参数影响。
重新进行如下步骤测试:
1、启动备库一节点
2、启动备库二节点,查看数据库状态
3、查询切换过程中出现的等待事件。发现,设置该参数后,备库能正常打开,且未出现ORA-7445问题。重复测试基本能匹配到Oracle MOS官方文档IF: Undo Wait Event - Enq: US - contention(Doc ID 1951680.1)。
比对主备两端补丁号信息,主备两端补丁版本不同。
四、解决办法和建议
根本原因还在Oracle软件自身软件BUG引起,命中ORA-7445报错。当前可通过设置下列隐含参数,提前分配undo段,规避ORA-7445报错。搭建DG环境时,强烈建议主备两端环境要保持一致,包括数据库软件版本和补丁(PSU)版本以及参数(含隐含参数)。
ALTER SYSTEM SET "_rollback_segment_count"=12000 sid='*';