Oracle 11g RAC: Analysis of an Unexpected Node Restart

Date: 2024-10-08 07:10:34

1. Background

During a routine inspection over the National Day holiday, we found unexpected restart messages in the database alert log and immediately began analyzing the error.

2. Troubleshooting Process

(1) Database alert log analysis

node1 alert:
Sat Oct 05 13:05:14 2024
Thread 1 advanced to log sequence 6981 (LGWR switch)
  Current log# 11 seq# 6981 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Sat Oct 05 13:05:15 2024
Archived Log entry 12130 added for thread 1 sequence 6980 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:48 2024
Reconfiguration started (old inc 27, new inc 29)
List of instances:
 1 (myinst: 1) 
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE 
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:50:48 2024
Sat Oct 05 14:50:48 2024
 LMS 3: 1 GCS shadows cancelled, 1 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 1: 1 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:50:48 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Sat Oct 05 14:50:48 2024
Instance recovery: looking for dead threads
Beginning instance recovery of 1 threads
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Reconfiguration complete
 parallel recovery started with 32 processes
Started redo scan
Completed redo scan
 read 76 KB redo, 16 data blocks need recovery
Started redo application at
 Thread 2: logseq 5168, block 86069
Sat Oct 05 14:50:53 2024
Setting Resource Manager plan SCHEDULER[0x32DE]:DEFAULT_MAINTENANCE_PLAN via scheduler window
Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter
Recovery of Online Redo Log: Thread 2 Group 10 Seq 5168 Reading mem 0
  Mem# 0: +DATA/ybqddb/onlinelog/group_10.300.1144593255
Completed redo application of 0.01MB
Completed instance recovery at
 Thread 2: logseq 5168, block 86222, scn 245107035
 16 data blocks read, 16 data blocks written, 76 redo k-bytes read
Sat Oct 05 14:50:53 2024
minact-scn: master found reconf/inst-rec before recscn scan old-inc#:29 new-inc#:29
Thread 2 advanced to log sequence 5169 (thread recovery)
Redo thread 2 internally disabled at seq 5169 (SMON)
Sat Oct 05 14:50:54 2024
Archived Log entry 12131 added for thread 2 sequence 5168 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:54 2024
ARC2: Archiving disabled thread 2 sequence 5169
Archived Log entry 12132 added for thread 2 sequence 5169 ID 0x8d497377 dest 1:
minact-scn: master continuing after IR
minact-scn: Master considers inst:2 dead
Sat Oct 05 14:51:49 2024
Decreasing number of real time LMS from 4 to 0
Sat Oct 05 14:52:11 2024
Reconfiguration started (old inc 29, new inc 31)
List of instances:
 1 2 (myinst: 1) 
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Sat Oct 05 14:52:11 2024
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Sat Oct 05 14:52:11 2024
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info 
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
Sat Oct 05 14:52:11 2024
minact-scn: Master returning as live inst:2 has inc# mismatch instinc:0 cur:31 errcnt:0
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Sat Oct 05 14:53:26 2024
Increasing number of real time LMS from 0 to 4
Sat Oct 05 17:05:20 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 17:05:21 2024
Thread 1 advanced to log sequence 6982 (LGWR switch)
  Current log# 5 seq# 6982 mem# 0: +DATA/ybqddb/onlinelog/group_5.290.1144593225
Sat Oct 05 17:05:21 2024
Archived Log entry 12134 added for thread 1 sequence 6981 ID 0x8d497377 dest 1:
Sat Oct 05 21:05:22 2024
ALTER SYSTEM ARCHIVE LOG
Sat Oct 05 21:05:22 2024
Thread 1 advanced to log sequence 6983 (LGWR switch)
  Current log# 7 seq# 6983 mem# 0: +DATA/ybqddb/onlinelog/group_7.294.1144593235
Sat Oct 05 21:05:23 2024
Archived Log entry 12135 added for thread 1 sequence 6982 ID 0x8d497377 dest 1:
Sun Oct 06 01:08:47 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 01:08:49 2024
Thread 1 advanced to log sequence 6984 (LGWR switch)
  Current log# 9 seq# 6984 mem# 0: +DATA/ybqddb/onlinelog/group_9.298.1144593249
Sun Oct 06 01:08:49 2024
Archived Log entry 12138 added for thread 1 sequence 6983 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:18 2024
ALTER SYSTEM ARCHIVE LOG
Sun Oct 06 05:05:18 2024
Thread 1 advanced to log sequence 6985 (LGWR switch)
  Current log# 11 seq# 6985 mem# 0: +DATA/ybqddb/onlinelog/group_11.302.1144593261
Archived Log entry 12139 added for thread 2 sequence 5173 ID 0x8d497377 dest 1:
Sun Oct 06 05:05:19 2024

node2 alert:

Sat Oct 05 13:05:14 2024
Archived Log entry 12129 added for thread 2 sequence 5167 ID 0x8d497377 dest 1:
Sat Oct 05 14:50:47 2024
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
Errors in file /u01/app/oracle/diag/rdbms/ybqddb/ybqddb2/trace/ybqddb2_asmb_15097.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID: 
Session ID: 2109 Serial number: 3
ASMB (ospid: 15097): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 15097
Sat Oct 05 14:51:59 2024
Starting ORACLE instance (normal)
************************ Large Pages Information *******************
Per process system memlock (soft) limit = UNLIMITED
 
Total Shared Global Region in Large Pages = 0 KB (0%)
 
Large Pages used by this instance: 0 (0 KB)
Large Pages unused system wide = 0 (0 KB)
Large Pages configured system wide = 0 (0 KB)
Large Page size = 2048 KB
 
The node 2 log shows a communication failure with the ASM instance at 14:50:47 (ORA-15064/ORA-03113), after which ASMB terminated the instance; the instance on that node began restarting at 14:51:59.
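A quick way to triage an alert-log excerpt like the one above is to pull out the ORA- errors and instance-termination events programmatically. A minimal Python sketch; the `scan_alert` helper and the inline excerpt are illustrative, not part of any Oracle tooling:

```python
import re

# Hypothetical helper: collect distinct ORA- error codes and any
# instance-termination lines from a chunk of alert-log text.
def scan_alert(text):
    errors = sorted(set(re.findall(r"ORA-\d{5}", text)))
    terminated = [line for line in text.splitlines()
                  if "terminating the instance" in line
                  or "Instance terminated" in line]
    return errors, terminated

# Sample lines taken from the node 2 alert log above.
excerpt = """\
NOTE: ASMB terminating
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
ASMB (ospid: 15097): terminating the instance due to error 15064
Instance terminated by ASMB, pid = 15097
"""

errors, terminated = scan_alert(excerpt)
print(errors)           # ['ORA-03113', 'ORA-15064']
print(len(terminated))  # 2
```

In practice the same scan would run over the full `alert_<SID>.log` under the diag trace directory, narrowing the search to the incident window.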

(2) CRS and ocssd logs

node1 CRS log:

2024-07-26 03:02:59.119: 
[client(240419)]CRS-10051:CVU found following errors with Clusterware setup : PRVF-4657 : Name resolution setup check for "nxybj-scan" (IP address: 172.31.122.38) failedPRVF-4664 : Found inconsistent name resolution entries for SCAN name "nxybj-scan"

2024-10-05 14:50:31.281: 
[cssd(10595)]CRS-1612:Network communication with node nxybj2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.200 seconds
2024-10-05 14:50:38.282: 
[cssd(10595)]CRS-1611:Network communication with node nxybj2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 7.200 seconds
2024-10-05 14:50:43.283: 
[cssd(10595)]CRS-1610:Network communication with node nxybj2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.200 seconds
2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down
2024-10-05 14:50:47.999: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 .
2024-10-05 14:50:48.000: 
[crsd(14356)]CRS-5504:Node down event reported for node 'nxybj2'.
2024-10-05 14:50:51.248: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'Generic'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.orcl'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybqddb'.
2024-10-05 14:50:51.249: 
[crsd(14356)]CRS-2773:Server 'nxybj2' has been removed from pool 'ora.ybzxddb'.
2024-10-05 14:51:13.084: 
[cssd(10595)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:42.117: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'Generic'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybzxddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.ybqddb'.
2024-10-05 14:51:42.118: 
[crsd(14356)]CRS-2772:Server 'nxybj2' has been assigned to pool 'ora.orcl'.

node1 ocssd log:
2024-10-05 14:50:28.553: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) at 50% heartbeat fatal, removal in 14.200 seconds
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: node nxybj2 (2) is impending reconfig, flag 2229260, misstime 15800
2024-10-05 14:50:31.281: [    CSSD][2948028160]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693547, LATS 2409222280, lastSeqNo 108691578, uniqueness 1708519428, timestamp 1728111030/2407806020
2024-10-05 14:50:31.282: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693549, LATS 2409222280, lastSeqNo 108693378, uniqueness 1708519428, timestamp 1728111030/2407806140
2024-10-05 14:50:31.691: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222690
2024-10-05 14:50:31.781: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222780
2024-10-05 14:50:31.784: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693551, LATS 2409222790, lastSeqNo 108691579, uniqueness 1708519428, timestamp 1728111031/2407807110
2024-10-05 14:50:31.811: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2409222810
2024-10-05 14:50:32.193: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223200
2024-10-05 14:50:32.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693553, LATS 2409223280, lastSeqNo 108693547, uniqueness 1708519428, timestamp 1728111032/2407807530
2024-10-05 14:50:32.291: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223290
2024-10-05 14:50:32.321: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223320
2024-10-05 14:50:32.701: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223700
2024-10-05 14:50:32.783: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693556, LATS 2409223790, lastSeqNo 108693553, uniqueness 1708519428, timestamp 1728111032/2407808040
2024-10-05 14:50:32.793: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223790
2024-10-05 14:50:32.823: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2409223820
2024-10-05 14:50:33.203: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224200
2024-10-05 14:50:33.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693559, LATS 2409224280, lastSeqNo 108693556, uniqueness 1708519428, timestamp 1728111033/2407808540
2024-10-05 14:50:33.301: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224300
2024-10-05 14:50:33.331: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224330
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:33.554: [    CSSD][2946451200]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:33.711: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224710
2024-10-05 14:50:33.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693562, LATS 2409224790, lastSeqNo 108693559, uniqueness 1708519428, timestamp 1728111033/2407809050
2024-10-05 14:50:33.803: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224800
2024-10-05 14:50:33.833: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2409224840
2024-10-05 14:50:34.213: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225210
2024-10-05 14:50:34.282: [    CSSD][3169081088]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693563, LATS 2409225280, lastSeqNo 108693551, uniqueness 1708519428, timestamp 1728111033/2407809130
2024-10-05 14:50:34.282: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693565, LATS 2409225280, lastSeqNo 108693562, uniqueness 1708519428, timestamp 1728111034/2407809550
2024-10-05 14:50:34.311: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225310
2024-10-05 14:50:34.341: [    CSSD][3167487744]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225340
2024-10-05 14:50:34.721: [    CSSD][3162732288]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225720
2024-10-05 14:50:34.784: [    CSSD][3164333824]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693567, LATS 2409225790, lastSeqNo 108693549, uniqueness 1708519428, timestamp 1728111034/2407809680
2024-10-05 14:50:34.784: [    CSSD][3159578368]clssnmvDHBValidateNcopy: node 2, nxybj2, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108693568, LATS 2409225790, lastSeqNo 108693565, uniqueness 1708519428, timestamp 1728111034/2407810050
2024-10-05 14:50:34.813: [    CSSD][3172316928]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2409225810
 

node2 CRS log:
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-03 05:57:32.243: 
[ctssd(11996)]CRS-2409:The clock on host nxybj2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2024-10-05 14:50:31.036: 
[cssd(10538)]CRS-1612:Network communication with node nxybj1 (1) missing for 50% of timeout interval.  Removal of this node from cluster in 14.520 seconds
2024-10-05 14:50:39.038: 
[cssd(10538)]CRS-1611:Network communication with node nxybj1 (1) missing for 75% of timeout interval.  Removal of this node from cluster in 6.520 seconds
2024-10-05 14:50:43.039: 
[cssd(10538)]CRS-1610:Network communication with node nxybj1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.520 seconds
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.560: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:45.577: 
[cssd(10538)]CRS-1652:Starting clean up of CRSD resources.
2024-10-05 14:50:46.038: 
[cssd(10538)]CRS-1608:This node was evicted by node 1, nxybj1; details at (:CSSNM00005:) in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:50:46.803: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.404: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:47.405: 
[cssd(10538)]CRS-1654:Clean up of CRSD resources finished successfully.
2024-10-05 14:50:47.406: 
[cssd(10538)]CRS-1655:CSSD on node nxybj2 detected a problem and started to shutdown.
2024-10-05 14:50:47.434: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14698)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:28777} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2024-10-05 14:50:47.436: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(14456)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:4} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/oraagent_grid/oraagent_grid.log.
2024-10-05 14:50:47.437: 
[/u01/app/11.2.0/grid/bin/orarootagent.bin(14460)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:2522} in /u01/app/11.2.0/grid/log/nxybj2/agent/crsd/orarootagent_root/orarootagent_root.log.
2024-10-05 14:50:47.691: 
[cssd(10538)]CRS-1660:The CSS daemon shutdown has completed
2024-10-05 14:50:47.846: 
[ohasd(10322)]CRS-2765:Resource 'ora.evmd' has failed on server 'nxybj2'.
2024-10-05 14:50:47.848: 
[ohasd(10322)]CRS-2765:Resource 'ora.ctssd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.452: 
[crsd(345044)]CRS-0805:Cluster Ready Service aborted due to failure to communicate with Cluster Synchronization Service with error [3]. Details at (:CRSD00109:) in /u01/app/11.2.0/grid/log/nxybj2/crsd/crsd.log.
2024-10-05 14:50:48.598: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssdmonitor' has failed on server 'nxybj2'.
2024-10-05 14:50:48.603: 
[ohasd(10322)]CRS-2765:Resource 'ora.crsd' has failed on server 'nxybj2'.
2024-10-05 14:50:48.861: 
[ctssd(348486)]CRS-2402:The Cluster Time Synchronization Service aborted on host nxybj2. Details at (:ctss_css_init1:) in /u01/app/11.2.0/grid/log/nxybj2/ctssd/octssd.log.
2024-10-05 14:50:49.102: 
[ohasd(10322)]CRS-2765:Resource 'ora.cssd' has failed on server 'nxybj2'.
2024-10-05 14:50:49.105: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.271: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:50:49.272: 
[ohasd(10322)]CRS-2765:Resource 'ora.asm' has failed on server 'nxybj2'.
2024-10-05 14:50:51.030: 
[cssd(349990)]CRS-1713:CSSD daemon is started in clustered mode
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.ctssd'
2024-10-05 14:50:52.863: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.ctssd'.
2024-10-05 14:51:06.688: 
[cssd(349990)]CRS-1707:Lease acquisition for node nxybj2 number 2 completed
2024-10-05 14:51:07.344: 
[ohasd(10322)]CRS-2765:Resource 'ora.cluster_interconnect.haip' has failed on server 'nxybj2'.
2024-10-05 14:51:07.954: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr02; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.961: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr03; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:07.968: 
[cssd(349990)]CRS-1605:CSSD voting file is online: /dev/asmdisk/ocr01; details in /u01/app/11.2.0/grid/log/nxybj2/cssd/ocssd.log.
2024-10-05 14:51:13.083: 
[cssd(349990)]CRS-1601:CSSD Reconfiguration complete. Active nodes are nxybj1 nxybj2 .
2024-10-05 14:51:15.026: 
[ctssd(351425)]CRS-2403:The Cluster Time Synchronization Service on host nxybj2 is in observer mode.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2407:The new Cluster Time Synchronization Service reference node is host nxybj1.
2024-10-05 14:51:15.261: 
[ctssd(351425)]CRS-2401:The Cluster Time Synchronization Service started on host nxybj2.
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2024-10-05 14:51:17.018: 
[ohasd(10322)]CRS-2769:Unable to failover resource 'ora.diskmon'.
2024-10-05 14:51:22.920: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.cluster_interconnect.haip'
2024-10-05 14:51:22.920: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:23.116: 
[/u01/app/11.2.0/grid/bin/oraagent.bin(10448)]CRS-5011:Check of resource "+ASM" failed: details at "(:CLSN00006:)" in "/u01/app/11.2.0/grid/log/nxybj2/agent/ohasd/oraagent_grid/oraagent_grid.log"
2024-10-05 14:51:30.099: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.asm'
2024-10-05 14:51:31.109: 
[ohasd(10322)]CRS-2878:Failed to restart resource 'ora.crsd'
2024-10-05 14:51:36.711: 
[crsd(353461)]CRS-1012:The OCR service started on node nxybj2.
2024-10-05 14:51:36.717: 
[evmd(348478)]CRS-1401:EVMD started on node nxybj2.
2024-10-05 14:51:37.724: 
[crsd(353461)]CRS-1201:CRSD started on node nxybj2.
2024-10-05 14:51:39.350: 

node2 ocssd log:
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) at 50% heartbeat fatal, removal in 14.520 seconds
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: node nxybj1 (1) is impending reconfig, flag 2491406, misstime 15480
2024-10-05 14:50:31.036: [    CSSD][1568802560]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2024-10-05 14:50:31.037: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592072, LATS 2407806370, lastSeqNo 108585015, uniqueness 1708518327, timestamp 1728111030/2409221770
2024-10-05 14:50:31.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592073, LATS 2407806370, lastSeqNo 108589502, uniqueness 1708518327, timestamp 1728111030/2409221810
2024-10-05 14:50:31.687: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807020
2024-10-05 14:50:31.777: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807110
2024-10-05 14:50:31.817: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111031/2407807150
2024-10-05 14:50:32.037: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592076, LATS 2407807370, lastSeqNo 108592073, uniqueness 1708518327, timestamp 1728111031/2409222810
2024-10-05 14:50:32.197: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807530
2024-10-05 14:50:32.287: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807620
2024-10-05 14:50:32.327: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407807660
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sending status msg to all nodes
2024-10-05 14:50:32.480: [    CSSD][1567225600]clssnmSendingThread: sent 5 status msgs to all nodes
2024-10-05 14:50:32.539: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592079, LATS 2407807880, lastSeqNo 108592076, uniqueness 1708518327, timestamp 1728111032/2409223320
2024-10-05 14:50:32.698: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808040
2024-10-05 14:50:32.789: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808130
2024-10-05 14:50:32.828: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111032/2407808170
2024-10-05 14:50:33.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592080, LATS 2407808370, lastSeqNo 108591299, uniqueness 1708518327, timestamp 1728111032/2409223700
2024-10-05 14:50:33.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592082, LATS 2407808370, lastSeqNo 108592079, uniqueness 1708518327, timestamp 1728111032/2409223820
2024-10-05 14:50:33.207: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808540
2024-10-05 14:50:33.297: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808630
2024-10-05 14:50:33.337: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407808670
2024-10-05 14:50:33.540: [    CSSD][1584641792]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592084, LATS 2407808880, lastSeqNo 108592072, uniqueness 1708518327, timestamp 1728111033/2409224300
2024-10-05 14:50:33.546: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592085, LATS 2407808880, lastSeqNo 108592082, uniqueness 1708518327, timestamp 1728111033/2409224330
2024-10-05 14:50:33.709: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809050
2024-10-05 14:50:33.799: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809130
2024-10-05 14:50:33.838: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111033/2407809170
2024-10-05 14:50:34.038: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592086, LATS 2407809370, lastSeqNo 108592080, uniqueness 1708518327, timestamp 1728111033/2409224710
2024-10-05 14:50:34.038: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592088, LATS 2407809370, lastSeqNo 108592085, uniqueness 1708518327, timestamp 1728111033/2409224840
2024-10-05 14:50:34.217: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809550
2024-10-05 14:50:34.307: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809640
2024-10-05 14:50:34.347: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407809680
2024-10-05 14:50:34.540: [    CSSD][1579894528]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592089, LATS 2407809880, lastSeqNo 108592086, uniqueness 1708518327, timestamp 1728111034/2409225210
2024-10-05 14:50:34.540: [    CSSD][1575139072]clssnmvDHBValidateNcopy: node 1, nxybj1, has a disk HB, but no network HB, DHB has rcfg 587229053, wrtcnt, 108592091, LATS 2407809880, lastSeqNo 108592088, uniqueness 1708518327, timestamp 1728111034/2409225340
2024-10-05 14:50:34.718: [    CSSD][1583048448]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810050
2024-10-05 14:50:34.809: [    CSSD][1587885824]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810150
2024-10-05 14:50:34.849: [    CSSD][1578292992]clssnmvDiskPing: Writing with status 0x3, timestamp 1728111034/2407810180                                                                                                 

The CRS logs show that internode network-heartbeat failures began at 14:50:31 (the ocssd logs on both nodes report a disk heartbeat but no network heartbeat), after which node 2 was evicted and restarted.
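The countdown in the CRS-1612/1611/1610 messages follows directly from the CSS misscount (30 s by default on 11.2; confirm on a real cluster with `crsctl get css misscount`). A small sketch reproducing the numbers seen in the log, assuming that default:

```python
# CSS network-heartbeat timeout (11.2 default misscount, assumed here).
MISSCOUNT_MS = 30_000

def removal_in_ms(misstime_ms):
    """Time left before node removal, given how long heartbeats have been missed."""
    return MISSCOUNT_MS - misstime_ms

# node1 logged: "at 50% heartbeat fatal, removal in 14.200 seconds" with
# "misstime 15800" -- i.e. 30000 - 15800 = 14200 ms remaining.
print(removal_in_ms(15_800))  # 14200

# The warning thresholds (50%, 75%, 90% of misscount) behind the
# CRS-1612/1611/1610 messages:
thresholds = {pct: MISSCOUNT_MS * pct // 100 for pct in (50, 75, 90)}
print(thresholds)  # {50: 15000, 75: 22500, 90: 27000}
```

This also explains the ocssd line "local diskTimeout set to 27000 ms": once a reconfiguration is impending, the disk timeout is shortened to the 90% mark.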

(3) Node 2 system log

Node 2 system log:
[root@nxybj2 log]# more messages-20241006
Sep 29 03:50:01 nxybj2 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="7519" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Sep 29 03:50:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Sep 30 03:28:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  1 04:16:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  2 03:24:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  3 03:23:03 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  4 03:27:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 03:34:02 nxybj2 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-
to-date.
Oct  5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1, 172.31.122.37#123, interface stats: received=0, sent=0, dropped=0, active_time=6191372 secs
Oct  5 14:51:15 nxybj2 ntpd[394054]: Deleting interface #6 bond1:1, 169.254.209.169#123, interface stats: received=0, sent=0, dropped=0, active_time=6191401 secs
Oct  5 14:51:23 nxybj2 ntpd[394054]: Listen normally on 17 bond1:1 169.254.209.169 UDP 123
Oct  5 14:51:46 nxybj2 ntpd[394054]: Listen normally on 18 bond0:1 172.31.122.37 UDP 123

The system log shows no abnormal errors.
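Even without errors, the OS log corroborates the timeline: ntpd drops the bond0:1 VIP interface at 14:50:46, one second after CSS declared the eviction at 14:50:45. A small illustrative sketch of that timestamp correlation (the year is assumed, since syslog omits it):

```python
from datetime import datetime

# Parse the leading "Mon D HH:MM:SS" stamp of a syslog line.
# Syslog lines carry no year, so we pin it to 2024 for this incident.
def syslog_time(line, year=2024):
    stamp = " ".join(line.split()[:3])  # e.g. "Oct 5 14:50:46"
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

line = "Oct  5 14:50:46 nxybj2 ntpd[394054]: Deleting interface #4 bond0:1, ..."
eviction = datetime(2024, 10, 5, 14, 50, 45)  # CRS-1607 timestamp on node 1

delta = (syslog_time(line) - eviction).total_seconds()
print(delta)  # 1.0
```

Since nothing in the window suggests an OS crash or reboot, the restart was confined to the clusterware stack and instance, consistent with a CSS eviction rather than a server failure.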

3. Root Cause Analysis

The logs show that abnormal heartbeat communication between the nodes led to the eviction and restart of node 2. The key entries are:

2024-10-05 14:50:45.484: 
[cssd(10595)]CRS-1607:Node nxybj2 is being evicted in cluster incarnation 587229053; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/nxybj1/cssd/ocssd.log.
2024-10-05 14:50:47.994: 
[cssd(10595)]CRS-1625:Node nxybj2, number 2, was manually shut down

Searching My Oracle Support (MOS) for this error and analyzing the results shows that it corresponds to a known bug:

The affected version is 11.2.0.3, while the current production environment runs 11.2.0.4; on this basis it was concluded that the node 2 restart was caused by this bug.

Further research shows that the bug is fixed in the following versions:

According to these results, fixing the bug requires applying the patch for the corresponding version.