Environment: AIX 6.1 + Oracle 11.2.0.4 RAC (2 nodes)
1. Symptoms
Running crsctl to check cluster resource status fails immediately on either node with CRS-4535 and CRS-4000, yet the database itself remains fully accessible.
The symptoms in detail:
# Query from node 1
grid@bjdb1:/home/grid>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# Query from node 2
root@bjdb2:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
Likewise, crs_stat -t fails, with error code CRS-0184:
root@bjdb1:/>crs_stat -t
CRS-0184: Cannot communicate with the CRS daemon.
Node 2 reports the same error.
Meanwhile, the database can still be accessed normally, as shown below:
# From node 2, simulate a client login to the RAC cluster through the SCAN IP; the connection succeeds
oracle@bjdb2:/home/oracle>sqlplus jingyu/jingyu@192.168.103.31/bjdb
SQL*Plus: Release 11.2.0.4.0 Production on Mon Oct 10 14:24:47 2016
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
SQL>
The relevant entry in /etc/hosts on the RAC nodes:
#scan
192.168.103.31 scan-ip
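A quick way to confirm that the SCAN entry above is actually picked up is to look it up in the hosts file; a minimal sketch, where the file and host name default to the article's entry:

```shell
# Sketch: print the address /etc/hosts maps the SCAN name to.
# HOSTS_FILE and SCAN_NAME defaults follow the entry shown above.
HOSTS_FILE=${HOSTS_FILE:-/etc/hosts}
SCAN_NAME=${SCAN_NAME:-scan-ip}
# Skip comment lines; match the name in the second column, print the address
awk -v h="$SCAN_NAME" '$0 !~ /^[[:space:]]*#/ && $2 == h {print $1}' "$HOSTS_FILE"
```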
2. Diagnosis
First, examine the cluster logs on node 1.
The Clusterware (GI) logs live under $GRID_HOME/log/<nodename>;
the logs of the key GI background daemons css, crs, and evm are kept in the cssd, crsd, and evmd subdirectories respectively.
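The layout above can be sketched as a small shell helper. GRID_HOME matches this article's install path; deriving the GI node name from hostname is an assumption (it must match the name GI uses):

```shell
# Sketch only: GRID_HOME follows this article's install path; the node-name
# derivation from hostname is an assumption.
GRID_HOME=${GRID_HOME:-/opt/u01/app/11.2.0/grid}
NODE=$(hostname | cut -d. -f1)
LOGBASE="$GRID_HOME/log/$NODE"

echo "GI alert log expected under: $LOGBASE (alert*.log)"
# Per-daemon log directories for the key background processes
for d in cssd crsd evmd; do
    if [ -d "$LOGBASE/$d" ]; then
        ls "$LOGBASE/$d"
    else
        echo "daemon log dir (not present on this host): $LOGBASE/$d"
    fi
done
```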
Check the relevant logs on node 1:
# The most recent entries in the GI alert log only warn that the GI storage location is filling up. There is still free space left and it can be cleaned up later, so this is clearly not the cause of the failure.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 14:18:26.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:23:31.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:28:36.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:33:41.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 14:38:46.125:
[crflogd(39190674)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
# Since crsctl is unusable, check the crsd log next. It shows errors starting on October 3: the raw device could not be opened, so the OCR could not be initialized. The messages indicate that access to the shared storage failed at that time. Suspecting a storage problem, we need to confirm with on-site staff whether any storage work was done around that point.
root@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1/crsd>tail -f crsd.log
2016-10-03 18:04:40.248: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:40.248: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:40.252: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:40.253: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:40.253: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:40.253: [ CRSD][1] Done.
Check the relevant logs on node 2:
# The GI alert log on node 2 shows CRS-2409 messages from ctssd. Per MOS note 1135337.1, "This is not an error. ctssd is reporting that there is a time difference and it is not doing anything about it as it is running in observer mode." It only calls for checking whether the two nodes' clocks agree, and in fact they were found to be consistent:
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2>tail -f alert*.log
2016-10-10 12:29:22.145:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 12:59:38.799:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 13:34:11.402:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:12:44.168:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2016-10-10 14:44:04.824:
[ctssd(5243030)]CRS-2409:The clock on host bjdb2 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
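To rule CRS-2409 out, the two nodes' clocks can be compared directly. A minimal sketch, assuming passwordless ssh between the nodes and a date command that supports %s (on platforms where date lacks %s, `perl -e 'print time'` can stand in):

```shell
# Sketch: print the absolute clock skew, in seconds, between two hosts.
# Assumes passwordless ssh and date +%s support on both nodes.
clock_skew() {
    t1=$(ssh "$1" 'date +%s') || return 1
    t2=$(ssh "$2" 'date +%s') || return 1
    echo $(( t1 > t2 ? t1 - t2 : t2 - t1 ))
}
# Example: clock_skew bjdb1 bjdb2    # a large value would explain CRS-2409
```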
# The crsd log on node 2 shows that, at almost the same time as on node 1, access to the shared storage failed and the OCR could not be initialized
root@bjdb2:/opt/u01/app/11.2.0/grid/log/bjdb2/crsd>tail -f crsd.log
2016-10-03 18:04:31.077: [ OCRRAW][1]proprinit: Could not open raw device
2016-10-03 18:04:31.077: [ OCRASM][1]proprasmcl: asmhandle is NULL
2016-10-03 18:04:31.081: [ OCRAPI][1]a_init:16!: Backend init unsuccessful : [26]
2016-10-03 18:04:31.081: [ CRSOCR][1] OCR context init failure. Error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1] Created alert : (:CRSD00111:) : Could not init OCR, error: PROC-26: Error while accessing the physical storage
2016-10-03 18:04:31.082: [ CRSD][1][PANIC] CRSD exiting: Could not init OCR, code: 26
2016-10-03 18:04:31.082: [ CRSD][1] Done.
Now log in as the grid user and check the state of the ASM disk groups:
sqlplus / as sysasm, then query v$asm_diskgroup directly:
select name, state, total_mb, free_mb from v$asm_diskgroup;
The OCR_VOTE1 disk group turns out to be dismounted on both ASM instances:
SQL> select instance_name from v$instance;
INSTANCE_NAME
------------------------------------------------
+ASM2
SQL> select name, state, total_mb, free_mb from v$asm_diskgroup;
NAME STATE TOTAL_MB FREE_MB
------------------------------ --------------------------------- ---------- ----------
DATA MOUNTED 737280 88152
FRA_ARCHIVE MOUNTED 10240 9287
OCR_VOTE1 DISMOUNTED 0 0
The other node shows the same.
On node 1, confirming the OCR disk group state before mounting it:
SQL> select name, state from v$asm_diskgroup;
NAME STATE
------------------------------ ---------------------------------
DATA MOUNTED
FRA_ARCHIVE MOUNTED
OCR_VOTE1 DISMOUNTED
Next, confirm the state of the core GI background daemons:
# The crsd process is not running; this query returns nothing
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
Likewise, the crsd process is not running on node 2.
To summarize: the failure occurred around 18:04 on October 3, when both nodes lost access to the shared storage and OCR initialization consequently failed. At present the crsd daemon is not running on either node.
3. Resolution
3.1 Mount the OCR disk group manually
SQL> alter diskgroup ocr_vote1 mount;
Diskgroup altered.
SQL> select name, state from v$asm_diskgroup;
NAME STATE
------------------------------ ---------------------------------
DATA MOUNTED
FRA_ARCHIVE MOUNTED
OCR_VOTE1 MOUNTED
3.2 Restart CRS on node 1
At this point the crsd process is still not running:
# Confirm that crsd.bin is not running
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
A normal attempt to start CRS on node 1 fails:
root@bjdb1:/>crsctl start crs
CRS-4640: Oracle High Availability Services is already active
CRS-4000: Command Start failed, or completed with errors.
A normal attempt to stop CRS on node 1 also fails:
root@bjdb1:/>crsctl stop crs
CRS-2796: The command may not proceed when Cluster Ready Services is not running
CRS-4687: Shutdown command has completed with errors.
CRS-4000: Command Stop failed, or completed with errors.
So what next?
The solution was to force-stop CRS on node 1 and then start it again, which succeeded.
# Force-stop CRS on node 1
root@bjdb1:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb1'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb1'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb1'
CRS-2677: Stop of 'ora.evmd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb1' succeeded
CRS-5017: The resource action "ora.crf stop" encountered the following error:
action for daemon aborted. For details refer to "(:CLSN00108:)" in "/opt/u01/app/11.2.0/grid/log/bjdb1/agent/ohasd/orarootagent_root/orarootagent_root.log".
CRS-2675: Stop of 'ora.crf' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.crf' on 'bjdb1'
CRS-2681: Clean of 'ora.crf' on 'bjdb1' succeeded
CRS-2675: Stop of 'ora.asm' on 'bjdb1' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb1'
CRS-2681: Clean of 'ora.asm' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb1'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb1'
CRS-2677: Stop of 'ora.cssd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb1'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb1' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb1'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb1' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb1' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb1' has completed
CRS-4133: Oracle High Availability Services has been stopped.
root@bjdb1:/>
# Check the crsctl resource status; as expected, CRS is not reachable now
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# Check for crsd.bin, cssd.bin, and evmd.bin; none of these processes remain
root@bjdb1:/>ps -ef|grep crsd.bin
root 8126466 25428158 0 15:52:50 pts/0 0:00 grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
root 8126470 25428158 0 15:53:01 pts/0 0:00 grep cssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
root 35520600 25428158 0 15:53:13 pts/0 0:00 grep evmd.bin
# Check for pmon processes; they are gone as well
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root@bjdb1:/>
# Try starting CRS again; this time it succeeds
root@bjdb1:/>crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
# Querying the crsctl resources still fails, which means the stack has not fully come up yet
root@bjdb1:/>crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
# After waiting a while, check the core GI background daemons
root@bjdb1:/>ps -ef|grep crsd.bin|grep -v grep
root@bjdb1:/>ps -ef|grep cssd.bin|grep -v grep
grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep pmon
root 39387390 25428158 0 15:57:23 pts/0 0:00 grep pmon
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
root@bjdb1:/>ps -ef|grep pmon|grep -v grep
root 37814470 25428158 0 15:59:27 pts/0 0:00 grep pmon
grid 39911466 1 0 15:58:47 - 0:00 asm_pmon_+ASM2
root@bjdb1:/>
root@bjdb1:/>ps -ef|grep crsd.bin
root@bjdb1:/>ps -ef|grep cssd.bin
grid 10747994 26542192 0 15:55:03 - 0:00 /opt/u01/app/11.2.0/grid/bin/ocssd.bin
root@bjdb1:/>ps -ef|grep evmd.bin
grid 40173760 1 0 15:57:10 - 0:00 /opt/u01/app/11.2.0/grid/bin/evmd.bin
root@bjdb1:/>ps -ef|grep crsd.bin
root 37683238 1 0 15:59:54 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
root@bjdb1:/>
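Rather than re-running ps by hand as above, a small polling helper can wait for each daemon to appear. A sketch only; the daemon names come from this article, while the interval and timeout values are illustrative:

```shell
# Sketch: wait until a named process shows up in ps output, or give up
# after a timeout. Default interval/timeout values are illustrative.
wait_for_daemon() {
    name=$1; timeout=${2:-600}; interval=${3:-10}; waited=0
    while ! ps -ef | grep "$name" | grep -v grep >/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $name" >&2
            return 1
        fi
        sleep "$interval"
        waited=$(( waited + interval ))
    done
    echo "$name is running"
}
# Example: wait_for_daemon cssd.bin 600 && wait_for_daemon crsd.bin 900
```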
# Once the core daemons are all up, crsctl can query the resource status again; the resources are in the process of starting
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ora.asm
ONLINE ONLINE bjdb1 Started
ora.gsd
OFFLINE OFFLINE bjdb1
ora.net1.network
ONLINE ONLINE bjdb1
ora.ons
ONLINE ONLINE bjdb1
ora.registry.acfs
ONLINE ONLINE bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE OFFLINE
ora.bjdb.db
1 ONLINE OFFLINE
2 ONLINE OFFLINE
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE OFFLINE STARTING
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE OFFLINE STARTING
After waiting a while longer, another query shows that all resources on node 1 are back to normal.
root@bjdb1:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ora.asm
ONLINE ONLINE bjdb1 Started
ora.gsd
OFFLINE OFFLINE bjdb1
ora.net1.network
ONLINE ONLINE bjdb1
ora.ons
ONLINE ONLINE bjdb1
ora.registry.acfs
ONLINE ONLINE bjdb1
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE OFFLINE
ora.bjdb.db
1 ONLINE ONLINE bjdb1 Open
2 ONLINE OFFLINE
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE OFFLINE
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE OFFLINE
3.3 Fix the high storage usage of GI on node 1
Keep watching the node 1 log:
grid@bjdb1:/opt/u01/app/11.2.0/grid/log/bjdb1>tail -f alert*.log
2016-10-10 16:03:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:08:30.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:09:50.796:
[ctssd(5046446)]CRS-2407:The new Cluster Time Synchronization Service reference node is host bjdb1.
2016-10-10 16:10:20.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
2016-10-10 16:15:25.373:
[crflogd(39780590)]CRS-9520:The storage of Grid Infrastructure Management Repository is 93% full. The storage location is '/opt/u01/app/11.2.0/grid/crf/db/bjdb1'.
This warning was seen earlier as well; it simply means the /opt/u01 directory needs cleaning. Find and delete some old historical logs, and the message will stop appearing.
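A hedged sketch of that cleanup: list rotated GI log files older than a retention window before deleting them. The path follows this article's GRID_HOME; the *.l[0-9][0-9] rotation pattern and the 30-day window are assumptions to adapt:

```shell
# Sketch: find rotated GI log files (e.g. crsd.l01) older than 30 days.
# The name pattern and retention window are assumptions; review the listing
# before deleting anything.
GRID_HOME=${GRID_HOME:-/opt/u01/app/11.2.0/grid}
if [ -d "$GRID_HOME/log" ]; then
    find "$GRID_HOME/log" -name '*.l[0-9][0-9]' -mtime +30 -exec ls -l {} \;
    # After review, append: -exec rm -f {} \;  to delete the listed files
fi
```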
3.4 Mount the OCR disk group and restart CRS on node 2
With node 1 fixed, mount the OCR disk group on node 2 in the same way and restart CRS.
The steps are identical, just performed on node 2, so they are not repeated in full here.
# Force-stop CRS on node 2
root@bjdb2:/>crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'bjdb2'
CRS-2673: Attempting to stop 'ora.mdnsd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.crf' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.ctssd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.evmd' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.asm' on 'bjdb2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'bjdb2'
CRS-2677: Stop of 'ora.crf' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.mdnsd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.evmd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.ctssd' on 'bjdb2' succeeded
CRS-2675: Stop of 'ora.asm' on 'bjdb2' failed
CRS-2679: Attempting to clean 'ora.asm' on 'bjdb2'
CRS-2681: Clean of 'ora.asm' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cluster_interconnect.haip' on 'bjdb2'
CRS-2677: Stop of 'ora.cluster_interconnect.haip' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.cssd' on 'bjdb2'
CRS-2677: Stop of 'ora.cssd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gipcd' on 'bjdb2'
CRS-2677: Stop of 'ora.gipcd' on 'bjdb2' succeeded
CRS-2673: Attempting to stop 'ora.gpnpd' on 'bjdb2'
CRS-2677: Stop of 'ora.gpnpd' on 'bjdb2' succeeded
CRS-2677: Stop of 'ora.drivers.acfs' on 'bjdb2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'bjdb2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
Then start it again:
crsctl start crs
After waiting a while, check:
# The crsd process has started
root@bjdb2:/>ps -ef|grep crsd.bin|grep -v grep
root 22610148 1 0 16:24:15 - 0:01 /opt/u01/app/11.2.0/grid/bin/crsd.bin reboot
# Finally, crsctl shows that the resources are back to normal
root@bjdb2:/>crsctl stat res -t
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.DATA.dg
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.LISTENER.lsnr
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.OCR_VOTE1.dg
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.asm
ONLINE ONLINE bjdb1 Started
ONLINE ONLINE bjdb2 Started
ora.gsd
OFFLINE OFFLINE bjdb1
OFFLINE OFFLINE bjdb2
ora.net1.network
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.ons
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
ora.registry.acfs
ONLINE ONLINE bjdb1
ONLINE ONLINE bjdb2
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE bjdb2
ora.bjdb.db
1 ONLINE ONLINE bjdb1 Open
2 ONLINE ONLINE bjdb2 Open
ora.bjdb1.vip
1 ONLINE ONLINE bjdb1
ora.bjdb2.vip
1 ONLINE ONLINE bjdb2
ora.cvu
1 ONLINE ONLINE bjdb1
ora.oc4j
1 ONLINE ONLINE bjdb1
ora.scan1.vip
1 ONLINE ONLINE bjdb2
# Check the listeners running on node 2; note that the SCAN listener was already on node 2 during the outage
root@bjdb2:/>ps -ef|grep tns
grid 5308430 1 0 Aug 17 - 5:05 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER_SCAN1 -inherit
grid 5505240 1 1 Aug 17 - 27:23 /opt/u01/app/11.2.0/grid/bin/tnslsnr LISTENER -inherit
root@bjdb2:/>
This concludes the handling of the CRS-4535/CRS-4000 failure on this RAC cluster. Notably, from the moment the problem was found during a routine inspection through the entire troubleshooting process, the RAC database remained available to clients, which speaks to the resilience of RAC.