Oracle 11g RAC services cannot be stopped

Date: 2022-01-26 08:35:51
1. Problem:
On node 2, running crsctl stop crs -f did not stop the RAC services; all nine d.bin-related processes were still running.
Version: Oracle 11.2.0.4 for Solaris
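To confirm which Clusterware daemons are still running after the stop attempt, the process list can be filtered for the d.bin suffix. The helper below is a minimal sketch; the function name count_dbin is hypothetical and not part of any Oracle tooling.

```shell
# Count process-list lines that mention a *d.bin executable
# (for example crsd.bin or ocssd.bin). Reads `ps -ef` output on stdin.
count_dbin() {
    # The regex needs a literal "d.bin", so the grep command line itself
    # (which contains "d\.bin") is not counted.
    grep -c 'd\.bin'
}

# Usage on the affected node:
#   ps -ef | count_dbin
```

A result of 0 after crsctl stop crs -f would mean the stack is fully down; any other value means some daemons are still running.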

2. Analysis:
The alert.log file under /abcapp/oragrid/11.2.0/log/abc208 shows the following:
[/abcapp/oragrid/11.2.0/bin/scriptagent.bin(10605)]CRS-5818:Aborted command 'clean' for resource 'ora.oc4j'. Details at (:CRSAGF00113:) {2:26009:18659} in /abcapp/oragrid/11.2.0/log/abc208/agent/crsd/scriptagent_oragrid/scriptagent_oragrid.log.
2017-08-30 23:28:10.192: 
[crsd(62374)]CRS-2757:Command 'Clean' timed out waiting for response from the resource 'ora.oc4j'. Details at (:CRSPE00111:) {2:26009:18659} in /abcapp/oragrid/11.2.0/log/abc208/crsd/crsd.log.
/abcapp/oragrid/11.2.0/log/abc208/crsd/crsd.log reports the following errors:
2017-08-30 23:48:10.228: [UiServer][47]{2:26009:18672} Container [ Name: ORDER
        MESSAGE: 
        TextMessage[CRS-2680: Clean of 'ora.oc4j' on 'abc208' failed]
        MSGTYPE: 
        TextMessage[1]
        OBJID: 
        TextMessage[ora.oc4j]
        WAIT: 
        TextMessage[0]
]
2017-08-30 23:48:10.228: [   CRSPE][46]{2:26009:18672} Sequencer for [ora.oc4j 1 1] has completed with error: CRS-0216: Could not stop resource 'ora.oc4j'.
2017-08-30 23:48:10.230: [UiServer][47]{2:26009:18673} Container [ Name: ORDER
        MESSAGE: 
        TextMessage[CRS-2503: Resource 'ora.oc4j' is in UNKNOWN state and must be stopped first]
        MSGTYPE: 
        TextMessage[1]
        OBJID: 
        TextMessage[ora.oc4j]
        WAIT: 
        TextMessage[0]
]

/abcapp/oragrid/11.2.0/log/abc208/agent/crsd/scriptagent_oragrid/scriptagent_oragrid.log shows the following:

2017-08-30 22:37:10.040: [ora.oc4j][46]{1:63945:12686} [check] Executing action script: /abcapp/oragrid/11.2.0/bin/oc4jctl[check]
2017-08-30 22:37:49.597: [    AGFW][9]{1:63945:12686} Agent received the message: AGENT_HB[Engine] ID 12293:21601515
2017-08-30 22:38:10.044: [   AGENT][58]{1:63945:12686} {1:63945:12686} Created alert : (:CRSAGF00113:) :  Aborting the command: check for resource: ora.oc4j 1 1
2017-08-30 22:38:10.044: [ora.oc4j][58]{1:63945:12686} [check] Killing action script: check
2017-08-30 22:38:10.044: [    AGFW][58]{1:63945:12686} Command: check for resource: ora.oc4j 1 1 completed with status: TIMEDOUT
2017-08-30 22:38:10.072: [    AGFW][46]{1:63945:12686} Received unknown resource status code: 255
2017-08-30 22:38:49.600: [    AGFW][9]{1:63945:12686} Agent received the message: AGENT_HB[Engine] ID 12293:21601539
2017-08-30 22:39:10.047: [ora.oc4j][46]{1:63945:12686} [check] Executing action script: /abcapp/oragrid/11.2.0/bin/oc4jctl[check]
2017-08-30 22:39:49.603: [    AGFW][9]{1:63945:12686} Agent received the message: AGENT_HB[Engine] ID 12293:21601561
2017-08-30 22:40:10.049: [   AGENT][58]{1:63945:12686} {1:63945:12686} Created alert : (:CRSAGF00113:) :  Aborting the command: check for resource: ora.oc4j 1 1
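The agent log above shows the check action for ora.oc4j timing out roughly every two minutes. To gauge how long this has been going on, the TIMEDOUT completions for one resource can be counted. This is a minimal sketch; count_timedout is a hypothetical helper name, and it assumes the log file contains the unwrapped form of the "Command: ... completed with status: TIMEDOUT" line shown above.

```shell
# Count how many agent commands for a given resource ended with TIMEDOUT.
# Usage: count_timedout <logfile> <resource>
count_timedout() {
    logfile=$1
    resource=$2
    # -F treats the pattern as a fixed string, so the dot in "ora.oc4j"
    # is matched literally. The second grep prints the number of matches.
    grep -F "for resource: ${resource} " "$logfile" \
        | grep -c 'completed with status: TIMEDOUT'
}

# Usage on the affected node:
#   count_timedout /abcapp/oragrid/11.2.0/log/abc208/agent/crsd/scriptagent_oragrid/scriptagent_oragrid.log ora.oc4j
```

A steadily growing count suggests the action script is hanging on every invocation rather than failing once.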

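The logs clearly show that the oc4j service could not be stopped, which blocked the shutdown of the services after it. oc4j runs as a JVM process, so in theory killing the java process owned by the grid user should be enough.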
-bash-4.1$ kill -9 10789
-bash-4.1$ ps -ef |grep 10789
 oragrid 10789     1   0   May 29 ?         847:17 /abcapp/oragrid/11.2.0/jdk/bin/sparcv9/java -server -Xcheck:jni -Xms128M -Xmx
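Repeated kill attempts had no effect.
This indicates that the problem was caused by a hung java process that could not be killed. A check showed that node 1 was not running the oc4j service and had no corresponding java process under the grid user, so it was not affected.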
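Rather than re-running kill and ps by hand, the check can be scripted: send SIGKILL once, then poll for a few seconds to see whether the PID disappears. The following is a minimal sketch; kill_and_wait is a hypothetical helper name, and the default timeout of 10 seconds is an arbitrary assumption.

```shell
# Send SIGKILL and wait up to <timeout> seconds for the process to disappear.
# Returns 0 if the process is gone, 1 if it is still present.
# Usage: kill_and_wait <pid> [timeout_seconds]
kill_and_wait() {
    pid=$1
    timeout=${2:-10}
    kill -9 "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 delivers no signal; it only tests whether the PID exists
        if ! kill -0 "$pid" 2>/dev/null; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Usage with the PID from the session above:
#   kill_and_wait 10789 || echo "process 10789 survived SIGKILL"
```

A process that survives SIGKILL is usually blocked inside the kernel, which is consistent with the reboot being the only way out here. On Solaris, ps -o pid,s,etime,args -p 10789 should show its state column, although what it reported in this incident is not recorded in the source.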

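3. Solution:
Reboot the OS on node 2 with init 6. If init 6 has no effect, kill the crsd process, after which the OS can reboot.
After the OS came back up, the CRS services and the database instance started normally, and the oc4j service was started with crsctl start res ora.oc4j. Finally, restarting the CRS services on node 1 went smoothly.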
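After the restart, the resource state can be verified from the attribute-style output of crsctl stat res ora.oc4j, which includes a STATE= line. The helper below is a minimal sketch for pulling out that value; res_state is a hypothetical name, and the output format is assumed to be the NAME=/TYPE=/TARGET=/STATE= layout used by 11.2.

```shell
# Print the STATE value from `crsctl stat res <name>` output read on stdin.
res_state() {
    # -n suppresses default output; p prints only lines where the
    # substitution happened, with the "STATE=" prefix stripped.
    sed -n 's/^STATE=//p'
}

# Usage on a cluster node (run as the grid owner):
#   crsctl check crs
#   crsctl stat res ora.oc4j | res_state
```

A value beginning with ONLINE indicates that the resource is running on the node named after it; UNKNOWN would mean the condition from the crsd.log excerpt has returned.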