[Issue Log] sbd: database suddenly hung

Time: 2021-04-16 21:44:00

In the morning the customer reported a server error and sent the following error message:

<?xml version="1.0" encoding="GB2312"?>
<ErrorMessages><Error><MsgFileName>BI00000000000020130716090500.xml</MsgFileName><ProcessTime>20130716090522</ProcessTime><Sender>bjkjfedex</Sender><Receiver>CIQBMS</Receiver><ERROR_INFO>报文入库出错.java.lang.Exception: GccKjMsgDao.saveMsgToDb() error, msgType:CIQBMS_KJ_DECL,e:org.springframework.jdbc.UncategorizedSQLException: Hibernate operation: Cannot open connection; uncategorized SQLException for SQL [???]; SQL state [72000]; error code [1034]; ORA-01034: ORACLE not available ORA-27101: shared memory realm does not exist Linux-x86_64 Error: 2: No such file or directory ; nested exception is java.sql.SQLException: ORA-01034: ORACLE not available ORA-27101: shared memory realm does not exist Linux-x86_64 Error: 2: No such file or directory</ERROR_INFO><ERROR_LEVEl>0</ERROR_LEVEl><ERROR_TYPE>2</ERROR_TYPE><ERROR_STATUS>0</ERROR_STATUS></Error></ErrorMessages>

I checked the database and found that the instance was down and could not be started.

I checked the alert log and then looked at the trace file:

[oracle@app2 bdump]$ more /u02/oracle/admin/sbd/bdump/sbd_ckpt_7861.trc
/u02/oracle/admin/sbd/bdump/sbd_ckpt_7861.trc
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /u02/oracle/product/10.2.0/db_1
System name:    Linux
Node name:      app2
Release:        2.6.32-220.el6.x86_64
Version:        #1 SMP Wed Nov 9 08:03:13 EST 2011
Machine:        x86_64
Instance name: sbd
Redo thread mounted by this instance: 1
Oracle process number: 10
Unix process pid: 7861, image: oracle@app2 (CKPT)


*** 2013-07-16 07:57:02.519
*** SERVICE NAME:(SYS$BACKGROUND) 2013-07-16 07:57:02.513
*** SESSION ID:(327.1) 2013-07-16 07:57:02.513
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control03.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 30: Read-only file system

Additional information: 4
Additional information: 3
Additional information: -1
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control02.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 30: Read-only file system
Additional information: 4
Additional information: 3
Additional information: -1
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control01.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 3
Additional information: -1
error 221 detected in background process
ORA-00221: error on write to control file
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control03.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 30: Read-only file system
Additional information: 4
Additional information: 3
Additional information: -1
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control02.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 30: Read-only file system
Additional information: 4
Additional information: 3
Additional information: -1
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control01.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
Additional information: 3
Additional information: -1
Tue Jul 16 07:57:02 CST 2013
CKPT: terminating instance due to error 221
Instance terminated by CKPT, pid = 7861

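This points to an I/O problem: CKPT could not write to any of the three control files (OS error 30, read-only file system, and OS error 5, input/output error), so it terminated the instance.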
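
When a trace repeats the same error stack many times, it can help to reduce it to one line per control file. The following is a minimal sketch, not part of the original investigation: it pairs each ORA-00202 line with the OS error that follows it. The function name summarize_trace and the embedded sample (a few lines copied from the trace above) are illustrative assumptions.

```python
import re

# Sample lines copied from the CKPT trace above (shortened to two control files).
TRACE = """\
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control03.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 30: Read-only file system
Additional information: 4
ORA-00206: error in writing (block 3, # blocks 1) of control file
ORA-00202: control file: '/ora/oradata/sbd/control01.ctl'
ORA-27072: File I/O error
Linux-x86_64 Error: 5: Input/output error
Additional information: 4
"""

CONTROL_FILE = re.compile(r"ORA-00202: control file: '([^']+)'")
OS_ERROR = re.compile(r"Linux-x86_64 Error: (\d+): (.+)")


def summarize_trace(text):
    """Map each control file path to the set of (errno, message) pairs reported for it."""
    failures = {}
    current = None  # control file named by the most recent ORA-00202 line
    for line in text.splitlines():
        match = CONTROL_FILE.search(line)
        if match:
            current = match.group(1)
            failures.setdefault(current, set())
            continue
        match = OS_ERROR.search(line)
        if match and current is not None:
            failures[current].add((int(match.group(1)), match.group(2).strip()))
            current = None  # wait for the next ORA-00202 before pairing again
    return failures


if __name__ == "__main__":
    for path, errors in sorted(summarize_trace(TRACE).items()):
        for errno_value, message in sorted(errors):
            print(f"{path}: errno {errno_value} ({message})")
```

Reading the real file instead of the sample would only require passing its contents to summarize_trace. Every control file in the trace lives under /ora, which suggests looking at that mount point rather than at Oracle itself.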

Next I checked the system log, /var/log/messages:

Jul 16 07:56:32 app2 kernel: lpfc 0000:13:00.0: 0:1305 Link Down Event x2 received Data: x2 x20 x80000 x0 x0
Jul 16 07:56:32 app2 fcoemon: received fc event message 559
Jul 16 07:56:32 app2 fcoemon: seconds:1373932592 host1 event_datalen:4
Jul 16 07:56:32 app2 fcoemon: event_num:6318 event_code:3 event_data:0
Jul 16 07:57:02 app2 kernel: rport-1:0-0: blocked FC remote port time out: removing target and saving binding
Jul 16 07:57:02 app2 kernel: lpfc 0000:13:00.0: 0:(0):0203 Devloss timeout on WWPN 20:32:00:80:e5:2e:43:7a NPort x0000e4 Data: x0 x7 x0
Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] Unhandled error code
Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] CDB: Write(10): 2a 00 00 4e c0 8f 00 00 20 00
Jul 16 07:57:02 app2 kernel: end_request: I/O error, dev sdd, sector 5161103
Jul 16 07:57:02 app2 kernel: __ratelimit: 1830 callbacks suppressed
Jul 16 07:57:02 app2 kernel: Buffer I/O error on device sdd1, logical block 645130    <-- sdd1 is the device mounted at /ora; this is where the failure occurred
Jul 16 07:57:02 app2 kernel: lost page write due to I/O error on sdd1
Jul 16 07:57:02 app2 kernel: Buffer I/O error on device sdd1, logical block 645131
Jul 16 07:57:02 app2 kernel: lost page write due to I/O error on sdd1
Jul 16 07:57:02 app2 kernel: Buffer I/O error on device sdd1, logical block 645132
Jul 16 07:57:02 app2 kernel: lost page write due to I/O error on sdd1
Jul 16 07:57:02 app2 kernel: Buffer I/O error on device sdd1, logical block 645133
Jul 16 07:57:02 app2 kernel: lost page write due to I/O error on sdd1

Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] Unhandled error code
Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jul 16 07:57:02 app2 kernel: sd 1:0:0:2: [sdd] CDB: Write(10): 2a 00 23 c5 ad 6f 00 00 10 00
Jul 16 07:57:02 app2 kernel: end_request: I/O error, dev sdd, sector 600157551
Jul 16 07:57:02 app2 kernel: Aborting journal on device sdd1-8.
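
The order of events in this log is worth making explicit: the FC link goes down first, and the I/O errors on sdd start later. The sketch below is an illustration only; it measures the gap between two events in syslog-style text. The helper names first_time and seconds_between are hypothetical, the sample lines are copied from the excerpt above, and the year 2013 is taken from the trace file because syslog timestamps carry no year.

```python
import re
from datetime import datetime

# Lines copied from the /var/log/messages excerpt above.
MESSAGES = """\
Jul 16 07:56:32 app2 kernel: lpfc 0000:13:00.0: 0:1305 Link Down Event x2 received Data: x2 x20 x80000 x0 x0
Jul 16 07:57:02 app2 kernel: rport-1:0-0: blocked FC remote port time out: removing target and saving binding
Jul 16 07:57:02 app2 kernel: end_request: I/O error, dev sdd, sector 5161103
Jul 16 07:57:02 app2 kernel: Aborting journal on device sdd1-8.
"""

# Syslog prefix "Mon DD HH:MM:SS", followed by the rest of the line.
LINE = re.compile(r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) (.*)$")


def first_time(text, needle, year=2013):
    """Return the timestamp of the first line whose text contains `needle`, or None."""
    for line in text.splitlines():
        match = LINE.match(line)
        if match and needle in match.group(2):
            # Assumption: the year is supplied by the caller, since syslog omits it.
            return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
    return None


def seconds_between(text, start_needle, end_needle):
    """Seconds from the first `start_needle` line to the first `end_needle` line."""
    start = first_time(text, start_needle)
    end = first_time(text, end_needle)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    gap = seconds_between(MESSAGES, "Link Down Event", "I/O error, dev sdd")
    print(f"Link down to first I/O error on sdd: {gap} seconds")
```

For this excerpt the gap is 30 seconds (07:56:32 to 07:57:02), and the first I/O error shares its timestamp with the "blocked FC remote port time out" line. That fits a remote port that stayed blocked until a device-loss timeout expired, although the timeout configured on this host is not shown in the log. The final "Aborting journal" line would also explain the "Read-only file system" errors in the Oracle trace, since a file system whose journal has been aborted is typically switched to read-only.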

I contacted the host team and asked them to check the hardware and storage.

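The host team later reported that the cause was a problem in the underlying software layer. Once it was fixed, the database started normally.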
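
Before restarting the instance after a fix like this, it may be worth confirming that /ora is writable again. The following is a rough sketch under stated assumptions, not a step from the original incident: it parses text in /proc/mounts format and reports whether a mount point carries the ro option. The function names are hypothetical, and the sample line is invented for illustration; only the device (sdd1) and mount point (/ora) come from the log, while the file system type and remaining options are assumed.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for `mount_point` from /proc/mounts-style text, or None."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")  # a later entry overrides an earlier one
    return options


def is_read_only(mounts_text, mount_point):
    """True if the mount point is currently mounted with the `ro` option."""
    options = mount_options(mounts_text, mount_point)
    if options is None:
        raise LookupError(f"{mount_point} is not mounted")
    return "ro" in options


# Hypothetical sample line: fs type and options are assumptions, not from the source.
SAMPLE = "/dev/sdd1 /ora ext4 ro,relatime 0 0\n"

if __name__ == "__main__":
    # On the real host, read /proc/mounts instead of using SAMPLE.
    state = "read-only" if is_read_only(SAMPLE, "/ora") else "read-write"
    print(f"/ora is mounted {state}")
```

The check compares whole options rather than searching for the substring "ro", so an option such as errors=remount-ro does not produce a false positive.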