Diagnosis: Recovering a Database That Would Not Open After a Storage Crash

Date: 2023-03-09 08:38:03
A storage failure crashed the database, and the first symptom was a problem with the control file:

ORA-: ?????  ????
ORA-: ???? : '/oracledata/oradata/orc11rac/orc11rac/system01.dbf'
ORA-: ????????? - ??????
/home/oracle>oerr ora
, , "file is more recent than control file - old control file"
// *Cause: The control file change sequence number in the data file is
// greater than the number in the control file. This implies that
// the wrong control file is being used. Note that repeatedly causing
// this error can make it stop happening without correcting the real
// problem. Every attempt to open the database will advance the
// control file change sequence number until it is great enough.
// *Action: Use the current control file or do backup control file recovery to
// make the control file current. Be sure to follow all restrictions
// on doing a backup control file recovery.
/home/oracle>oerr ora
, , "data file %s: '%s'"
// *Cause: Reporting file name for details of another error
// *Action: See associated error message
/home/oracle>oerr ora
, , "database file %s failed verification check"
// *Cause: The information in this file is inconsistent with information
// from the control file. See accompanying message for reason.
// *Action: Make certain that the db files and control files are the correct
// files for this database.

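This can be fixed by recreating the control file and then running recover database.
Note that in a RAC environment cluster_database must be turned off first,
so that the work is done in a single-instance (single redo thread) environment.
Otherwise CREATE CONTROLFILE may fail with the following errors: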

ORA-: CREATE CONTROLFILE failed
ORA-: operation requires database is in EXCLUSIVE mode
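
The sequence described above might look roughly like the following in SQL*Plus. This is a minimal sketch, not the exact commands used in this case: the pfile path and script path are hypothetical, and it assumes the old control file can still be mounted long enough to dump a CREATE CONTROLFILE script.

```sql
-- Sketch only. Assumptions: connected AS SYSDBA on one node, every other
-- RAC instance is shut down, and both file paths below are hypothetical.
-- In the pfile, first set:   *.cluster_database=false

STARTUP MOUNT PFILE='/tmp/init_recover.ora'

-- Dump a CREATE CONTROLFILE script while the old control file still mounts.
ALTER DATABASE BACKUP CONTROLFILE TO TRACE AS '/tmp/create_controlfile.sql';

SHUTDOWN IMMEDIATE
STARTUP NOMOUNT PFILE='/tmp/init_recover.ora'

-- The generated script contains both a NORESETLOGS and a RESETLOGS variant;
-- trim it to the single CREATE CONTROLFILE statement that fits before running.
@/tmp/create_controlfile.sql

-- With the NORESETLOGS variant, ordinary media recovery follows.
RECOVER DATABASE
```

If the RESETLOGS variant is used instead, recovery would typically be run as RECOVER DATABASE USING BACKUP CONTROLFILE.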

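I had hoped that would be the end of it, but during recovery the data files, online redo logs, and archived logs all turned out to be corrupted, and the database still would not open after every conventional recovery method.
As a last resort I tried the hidden parameter _allow_resetlogs_corruption.
Add the parameter to the pfile: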

*._allow_resetlogs_corruption=true
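
With the parameter in the pfile, the open attempt itself usually takes the following shape. This is a hedged sketch: the pfile path is hypothetical, and whether USING BACKUP CONTROLFILE is needed depends on how the control file was recreated.

```sql
-- Sketch only; the pfile path is hypothetical.
STARTUP MOUNT PFILE='/tmp/init_recover.ora'

-- Start an incomplete recovery and answer CANCEL at the first prompt.
-- Drop "USING BACKUP CONTROLFILE" if the control file is a current one.
RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL

-- _allow_resetlogs_corruption lets this open skip the consistency checks,
-- so the database may be logically inconsistent afterwards.
ALTER DATABASE OPEN RESETLOGS;
```

A database opened this way is generally treated as a source for exporting data, not as something to keep running in production.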

When the database is opened with resetlogs under this parameter, inconsistent SCNs can cause ORA-00600 [2662]:

ORA-: internal error code, arguments: [], [], [], [], [], [], [], []
- =

The arguments of the ORA-600 error change with every restart attempt:

ORA-: internal error code, arguments: [], [], [], [], [], [], [], []
- =

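The value shrinks from 19980 to 19972, that is, by 8 per attempt. When the gap is small, a few more restarts can be enough to clear this error.
Here, however, closing the gap at that rate (19972 / 8, rounded up) would take 2497 restarts, which was not acceptable in the time available.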

At this point the SCN can be adjusted with one of Oracle's internal events:

There are two common ways to advance the SCN:

1. With immediate trace name (while the database is open)

alter session set events 'IMMEDIATE trace name ADJUST_SCN level x';

2. With event 10015 (when the database cannot be opened, in mount state)

alter session set events '10015 trace name adjust_scn level x';

Note: level 1 advances the SCN by 1024*1024*1024 (roughly 1 billion), which is usually enough. The level can be adjusted as the situation requires.
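
To make the level arithmetic concrete, the increment for a given level can be computed directly. This is only an illustration of the 1024*1024*1024-per-level rule stated in the note above; the column aliases are made up for the example.

```sql
-- Each level adds 2^30 = 1024*1024*1024 to the SCN, per the note above.
SELECT 1  * POWER(2, 30) AS bump_level_1,    -- 1073741824
       10 * POWER(2, 30) AS bump_level_10    -- 10737418240
FROM   dual;
```

So level 10, as used in the session that follows, corresponds to an increase of a little over 10.7 billion.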

SQL> alter session set events 'IMMEDIATE trace name ADJUST_SCN level 10';

Session altered.

SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-01113: file 1 needs media recovery
ORA-01110: data file 1: '/oracledata/oradata/orc11rac/orc11rac/system01.dbf'

SQL> recover database
Media recovery complete.
SQL> alter database open;
alter database open
*
ERROR at line 1:
ORA-00603: ORACLE server session terminated by fatal error
Process ID: 27474
Session ID: 1105 Serial number: 5

The database still would not open, and the alert log reported:

ORA-: internal error code, arguments: [], [], [], [], [], [], [], []

The ORA-600 arguments had changed, so the adjustment above had taken effect, but it triggered a new error.

DESCRIPTION: 

A mismatch has been detected between Redo records and Rollback (Undo)
records. We are validating the Undo block sequence number in the undo block against
the Redo block sequence number relating to the change being applied.
This error is reported when this validation fails.

ARGUMENTS:
Arg [a] Undo record seq number
Arg [b] Redo record seq number

FUNCTIONALITY:
KERNEL TRANSACTION UNDO

ORA- [] [a] [b] [ ] [ ] [ ]
Versions: 7.2. - 9.2. Source: ktuc.c
===========================================================================
Meaning: seq# mismatch while adding an undo record to an undo block. This
is done by the application of redo.
---------------------------------------------------------------------------
Argument Description:
a. (ktubhseq): undo record seq# - this is the seq# of the block that
this undo record WILL BE APPLIED TO.
This is from the Undo Block. It is
NOT the seq# of the undo block itself.
b. (ktudbseq): redo RECORD seq# - this is the seq# number in the block
that this redo WILL BE APPLIED TO.
This is from the Redo Record.
---------------------------------------------------------------------------
Diagnosis:
This error is raised in kturdb which handles the adding of undo records
by the application of redo.
When we try to apply redo to an undo block (forward changes are made by
the application of redo to a block) we check that the seq# in the undo
record matches the seq# in the redo record. These seq# should be the
same because when we apply a redo record we must apply it to the
correct version of the block. We can only apply a redo record to a
block that contains the same seq# as in the redo record.
If the seq# do not match then this error is raised. This implies some
kind of block corruption in either the redo or the undo block.

7.3.x - 8.1..x
ASSERT2(ubh->ktubhseq == db->ktudbseq, OERI(), KSESVSGN,
ubh->ktubhseq, db->ktudbseq);
9.2.x
ksesic2(OERI(), ksenrg(ubh->ktubhseq), ksenrg(db->ktudbseq));

struct ktubh
{
kxid ktubhxid; /* txid of tx currently using or last used this block */
ub2 ktubhseq; /* undo block sequence number */
ub1 ktubhcnt; /* high water mark record index, number of undo entries */
ub1 ktubhirb; /* rollback record index, rec index to start the rollback */
ub1 ktubhicl; /* collecting record index, rec index to start retrieving col info */
ub1 ktubhflg; /* dummy */
ub2 ktubhidx[]; /* byte offset of record in block, grows at runtime */
};

struct ktudb Kernel Transaction Undo Data operation Block (redo)
{
ub2 ktudbsiz; /* size of entry */
ub2 ktudbspc; /* verification: space left in undo block */
ub2 ktudbflg; /* flag to indicate the kind of redo operation */
kxid ktudbxid; /* current tx id */
ub2 ktudbseq; /* block sequence number */
ub1 ktudbrec; /* new record index for this change */
};

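There are two ways to handle this:
1. Create a new UNDO tablespace;
2. Switch undo management to manual.
This time I chose manual undo management and changed the parameter file: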

*.undo_management=manual
SQL> startup mount
ORACLE instance started.

Total System Global Area 1.3429E+10 bytes
Fixed Size 2149040 bytes
Variable Size 6845105488 bytes
Database Buffers 6576668672 bytes
Redo Buffers 4730880 bytes
Database mounted.
SQL> alter database open;

Database altered.

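With that, the database opened successfully, and the required data could be exported as a backup.
On some versions a temp file still has to be added to the TEMP tablespace first.
Keep watching the alert log for further errors. A new UNDO tablespace can also be created so that undo management can be switched back to auto.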
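
The follow-up steps mentioned above could be sketched as below. The file names, sizes, and the tablespace name UNDOTBS_NEW are hypothetical, and TEMP is assumed to be the name of the temporary tablespace; none of them are taken from the actual environment.

```sql
-- Sketch only; every file name, size and tablespace name here is hypothetical.

-- Give the temporary tablespace a temp file so that exports and sorts can run.
ALTER TABLESPACE temp
  ADD TEMPFILE '/oracledata/oradata/orc11rac/orc11rac/temp_new01.dbf' SIZE 1G;

-- Create a fresh undo tablespace to replace the damaged one.
CREATE UNDO TABLESPACE undotbs_new
  DATAFILE '/oracledata/oradata/orc11rac/orc11rac/undotbs_new01.dbf' SIZE 2G;

-- Then switch back to automatic undo in the pfile and restart:
--   *.undo_management=auto
--   *.undo_tablespace='UNDOTBS_NEW'
-- In RAC, each instance needs its own undo tablespace, so the
-- undo_tablespace entry is normally set per instance rather than with '*'.
```

Because the database was opened with consistency checks bypassed, exporting the data and rebuilding into a new database is usually the safer end state.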