利用告警日志分析、解决TM-TX锁

时间:2021-12-11 08:46:16
本文将模拟一个Deadlock detected并从数据库SQL和告警日志分析定位相关死锁对象。
之后将展示一个生产环境中的死锁的处理过程。
[oracle@gtlions ~]$ sqlplus gtlions/gtlions

SQL*Plus: Release 10.2.0.5.0 - Production on Mon May 7 14:07:03 2012

Copyright (c) 1982, 2010, Oracle.  All Rights Reserved.


Connected to:
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options

SQL> create table tm_lock_t1(id int);

Table created.

SQL> create table tm_lock_t2(id int);

Table created.

SQL> insert into tm_lock_t1 values(1);

1 row created.

SQL> insert into tm_lock_t2 values(2);

1 row created.

SQL> commit;

Commit complete.

在会话一执行:
SQL> select * from v$mystat where rownum=1;

       SID STATISTIC#       VALUE
---------- ---------- ----------
       148        0           1

SQL> update tm_lock_t1 set id=1+1;

1 row updated.
在会话二执行:
SQL> select * from v$mystat where rownum=1;

       SID STATISTIC#       VALUE
---------- ---------- ----------
       147        0           1

SQL> update tm_lock_t2 set id=2+1;

1 row updated.
返回在会话一执行:
SQL> update tm_lock_t2 set id=3+1;

这个时候就会产生死锁,我们在会话三查看相关信息:
SQL> set pagesize 150;
SQL> set linesize 150;
Select * From dba_blockers;

HOLDING_SESSION
---------------
        147

SQL> Select * From v$lock Where Sid=147;

ADDR         KADDR             SID TY        ID1      ID2       LMODE    REQUEST     CTIME        BLOCK
---------------- ---------------- ---------- -- ---------- ---------- ---------- ---------- ---------- ----------
00000000697A8D70 00000000697A8D98     147 TM      51800        0           3      0        63        0
000000006846DAE0 000000006846DB18     147 TX     196636      225           6      0        63        1 #可以看到147阻塞了其他会话,其他会话数=1,也就是我们测试的会话148。

我们来看下产生死锁的语句的对象信息,首先看下TM锁的P1和P2的含义:
SQL> Select PARAMETER1,PARAMETER2 From v$event_name Where Name Like '%TM%'
  2  ;

PARAMETER1                             PARAMETER2
---------------------------------------------------------------- ----------------------------------------------------------------
name|mode                             object #
name|mode                             QT_OBJ#
果不然是我们的TM_LOCK_T2,可以看到P2是对象ID
SQL> select object_name from dba_objects where object_id=51800;

OBJECT_NAME
--------------------------------------------------------------------------------------------------------------------------------
TM_LOCK_T2
#我们在看看相关TX锁的信息,首先看下TX锁的P1和P2的含义:
SQL> Select PARAMETER1,PARAMETER2 From v$event_name Where Name ='enq: TX - row lock contention';

PARAMETER1                             PARAMETER2
---------------------------------------------------------------- ----------------------------------------------------------------
name|mode                             usn<<16 | slot
#其相关计算关系如下:
SQL> select XIDUSN,XIDSLOT,XIDSQN from V$TRANSACTION where XIDSQN=225;

    XIDUSN    XIDSLOT      XIDSQN
---------- ---------- ----------
     3       28         225

SQL> Select 3*power(2,16)+28,196636 From dual;

3*POWER(2,16)+28     196636
---------------- ----------
      196636     196636

注意,一直到这里还是没有出现死锁,我们说的死锁是会立即返回说明产生了死锁了的,并中断命令,上述只能说是锁等待,这个一定要注意,如果要再现一个死锁现象,你只需要在会话二再执行如下:
SQL> update tm_lock_t1 set id=1+1;
紧接着,你可以在会话一看到如下的信息,说明发生了死锁:
SQL> update tm_lock_t2 set id=3+1;
update tm_lock_t2 set id=3+1
       *
ERROR at line 1:
ORA-00060: deadlock detected while waiting for resource
这个时候你可以去看下告警日志:
[oracle@gtlions bdump]$ tail -f alert_gt10g.log
QMNC started with pid=16, OS id=5866
Mon May 07 13:51:31 CST 2012
Completed: ALTER DATABASE OPEN
Mon May 07 13:53:00 CST 2012
CREATE TABLESPACE gtlions DATAFILE '/u01/oracle/10g/oradata/gt10g/gtlions01.ora' SIZE 10M AUTOEXTEND ON NEXT 5M MAXSIZE 100M
Mon May 07 13:53:02 CST 2012
Completed: CREATE TABLESPACE gtlions DATAFILE '/u01/oracle/10g/oradata/gt10g/gtlions01.ora' SIZE 10M AUTOEXTEND ON NEXT 5M MAXSIZE 100M
Mon May 07 13:53:02 CST 2012
CREATE TEMPORARY TABLESPACE gtlionstmp TEMPFILE   '/u01/oracle/10g/oradata/gt10g/gtlionstmp01.ora'  SIZE 10M AUTOEXTEND ON NEXT 5M MAXSIZE 100M
Completed: CREATE TEMPORARY TABLESPACE gtlionstmp TEMPFILE   '/u01/oracle/10g/oradata/gt10g/gtlionstmp01.ora'  SIZE 10M AUTOEXTEND ON NEXT 5M MAXSIZE 100M
Mon May 07 14:26:36 CST 2012
ORA-00060: Deadlock detected. More info in file /u01/oracle/10g/admin/gt10g/udump/gt10g_ora_6087.trc.

告警日志留下了发生死锁的信息,我们可以看下具体的trc日志:
[oracle@gtlions bdump]$ more /u01/oracle/10g/admin/gt10g/udump/gt10g_ora_6087.trc



*** 2012-05-07 14:26:36.072
*** ACTION NAME:() 2012-05-07 14:26:36.072
*** MODULE NAME:(SQL*Plus) 2012-05-07 14:26:36.072
*** SERVICE NAME:(SYS$USERS) 2012-05-07 14:26:36.072
*** SESSION ID:(148.8) 2012-05-07 14:26:36.072  #错误会话的sid=148
DEADLOCK DETECTED ( ORA-00060 )
[Transaction Deadlock]
The following deadlock is not an ORACLE error. It is a
deadlock due to user error in the design of an application
or from issuing incorrect ad-hoc SQL. The following
information may aid in determining the deadlock:
Deadlock graph:
                       ---------Blocker(s)--------  ---------Waiter(s)---------
Resource Name          process session holds waits  process session holds waits
TX-0001001c-00000109        15     148     X             19     147           X
TX-0003001c-000000e1        19     147     X             15     148           X
session 148: DID 0001-000F-00000011    session 147: DID 0001-0013-00000004
session 147: DID 0001-0013-00000004    session 148: DID 0001-000F-00000011
Rows waited on:
Session 147: obj - rowid = 0000CA57 - AAAMpXAAFAAAAAUAAA #对象ID
  (dictionary objn - 51799, file - 5, block - 20, slot - 0) #对象ID、文件ID、块ID
Session 148: obj - rowid = 0000CA58 - AAAMpYAAFAAAAAcAAA #对象ID
  (dictionary objn - 51800, file - 5, block - 28, slot - 0) #对象ID、文件ID、块ID
Information on the OTHER waiting sessions: #产生死锁会话的sid=147
Session 147:
  sid: 147 ser: 92 audsid: 60013 user: 58/GTLIONS
    flags: (0x41) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
    flags2: (0x8)
  pid: 19 O/S info: user: oracle, term: pts/4, ospid: 6088

 
如果我们需要根据告警信息来定位呢?也很简单,如上,我们可以换算一下相关的信息:
SQL> Select dbms_rowid.rowid_object('AAAMpXAAFAAAAAUAAA'),dbms_rowid.rowid_object('AAAMpYAAFAAAAAcAAA') From dual;

DBMS_ROWID.ROWID_OBJECT('AAAMPXAAFAAAAAUAAA') DBMS_ROWID.ROWID_OBJECT('AAAMPYAAFAAAAACAAA')
--------------------------------------------- ---------------------------------------------
                    51799                          51800

SQL> Select dbms_rowid.rowid_block_number('AAAMpXAAFAAAAAUAAA') From dual;

DBMS_ROWID.ROWID_BLOCK_NUMBER('AAAMPXAAFAAAAAUAAA')
---------------------------------------------------
                         20

SQL> Select to_number('0000CA57','xxxxxxxx'),to_number('0000CA58','xxxxxxxx') From dual ;

TO_NUMBER('0000CA57','XXXXXXXX') TO_NUMBER('0000CA58','XXXXXXXX')
-------------------------------- --------------------------------
               51799                51800
一次在生产环境中的日常巡检中,在告警日志中发现了大量如下信息:
Global Enqueue Services Deadlock detected. More info in file
    /opt/oracle/product/admin/..../.../_ora_6087.trc.        
Global Enqueue Services Deadlock detected. More info in file
    /opt/oracle/product/admin/..../.../_ora_60xx.trc.        
Global Enqueue Services Deadlock detected. More info in file
    /opt/oracle/product/admin/..../.../_ora_6xxx.trc.    
首先询问了开发人员,开是否有正在执行的相关未提交的SQL语句,说是没有,于是继续查看trc文件,发现如下信息:
user session for deadlock lock 0xc00000023b0d4d08
  pid=179 serial=50640 audsid=19934825 user: 59/RBT
  O/S info: user: root, term: unknown, ospid: 1234, machine: tomcat108
            program: JDBC Thin Client
  application name: JDBC Thin Client, hash value=2546894660
  Current SQL Statement:
  delete from tab1 where ID=:1    
上面可以看到产生死锁的会话来源和语句。
继续往下看:
----------resource 0xc00000020bc3ecc0----------------------
resname       : [0x225f6][0x0],[TM]
这里说明了相关的资源的信息:
0x225f6是相关的对象ID,其实这次死锁的产生原因是由于上述的tab1有个键被别的表tab2创建为外键,但是在tab2没有创建相关的索引造成的,可以使用以下语句快速定位相关主外键的信息:
select uc.constraint_name, uc.owner child_owner, uc.table_name child_table, ucc.column_name child_column
,ucp.owner parent_owner, ucp.table_name parent_table, ucp.column_name parent_column
from dba_constraints uc, dba_cons_columns ucc, dba_cons_columns ucp
where uc.owner = 'TRSDATA'
and uc.table_name = 'TABLEB'
and ucc.constraint_name = uc.constraint_name
and ucp.constraint_name = uc.r_constraint_name
在本例中,在tab2上创建下外键的索引就解决了问题,在实际操作过程中,由于业务在不间断的运行,因此创建索引的时候加上了online参数。
-The End-