彻底搞清楚library cache lock的成因和解决方法(一)
问题描述:
接到应用人员的报告,说是在任何对表CSNOZ629926699966的操作都会hang,包括desc CSNOZ629926699966,
例如:
ora9i@cs_dc02:/ora9i > sqlplus pubuser/pubuser
SQL*Plus: Release 9.2.0.4.0 - Production on Mon Jan 10 10:11:06 2005
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
SQL> conn pubuser/pubuser
Connected.
SQL> desc CSNOZ629926699966
。。。
这个进程 hang 了
。。。
询问了一下之前有无特别的操作,业务人员说很久以前执行了脚本,但是该教本运行很久都没有结果,然后他就退出了会话,再之后,就出现了上面的情况。脚本内容如下:
$ cat CSNOZ629926699966.sh
#!/bin/sh
sqlplus pubuser/pubuser@csmisc << EOF  #use your username/password
create table CSNOZ629926699966 as select * from CSNOZ62992266cs
where mid not in ( select mid from pubuser.SUBSCRIPTION_BAK_200412@newdb where servid='020999011964' and status in ('A','B','S'));
exit;
$ 
$ 
$ 
$ 
解决过程:
ora9i@cs_dc02:/ora9i > sqlplus "/ as sysdba"
SQL*Plus: Release 9.2.0.4.0 - Production on Mon Jan 10 10:19:13 2005
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
SQL> select * from v$lock where block=1;
no rows selected
SQL> /
no rows selected
SQL> /
no rows selected
SQL> 
我们看到目前没有锁的信息
SQL> select xidusn, object_id, session_id, locked_mode from v$locked_object;
。。。
 XIDUSN  OBJECT_ID SESSION_ID LOCKED_MODE
---------- ---------- ---------- -----------
        14         18         37           3
。。。
SQL> /
。。。
 XIDUSN  OBJECT_ID SESSION_ID LOCKED_MODE
---------- ---------- ---------- -----------
        14         18         37           3
。。。
SQL> /
。。。
 XIDUSN  OBJECT_ID SESSION_ID LOCKED_MODE
---------- ---------- ---------- -----------
        14         18         37           3
。。。
SQL> 
查找 v$locked_object,我们发现了一个可疑的会话,SID 37:
SQL> select object_name,owner,object_type from dba_objects where object_id=18;
。。。 。。。
OBJECT_NAME                    OWNER                          OBJECT_TYPE
------------------------------ ------------------------------ ------------------
OBJ$                           SYS                            TABLE
。。。 。。。
SQL>
奇怪怎么一直有这个锁??
初步猜测是由于SID为37的会话执行了上面的DDL语句,并在语句未完成前异常退出,
造成了所有访问那个(DDL语句中涉及到的)对象的进程都hang了。
接下来我们看看等待事件:
SQL> select event,sid,p1,p2,p3 from v$session_wait where event not like 'SQL*%' and event not like 'rdbms%';
EVENT                                                                    P1         P2        SID
---------------------------------------------------------------- ---------- ---------- ----------
pmon timer                                                              300          0          1
ges remote message                                                       32          0          4
gcs remote message                                                       64          0          5
gcs remote message                                                       64          0          7
smon timer                                                              300          0         19
library cache lock                                               1.3835E+19 1.3835E+19         30
wakeup time manager                                                       0          0         22
7 rows selected.
SQL> /
EVENT                                                                    P1         P2        SID
---------------------------------------------------------------- ---------- ---------- ----------
pmon timer                                                              300          0          1
ges remote message                                                       32          0          4
gcs remote message                                                       64          0          5
gcs remote message                                                       64          0          7
smon timer                                                              300          0         19
library cache lock                                               1.3835E+19 1.3835E+19         30
wakeup time manager                                                       0          0         22
7 rows selected.
SQL> /
EVENT                                                                    P1         P2        SID
---------------------------------------------------------------- ---------- ---------- ----------
pmon timer                                                              300          0          1
ges remote message                                                       32          0          4
gcs remote message                                                       64          0          5
gcs remote message                                                       64          0          7
smon timer                                                              300          0         19
library cache lock                                               1.3835E+19 1.3835E+19         30
wakeup time manager                                                       0          0         22
7 rows selected.
SQL> /
EVENT                                                                    P1         P2        SID
---------------------------------------------------------------- ---------- ---------- ----------
pmon timer                                                              300          0          1
ges remote message                                                       32          0          4
gcs remote message                                                       64          0          5
gcs remote message                                                       64          0          7
smon timer                                                              300          0         19
library cache lock                                               1.3835E+19 1.3835E+19         30
wakeup time manager                                                       0          0         22
7 rows selected.
SQL>
我们注意到下面的事件:
EVENT                                                                    P1         P2        SID
---------------------------------------------------------------- ---------- ---------- ----------
。。。
library cache lock 1.3835E+19 1.3835E+19 30
。。。
P1 是句柄地址(handle address),也就是'library cache lock'发生的地址。
P2 是一个状态对象,在这里,它表示在对象上加载的锁的地址(lock address)。
P1 和 P2都是科学计数发表示的10进制数。
这些信息再次证实了上面的猜测,SID 37阻塞了SID 30。
找出这两个可疑进程的sid和serial,然后对他们设置10046事件:
SQL> select sid,serial# from v$session where sid in (30,37);
       SID    SERIAL#
---------- ----------
        30      24167
        37       2707
SQL> exec dbms_system.set_ev(30,24167,10046,12,'');
PL/SQL procedure successfully completed.
SQL> exec dbms_system.set_ev(37,2707,10046,12,'');
PL/SQL procedure successfully completed.
SQL>
跟踪期间咱们再次测试一下,看看有没有其他线索。
新开一个进程,找出其sid, serial和spid等信息:
ora9i@cs_dc01:/ora9i > sqlplus pubuser/pubuser                                                                                  
SQL*Plus: Release 9.2.0.4.0 - Production on Mon Jan 10 11:36:25 2005
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
SQL> select distinct sid from v$mystat;
       SID
----------
        33
SQL> select sid,serial# from v$session where sid=33;
       SID    SERIAL#
---------- ----------
        33       6639
SQL> SELECT SPID,PID FROM V$PROCESS WHERE ADDR=(SELECT PADDR FROM V$SESSION WHERE SID=37);
SPID                PID
------------ ----------
20552                26
SQL> SELECT SPID,PID FROM V$PROCESS WHERE ADDR=(SELECT PADDR FROM V$SESSION WHERE SID=30);
SPID                PID
------------ ----------
22580                28
SQL> show parameter dump
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
background_core_dump                 string      partial
background_dump_dest                 string      /ora9i/app/oracle/admin/csmisc
                                                 /bdump
core_dump_dest                       string      /ora9i/app/oracle/admin/csmisc
                                                 /cdump
max_dump_file_size                   string      UNLIMITED
shadow_core_dump                     string      partial
user_dump_dest                       string      /ora9i/app/oracle/admin/csmisc
                                                 /udump
SQL>        
然后,再尝试对 CSNOZ629926699966 表进行操作
SQL> desc CSNOZ629926699966
。。。
还是hang住了。
于是中断这个操作(CTRL + C):
SQL> desc CSNOZ629926699966
ERROR:
ORA-01013: user requested cancel of current operation
SQL> select tname from tab where tname='CSNOZ629926699966';
no rows selected
SQL> 
查看PUBUSER用户下的这个表,居然不存在!!
进一步证实了前面的猜测,也就是说会话37阻塞了其他所有操作表CSNOZ629926699966的会话,造成了进程的hang,当然,包括上面的SID 30和SID 33的DDL语句
现在,我们结束10046的事件跟踪:
SQL> exec dbms_system.set_ev(30,24167,0,0,'');
PL/SQL procedure successfully completed.
SQL> exec dbms_system.set_ev(37,2707,0,0,'');
PL/SQL procedure successfully completed.
SQL> 
根据上面记录的信息,我们知道这两个会话产生的跟踪信息分别为:
SID为30的会话,产生的跟踪文件为:/ora9i/app/oracle/admin/csmisc/udump/csmisc2_ora_22580.trc
SID为37的会话,产生的跟踪文件为:/ora9i/app/oracle/admin/csmisc/udump/csmisc2_ora_20552.trc
看看trace文件:
ora9i@cs_dc02:/ora9i > cd /ora9i/app/oracle/admin/csmisc/udump
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > ll -tlc
total 4432
-rw-r-----   1 ora9i      dba         332995 Jan 10 12:00 csmisc2_ora_22580.trc
-rw-r-----   1 ora9i      dba           3168 Jan 10 11:59 csmisc2_ora_20552.trc
-rw-r-----   1 ora9i      dba         407133 Jan  7 15:10 csmisc2_ora_2708.trc
-rw-r-----   1 ora9i      dba            640 Jan  7 14:48 csmisc2_ora_835.trc
-rw-r-----   1 ora9i      dba           1590 Dec 30 22:50 csmisc2_ora_16244.trc
-rw-r-----   1 ora9i      dba        1308403 Dec 30 22:44 csmisc2_ora_16033.trc
-rw-r-----   1 ora9i      dba            616 Dec 28 14:16 csmisc2_ora_2176.trc
-rw-r-----   1 ora9i      dba            644 Dec  8 18:22 csmisc2_ora_21083.trc
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > mailx -s "csmisc2_ora_22580.trc" zhangdp@aspire-tech.com < csmisc2_ora_22580.trc
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > mailx -s "csmisc2_ora_20552.trc" zhangdp@aspire-tech.com < csmisc2_ora_20552.trc
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > exit
SQL>
我们看到SID为30的会话,产生的跟踪文件(csmisc2_ora_22580.trc)为的主要内容是:
/ora9i/app/oracle/admin/csmisc/udump/csmisc2_ora_22580.trc
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
ORACLE_HOME = /ora9i/app/oracle/product/920
System name: HP-UX
Node name: cs_dc02
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: csmisc2
Redo thread mounted by this instance: 2
Oracle process number: 28
Unix process pid: 22580, image: oracle@cs_dc02 (TNS V1-V3)
*** 2005-01-10 11:31:49.416
*** SESSION ID:(30.24167) 2005-01-10 11:31:49.354
WAIT #0: nam='library cache lock' ela= 507258 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 505686 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507678 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507595 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507880 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507317 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507703 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507683 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 508265 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507100 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507684 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 505889 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507731 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507650 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507604 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
WAIT #0: nam='library cache lock' ela= 507698 p1=-4611686013547141416 p2=-4611686013691716064 p3=1301
。。。 。。。
我们看到SID 30的跟踪文件中的等待事件就是在V$SESSION_WAIT中看到的'library cache lock' .
再看看SID为37的会话,产生的跟踪文件(csmisc2_ora_20552.trc)为的主要内容是:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
ORACLE_HOME = /ora9i/app/oracle/product/920
System name: HP-UX
Node name: cs_dc02
Release: B.11.11
Version: U
Machine: 9000/800
Instance name: csmisc2
Redo thread mounted by this instance: 2
Oracle process number: 26
Unix process pid: 20552, image: oracle@cs_dc02 (TNS V1-V3)
*** 2005-01-10 11:33:22.702
*** SESSION ID:(37.2707) 2005-01-10 11:33:22.690
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:35:07.452
WAIT #1: nam='SQL*Net message from dblink' ela= 102293555 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 3 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:36:55.980
WAIT #1: nam='SQL*Net message from dblink' ela= 105969709 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:39:05.416
WAIT #1: nam='SQL*Net message from dblink' ela= 126390826 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:41:12.878
WAIT #1: nam='SQL*Net message from dblink' ela= 124461520 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:43:01.285
WAIT #1: nam='SQL*Net message from dblink' ela= 105859385 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
*** 2005-01-10 11:44:48.200
WAIT #1: nam='SQL*Net message from dblink' ela= 104397696 p1=675562835 p2=1 p3=0
WAIT #1: nam='SQL*Net message to dblink' ela= 4 p1=675562835 p2=1 p3=0
。。。 。。。
现在我们来dump 系统状态(systemstate),看看更详细的信息。
首先简单的介绍一下 event systemstate。
很多人把 systemstate 事件理解为dump发生的那一刻的系统内所有进程的信息,这是个错误的概念,事实上,
转储 system state 产生的跟踪文件是从dump那一刻开始到dump任务完成之间一段事件内的系统内所有进程的信息。
dump systemstate产生的跟踪文件包含了系统中所有进程的进程状态等信息。每个进程对应跟踪文件中的一段内容,反映该进程的状态信息,包括进程信息,会话信息,enqueues信息(主要是lock的信息),缓冲区的信息和该进程在SGA区中持有的(held)对象的状态等信息。
那么通常在什么情况下使用systemstate比较合适呢?
 Oracle推荐的使用systemstate事件的几种情况是:
- 数据库 hang 住了
 - 数据库很慢
 - 进程正在hang
 - 数据库出现某些错误
 - 资源争用
 
dump systemstate的语法为:
    ALTER SESSION SET EVENTS 'immediate trace name systemstate level 10';
也可以使用ORADEBUG实现这个功能
    ORADEBUG DUMP SYSTEMSTATE level 10
如果希望在数据库发生某种错误时除非systemstate事件,可以在参数文件(spfile或者pfile)中设置event参数,
例如,当系统发生死锁(出现ORA-00060错误)时dump systemstate:
    event = "60 trace name systemstate level 10"
 
言归正传,我们dump系统状态:
SQL> ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 8'; 
Session altered.
SQL> host
ora9i@cs_dc02:/ora9i >cd /ora9i/app/oracle/admin/csmisc/udump
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > ll -ctl
-rw-r-----   1 ora9i      dba        1070863 Jan 10 13:02 csmisc2_ora_22580.trc
-rw-r-----   1 ora9i      dba        1345368 Jan 10 13:01 csmisc2_ora_22568.trc
-rwxrwxrwx   1 ora9i      dba          44114 Jan 10 12:52 ass1015.awk
-rw-r-----   1 ora9i      dba         407133 Jan  7 15:10 csmisc2_ora_2708.trc
-rw-r-----   1 ora9i      dba            640 Jan  7 14:48 csmisc2_ora_835.trc
-rw-r-----   1 ora9i      dba           1590 Dec 30 22:50 csmisc2_ora_16244.trc
-rw-r-----   1 ora9i      dba        1308403 Dec 30 22:44 csmisc2_ora_16033.trc
-rw-r-----   1 ora9i      dba            616 Dec 28 14:16 csmisc2_ora_2176.trc
-rw-r-----   1 ora9i      dba            644 Dec  8 18:22 csmisc2_ora_21083.trc
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump >
ora9i@cs_dc02:/ora9i/app/oracle/admin/csmisc/udump > mailx -s "22568" zhangdp@aspire-tech.com < csmisc2_ora_22568.trc
这个跟踪文件很大(因为它包含了所有进程的信息),那么我们从哪里开始看起呢?
首先,通过在跟踪文件中查找字符串"waiting for 'library cache lock'",我们找到了被阻塞进程的信息:
PROCESS 28: ----------------被阻塞的Oracle进程,这里PROCESS 28对应了V$PROCESS中的PID的值,
    也就是说我们可以根据这一信息在V$PROCESS和V$SESSION找到被阻塞的会话的信息
  ----------------------------------------
  SO: c000000109c83bf0, type: 2, owner: 0000000000000000, flag: INIT/-/-/0x00
  (process) Oracle pid=28, calls cur/top: c00000010b277890/c00000010b277890, flag: (0) -
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 17 24 6
              last post received-location: ksusig
              last process to post me: c000000109c840f8 25 0
              last post sent: 0 0 15
              last post sent-location: ksasnd
              last process posted by me: c000000109c7ff90 1 6
    (latch info) wait_event=0 bits=0
    Process Group: DEFAULT, pseudo proc: c000000109eefda0
    O/S info: user: ora9i, term: pts/th, ospid: 22580  ----------------该进程的操作系统进程号,对应于V$PROCESS中的SPID
    OSD pid info: Unix process pid: 22580, image: oracle@cs_dc02 (TNS V1-V3)
    ----------------------------------------
    SO: c000000109f02c68, type: 4, owner: c000000109c83bf0, flag: INIT/-/-/0x00
    (session) trans: 0000000000000000, creator: c000000109c83bf0, flag: (100041) USR/- BSY/-/-/-/-/-
              DID: 0002-001C-00000192, short-term DID: 0000-0000-00000000
              txn branch: 0000000000000000
              oct: 0, prv: 0, sql: c00000011f8ea068, psql: c00000011f8ea068, user: 50/PUBUSER
    O/S info: user: ora9i, term: , ospid: 22536, machine: cs_dc02
              program: sqlplus@cs_dc02 (TNS V1-V3)
    application name: SQL*Plus, hash value=3669949024
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
                handle address=c000000122e2a6d8, lock address=c00000011a449e20, 100*mode+namespace=515
。。。 。。。
    SO: c00000010b277890, type: 3, owner: c000000109c83bf0, flag: INIT/-/-/0x00
    (call) sess: cur c000000109f02c68, rec 0, usr c000000109f02c68; depth: 0
      ----------------------------------------
      SO: c00000011a449e20, type: 51, owner: c00000010b277890, flag: INIT/-/-/0x00
      LIBRARY OBJECT LOCK: lock=c00000011a449e20 handle=c000000122e2a6d8 request=S
      call pin=0000000000000000 session pin=0000000000000000
      htl=c00000011a449e90[c00000011a4bc350,c00000011a4bc350] htb=c00000011a4bc350
      user=c000000109f02c68 session=c000000109f02c68 count=0 flags=[00] savepoint=463
      the rest of the object was already dumped
。。。 。。。
请注意下面的信息:
    waiting for 'library cache lock' blocking sess=0x0 seq=18589 wait_time=0
                handle address=c000000122e2a6d8, lock address=c00000011a449e20, 100*mode+namespace=515
这段信息告诉我们ORACLE PID为 28的进程(PROCESS 28),正在等待'library cache lock' ,通过‘handle address=c000000122e2a6d8’我们可以找到阻塞它的会话的ORACLE PID信息。
还要注意这段信息:
      LIBRARY OBJECT LOCK: lock=c00000011a449e20 handle=c000000122e2a6d8 request=S
      call pin=0000000000000000 session pin=0000000000000000
      htl=c00000011a449e90[c00000011a4bc350,c00000011a4bc350] htb=c00000011a4bc350
      user=c000000109f02c68 session=c000000109f02c68 count=0 flags=[00] savepoint=463
这里就是阻塞PROCESS 28进程的会话的信息。
简单的记住这个依据的要点是:
waiting session的'handle address'的值对应于blocking session的'handle'的值。
回过头来,看看这个值,它应于上面我们在V$SESSION_WAIT中看到的P1和P2的值:
SQL> select to_number('C000000122E2A6D8','XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX') from dual;
TO_NUMBER('C000000122E2A6D8','XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX')
----------------------------------------------------------------
                                                      1.3835E+19
SQL>
问题的成因已经基本上明确了,这里推荐两种解决问题的方法:
方法1,根据 c000000122e2a6d8 地址,我们可以得到当前在library cache中相应的锁信息:               
SQL> l
  1  select INST_ID,USER_NAME,KGLNAOBJ,KGLLKSNM,KGLLKUSE,KGLLKSES,KGLLKMOD,KGLLKREQ,KGLLKPNS,KGLLKHDL
  2* from X$KGLLK where KGLLKHDL = 'C000000122E2A6D8' order by KGLLKSNM,KGLNAOBJ
SQL> /
   INST_ID USER_NAME     KGLNAOBJ                 KGLLKSNM KGLLKUSE         KGLLKSES       KGLLKMOD   KGLLKREQ KGLLKPNS         KGLLKHDL
---------- ------------- ---------------------- ---------- ---------------- ---------------- ---------- ---------- ---------------- ----------------
         2 PUBUSER       CSNOZ629926699966              30 C000000109F02C68 C000000109F02C68      0          2 00               C000000122E2A6D8
         2 PUBUSER       CSNOZ629926699966              37 C000000108C99E28 C000000108C99E28      3          0 00               C000000122E2A6D8
SQL>
按照Oracle推荐的做法,我们现在应该使用'alter system kill session'命令kill掉SID 37,
结果得到了ORA-00031错误:
SQL> alter system kill session '37,2707';
alter system kill session '37,2707'
*
ERROR at line 1:
ORA-00031: session marked for kill
SQL>
检查SID 37的状态:
SQL> set linesize 150
SQL> col program for a50
SQL> select sid,serial#,status,username,program from v$session where sid=37;
       SID    SERIAL# STATUS   USERNAME                       PROGRAM
---------- ---------- -------- ------------------------------ --------------------------------------------------
        37       2707 KILLED   PUBUSER                        sqlplus@cs_dc02 (TNS V1-V3)
SQL> 
再次证实了我们最初的想法—— 有人在执行了某个需要运行很久的DDL(多数是语句效率低,当然不排除遭遇bug的可能),
然后没等语句结束就异常退出了会话。
这个例子中我们在上面的跟踪文件已经找到了该会话对应的操作系统进程(SPID),如果在其他情况下,我们如何找到这种状态为'KILLED'
的操作系统进程号(SPID)呢?
下面给出了一个方法,可以借鉴:
SQL> l
  1  SELECT s.username,s.status,
  2  x.ADDR,x.KSLLAPSC,x.KSLLAPSN,x.KSLLASPO,x.KSLLID1R,x.KSLLRTYP,
  3  decode(bitand (x.ksuprflg,2),0,null,1)
  4  FROM x$ksupr x,v$session s
  5  WHERE s.paddr(+)=x.addr
  6  and bitand(ksspaflg,1)!=0
  7* and s.sid=37
SQL> /
USERNAME                       STATUS   ADDR               KSLLAPSC   KSLLAPSN KSLLASPO       KSLLID1R KS D
------------------------------ -------- ---------------- ---------- ---------- ------------ ---------- -- -
PUBUSER                        KILLED   C000000109C831E0         41         15 16243                17
SQL>
x$ksupr.ADDR列的值对应了V$PROCESS 中的ADDR的值,知道了这个SPID的地址,找到这个操作系统进程(SPID)就简单了,例如:
SQL> select spid,pid from v$process where addr='C000000109C831E0';
SPID                PID
------------ ----------
20552                26
SQL>
现在,我们只需要在操作系统上 kill 操作系统进程20552就可以了:
ora9i@cs_dc02:/ora9i > ps -ef  grep 20552
   ora9i 20552     1  0  Jan  8  ?         0:01 oraclecsmisc2 (LOCAL=NO)
   ora9i 14742 14740  0 17:19:02 pts/ti    0:00 grep 20552
ora9i@cs_dc02:/ora9i > kill -9 20552 
ora9i@cs_dc02:/ora9i > ps -ef  grep 20552
   ora9i 14966 14964  0 17:40:01 pts/ti    0:00 grep 20552
ora9i@cs_dc02:/ora9i > 
再来检查一下SID 37的信息,我们看到这个会话是真的被kill掉了,
ora9i@cs_dc02:/ora9i > exit
SQL> select sid,serial#,status,username,program from v$session where sid=37;
no rows selected
SQL> l
  1  SELECT s.username,s.status,
  2  x.ADDR,x.KSLLAPSC,x.KSLLAPSN,x.KSLLASPO,x.KSLLID1R,x.KSLLRTYP,
  3  decode(bitand (x.ksuprflg,2),0,null,1)
  4  FROM x$ksupr x,v$session s
  5  WHERE s.paddr(+)=x.addr
  6  and bitand(ksspaflg,1)!=0
  7* and s.sid=37
SQL> /
no rows selected
SQL>
回到刚才hang住的会话,它已经恢复了正常操作,
并且我们已经得到了'ORA-04043: object CSNOZ629926699966 does not exist'这个正常的信息:
SQL> desc CSNOZ629926699966
ERROR:
ORA-04043: object CSNOZ629926699966 does not exist
SQL>
在开一个会话,测试一把:
ora9i@cs_dc02:/ora9i > sqlplus pubuser/pubuser
SQL*Plus: Release 9.2.0.4.0 - Production on Mon Jan 10 17:42:16 2005
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.4.0 - 64bit Production
With the Partitioning and Real Application Clusters options
JServer Release 9.2.0.4.0 - Production
SQL> set timing on
SQL> desc CSNOZ629926699966
ERROR:
ORA-04043: object CSNOZ629926699966 does not exist
SQL> 
当发出命令'desc CSNOZ629926699966'的时候,我们看到系统立刻返回了ORA-04043: object CSNOZ629926699966 does not exist'信息,问题就此解决了。
这里,简单的介绍一下X$KGLLK,这个基表保存了库缓存中对象的锁的信息,它对于解决这类问题特别有用,其名称的含义如下:
  [K]ernel Layer                                                                
    [G]eneric Layer                                                             
      [L]ibrary Cache Manager  ( defined and mapped from kqlf )                 
        Object Locks                                                            
          X$KGLLK - Object [L]oc[K]s                                            
KGLNAOBJ 列包含了在librarky cache中的对象上执行命令的语句的前80个字符(其实从这里我们也可以大大缩小范围了)
X$KGLLK.KGLLKUSE 和 x$kgllk.KGLLKSES 对应于跟踪文件中的owner的值
X$KGLLK.KGLLKADR 
X$KGLLK.KGLLKHDL 对应于跟踪文件中的handle的值(handle=C000000122E2A6D8),也就是'library cache lock'的地址
X$KGLLK.KGLLKPNS 对应于跟踪文件中的session pin的值
X$KGLLK.KGLLKSPN对应于跟踪文件中的savepoint的值
我们再来看一下更全面的信息:
SQL> set linesize 2000
SQL> select * from X$KGLLK where KGLLKHDL = 'C000000122E2A6D8' order by KGLLKSNM,KGLNAOBJ                           
  2  /
ADDR                   INDX    INST_ID KGLLKADR         KGLLKUSE         KGLLKSES           KGLLKSNM KGLLKHDL         KGLLKPNC         KGLLKPNS       KGLLKCNT   KGLLKMOD   KGLLKREQ   KGLLKFLG   KGLLKSPN KGLLKHTB           KGLNAHSH KGLHDPAR           KGLHDNSP USER_NAME                  KGLNAOBJ
---------------- ---------- ---------- ---------------- ---------------- ---------------- ---------- ---------------- ---------------- ---------------- ---------- ---------- ---------- ---------- ---------- ---------------- ---------- ---------------- ---------- ------------------------------ ------------------------------------------------------------
800003FB0007E4D0         33          2 C00000011A449E20 C000000109F02C68 C000000109F02C68         30 C000000122E2A6D8 00               00                    0          0          2          0        463 C00000011A4BC350 3990848181 C000000122E2A6D8          1 PUBUSER                            CSNOZ629926699966
800003FB0007E5B0         34          2 C00000011A44A150 C000000108C99E28 C000000108C99E28         37 C000000122E2A6D8 00               00                    1          3          0          0        179 C00000011A4BB328 3990848181 C000000122E2A6D8          1 PUBUSER                            CSNOZ629926699966
SQL> set linesize 100
SQL> l
  1* select * from X$KGLLK where KGLLKHDL = 'C000000122E2A6D8' order by KGLLKSNM,KGLNAOBJ
SQL> /
ADDR                   INDX    INST_ID KGLLKADR         KGLLKUSE         KGLLKSES           KGLLKSNM
---------------- ---------- ---------- ---------------- ---------------- ---------------- ----------
KGLLKHDL         KGLLKPNC         KGLLKPNS           KGLLKCNT   KGLLKMOD   KGLLKREQ   KGLLKFLG
---------------- ---------------- ---------------- ---------- ---------- ---------- ----------
  KGLLKSPN KGLLKHTB           KGLNAHSH KGLHDPAR           KGLHDNSP USER_NAME
---------- ---------------- ---------- ---------------- ---------- ------------------------------
KGLNAOBJ
------------------------------------------------------------
800003FB0007E4D0         33          2 C00000011A449E20 C000000109F02C68 C000000109F02C68         30
C000000122E2A6D8 00               00                        0          0          2          0
       463 C00000011A4BC350 3990848181 C000000122E2A6D8          1 PUBUSER
CSNOZ629926699966
800003FB0007E5B0         34          2 C00000011A44A150 C000000108C99E28 C000000108C99E28         37
C000000122E2A6D8 00               00                        1          3          0          0
       179 C00000011A4BB328 3990848181 C000000122E2A6D8          1 PUBUSER
CSNOZ629926699966
SQL> 
————待续————

请扫描二维码添加站长微信