Reposted from http://blog.****.net/tianlesoftware/article/details/6534239
Introduction
This post is about Oracle RAC 10g. Note the version number: the behavior described here may differ in other releases. Cache Fusion technology was partially implemented in Oracle 8i in OPS (Oracle Parallel Server). Before Oracle 8i the situation was different. In a multi-instance Oracle Parallel Server, if one instance asked for a block of data that was currently modified by another instance of the same database, the holding instance had to write the block to disk so that the requesting instance could read it. This is called a “Disk Ping”, and it greatly affected the performance of the database. With Oracle 8i, partial Cache Fusion was implemented.
Concept of cache fusion
Cache Fusion is essentially about fusing the buffer caches of multiple instances into one single cache. For example, suppose two instances in a RAC use the same datafiles and each has its own buffer cache in its own SGA. Cache Fusion makes the database behave as if it had a single instance whose buffer cache is the sum of the buffer caches of all the instances.
This behavior is possible because of the high-speed interconnect that links every instance in the cluster to every other instance. The interconnect makes it possible to share memory between two or more servers. Previously only the datafiles could be shared; with the interconnect, cache memory can be shared as well.
We will discuss the following topics before discussing Cache Fusion itself:
(1)Cache Coherency
(2)Multi-Version consistency model
(3)Resource Co-ordination – Synchronization
(4)Global Cache Service (GCS)
(5)Global Enqueue Service
(6)Global Resource Directory
(7)GCS resource modes and roles
(8)Past Images
(9)Block access modes and buffer states
2.1 Cache Coherency
Consider this scenario in a single-instance database: someone reads and changes some rows in a data block but does not commit. A second user then issues a SQL statement that reads the same data block. Oracle makes a copy of the data block and uses the undo information from the UNDO tablespace to roll the copy back, so the second user sees correct, consistent data. This is called maintaining consistency of data.
Now consider a multi-instance environment, where copies of the same data block may exist in different instances. Maintaining the consistency of data blocks across the buffer caches of multiple instances is called “Cache Coherency”.
2.2 Multi-Version consistency model
The multi-version consistency model distinguishes between the current version of a data block and one or more read-consistent versions of it. The current block is the one that contains all the changes, committed as well as uncommitted. For example, a user fires a DML statement against a data block that is not present in any instance. The block is read from disk into the buffer cache, where the value is changed. The user then commits and fires another DML statement against the same data block. Now the data block is dirty and contains committed as well as uncommitted changes.
Suppose this data block is requested by another user for reading. Oracle makes a copy, applies undo information to produce a Consistent Read “CR” copy of the block, and ships it to the requesting instance. Thus there are multiple versions of the same data block, each consistent with respect to the user who requested it.
During the course of operation there can be many more versions of the same data block, each consistent with respect to some point in time.
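To make the idea of a CR copy concrete, here is a minimal Python sketch. It is a toy model under stated assumptions, not Oracle's implementation: the `Block` class, its `undo` list and the integer SCN values are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy data block: current row values plus an undo log (illustrative only)."""
    rows: dict
    scn: int = 0
    # Each undo entry is (scn_of_change, row_key, value_before_change).
    undo: list = field(default_factory=list)

    def update(self, key, value, scn):
        # Remember the old value so the change can be rolled back later.
        self.undo.append((scn, key, self.rows.get(key)))
        self.rows[key] = value
        self.scn = scn

    def cr_copy(self, query_scn):
        """Return the row values as of query_scn; the current block is untouched."""
        rows = dict(self.rows)
        # Roll back, newest first, every change made after query_scn.
        for scn, key, old in reversed(self.undo):
            if scn > query_scn:
                rows[key] = old
        return rows


current = Block(rows={"sales_rank": 30}, scn=100)
current.update("sales_rank", 24, scn=110)
current.update("sales_rank", 40, scn=120)

print(current.cr_copy(105))  # version seen by a query that started at SCN 105
print(current.cr_copy(115))  # version seen by a query that started at SCN 115
print(current.rows)          # the current version
```

Each call to `cr_copy` yields a different version of the same block, one per point in time, while the current version keeps all the changes.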
2.3 Resource Co-ordination – Synchronization
In a multi-instance system such as RAC, where the same resources (for example, data blocks) are used concurrently, effective synchronization is required to maintain consistency. Within the shared cache, the co-ordination of concurrent tasks is called synchronization. The synchronization provided by Oracle RAC gives cluster-wide concurrency control over resources and in turn ensures the integrity of shared data. Synchronization within the cache has a cost, though. At the lowest level, synchronization is just a data copy or data transfer operation.
According to Oracle studies, accessing a block in the local cache is much faster than accessing a block in another instance's cache within the cluster. A local cache access is an in-memory copy, whereas fetching from another instance's cache requires a transfer over the high-speed interconnect, which is slower than an in-memory copy. Slowest of all is a read from disk, which is much slower than either of the other two.
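The cost ordering above (local cache, then another instance's cache, then disk) can be sketched in a few lines of Python. This is only an illustration: `locate_block` and the dictionaries standing in for caches and disk are hypothetical, and a real instance asks the Global Cache Service where a block is rather than probing other caches itself.

```python
def locate_block(block_id, local_cache, remote_caches, disk):
    """Return (source, data) from the cheapest place that holds the block."""
    if block_id in local_cache:
        return "local cache", local_cache[block_id]
    for inst, cache in remote_caches.items():
        if block_id in cache:
            # Transfer over the interconnect, then keep a local copy.
            local_cache[block_id] = cache[block_id]
            return f"interconnect from instance {inst}", cache[block_id]
    # No instance has the block cached: fall back to a disk read.
    local_cache[block_id] = disk[block_id]
    return "disk", disk[block_id]


disk = {75011: "row data"}
cache1, cache2 = {}, {}

first = locate_block(75011, cache1, {2: cache2}, disk)   # nobody has it yet
second = locate_block(75011, cache2, {1: cache1}, disk)  # instance 1 has it
third = locate_block(75011, cache2, {1: cache1}, disk)   # now cached locally
print(first[0], "|", second[0], "|", third[0])
```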
2.4 Global Cache Service
Global Cache Service (GCS) is the main component of Oracle Cache Fusion technology. It is represented by the background processes LMSn. There can be at most 10 LMS processes per instance. The main function of GCS is to track the status and location of data blocks. The status of a data block means its mode and role (both are explained in section 2.7). GCS is the main mechanism by which cache coherency among the multiple caches is maintained. GCS is also responsible for block transfers between instances.
2.5 Global Enqueue Service
Global Enqueue Service (GES) tracks the status of all Oracle enqueuing mechanisms. This covers all non-Cache Fusion inter-instance operations. GES performs concurrency control on dictionary cache locks, library cache locks and transactions. It does this for resources that are accessed by more than one instance.
Enqueue services are also present in a single-instance database, where they are responsible for locking rows in a table using different locking modes.
2.6 Global Resource Directory
GES and GCS together maintain the Global Resource Directory (GRD). The GRD is like an in-memory database that contains details about all the blocks present in the caches. The GRD knows the location of the latest version of a block, its mode, its role (mode and role are discussed shortly), and so on. Whenever a user asks for a data block, GCS gets all of this information from the GRD. The GRD is a distributed resource, meaning that each instance maintains part of it. This distributed nature is key to the fault tolerance of RAC. The GRD is stored in the SGA.
Typically the GRD contains the following information, among other things:
(1)Data Block Address – This is the address of data block being modified
(2)Location of most current version of data block
(3)Modes of data block
(4)Roles of data block
(5)SCN number of data block
(6)Image of data block – Could be current image or past image.
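As a rough picture of a distributed directory, the Python sketch below keeps one entry per block and spreads the entries over the instances. Everything here is an assumption made for illustration: the `ResourceEntry` and `Directory` names, the field names, and especially the modulo rule used to pick the mastering instance, since the real mapping is internal to Oracle.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceEntry:
    """One hypothetical GRD entry; the fields follow the list above."""
    dba: int                 # data block address
    current_holder: int      # instance holding the most current version
    mode: str = "N"          # N, S or X
    role: str = "L"          # L (local) or G (global)
    scn: int = 0
    past_images: list = field(default_factory=list)  # instances holding a PI


class Directory:
    def __init__(self, instances):
        self.instances = sorted(instances)
        # Each instance stores only its own slice of the directory.
        self.slices = {i: {} for i in self.instances}

    def master_of(self, dba):
        # Assumption: plain modulo, chosen only to keep the sketch simple.
        return self.instances[dba % len(self.instances)]

    def register(self, entry):
        self.slices[self.master_of(entry.dba)][entry.dba] = entry

    def lookup(self, dba):
        return self.slices[self.master_of(dba)].get(dba)


grd = Directory(instances=[1, 2, 3])
grd.register(ResourceEntry(dba=75011, current_holder=3, mode="S"))
grd.register(ResourceEntry(dba=75010, current_holder=1, mode="X"))
print(grd.master_of(75011), grd.lookup(75011).current_holder)
```

Because every instance holds only part of the directory, losing one instance loses only the entries it mastered, which is the fault-tolerance point made above.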
2.7 GCS resource modes and roles
The mode of a data block is decided by whether a resource holder intends to modify the data or to read it. The modes are as follows:
(1)Null (N) Mode: Null mode is the least restrictive mode. It indicates no access rights and acts as a placeholder.
(2)Shared (S) Mode: Shared mode indicates that the data block is being read and not modified. Other sessions can still read the data block.
(3)Exclusive (X) Mode: Exclusive mode indicates exclusive access to the block. No other resource holder can write to the data block, although others can still perform consistent reads on it.
GCS resources also have roles. The roles are as follows:
(1)Local: When a data block is first read into an instance from disk it has the local role, meaning that only one copy of the data block exists in cache. No other instance's cache has a copy of this block.
(2)Global: The global role indicates that multiple copies of the data block exist across the clustered instances and that the data block was dirty when it was shipped to another instance. For example, a user connected to one of the instances requests a data block. The block is read from disk into that instance, and the role granted is local. If another instance then requests the same block, the block is copied to the requesting instance; if the block had been modified before it was shipped, the role changes to global.
This role and mode information is maintained in GRD (Global Resource Directory) by GCS (Global Cache Service).
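The walkthrough later in this post writes a resource state as a three-part shorthand such as SL0 or NG1 (mode, role, number of past images). The Python sketch below models that shorthand and the compatibility of modes. The `ResourceState` class and the `compatible` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

MODES = {"N": "Null", "S": "Shared", "X": "Exclusive"}
ROLES = {"L": "Local", "G": "Global"}


@dataclass(frozen=True)
class ResourceState:
    mode: str          # "N", "S" or "X"
    role: str          # "L" or "G"
    past_images: int   # number of past images held

    def __post_init__(self):
        if self.mode not in MODES or self.role not in ROLES or self.past_images < 0:
            raise ValueError(f"invalid state: {self}")

    @classmethod
    def parse(cls, text):
        """Parse shorthand such as 'SL0' or 'NG1'."""
        return cls(text[0], text[1], int(text[2:]))

    def __str__(self):
        return f"{self.mode}{self.role}{self.past_images}"

    def describe(self):
        return f"{MODES[self.mode]}, {ROLES[self.role]}, {self.past_images} past image(s)"


def compatible(a, b):
    """True if two instances may hold the same block in modes a and b at once."""
    return "N" in (a, b) or (a == "S" and b == "S")


print(ResourceState.parse("NG1").describe())
print(compatible("S", "S"), compatible("S", "X"))
```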
2.8 Past Images
The Past Image concept was introduced in Oracle 9i to maintain data integrity. In an Oracle database, a block is typically not written to disk immediately after it is dirtied, in order to reduce excessive I/O. When the same dirty block is requested by another instance for writing, an image of the block is kept in the owning instance and the block is then shipped to the requesting instance. This image copy of the block is called a Past Image (PI). In the event of a failure Oracle can reconstruct the block by reading PIs. A block can have more than one PI, depending on how many times it was requested while dirty.
A past image of a block is different from a CR (consistent read) image. A past image is needed in order to create a CR image, which is done by applying undo data.
“Juggling” Data with Multiple Past Images
(1)Multiple Past Image versions of a data block may be kept by different instances
(2)Upon a checkpoint, only the current image is written to disk; Past Images are discarded
(3)In the event of a failure, current version of block can be reconstructed from PIs
(4)Since PIs are kept in memory, they aid in avoiding frequent disk writes
(5)This avoids “disk pinging” experienced with 8i OPS due to frequent writes to disk
(6)Data is “juggled” in memory, without touching down on the disk
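The points above can be sketched as a small Python model of one block moving between instances. It is a simplification under stated assumptions: the `Cluster` class is hypothetical, it tracks a single block, and it only counts disk writes to show that the block is "juggled" in memory until a checkpoint.

```python
class Cluster:
    """Toy model of past images for a single block (illustrative only)."""

    def __init__(self, value):
        self.disk = value          # on-disk version of the block
        self.holder = None         # instance holding the current version
        self.current = None        # current in-memory value
        self.dirty = False
        self.past_images = {}      # instance -> value kept as a PI
        self.disk_writes = 0

    def write(self, instance, value):
        if self.holder is None:
            self.current = self.disk             # first access: read from disk
        elif self.holder != instance and self.dirty:
            # The dirty block leaves its holder: keep a PI there, skip the disk.
            self.past_images[self.holder] = self.current
        self.holder, self.current, self.dirty = instance, value, True

    def checkpoint(self):
        # Only the current image goes to disk; PIs are no longer needed.
        self.disk = self.current
        self.disk_writes += 1
        self.dirty = False
        self.past_images.clear()


c = Cluster(30)
c.write(2, 24)
c.write(1, 40)
c.write(3, 50)
before = (dict(c.past_images), c.disk, c.disk_writes)
print(before)                      # PIs on instances 2 and 1, disk untouched
c.checkpoint()
print(c.past_images, c.disk, c.disk_writes)
```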
2.9 Block access modes and buffer states
An additional concurrency control concept is the buffer state which is the state of a buffer in the local cache of an instance. The buffer state of a block relates to the access mode of the block. For example, if a buffer state is exclusive current (XCUR), an instance owns the resource in exclusive mode.
To see a buffer’s state, query the “status” column of the V$BH dynamic performance view.
The V$BH view relates block access modes to buffer state names as follows:
(1)With a block access mode of NULL the buffer state name is CR — An instance can perform a consistent read of the block. That is, if the instance holds an older version of the data.
(2)With a block access mode of S the buffer state name is SCUR — An instance has shared access to the block and can only perform reads.
(3)With a block access mode of X the buffer state name is XCUR –An instance has exclusive access to the block and can modify it.
(4)With a block access mode of NULL the buffer state name is PI — An instance has made changes to the block but retains copies of it as past images to record its state before changes.
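(Note: a data block has only three access modes, NULL, S and X, while the corresponding buffer can be in several states, such as CR, SCUR, XCUR and PI. The block's mode reflects how the current instance holds the block, whereas the buffer state reflects the block's status within Cache Fusion as a whole.)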
Only the SCUR and PI buffer states are Real Application Clusters-specific. There can be only one copy of any one block buffered in the XCUR state in the cluster database at any time. To modify a block, a process must assign an XCUR buffer state (XCUR is a state of the buffer) to the buffer containing the data block.
For example, if another instance requests read access to the most current version of the same block, then Oracle changes the access mode from exclusive to shared (it is downgraded to shared because only shared mode allows others to read), sends a current read version of the block to the requesting instance, and keeps a PI buffer if the buffer contained a dirty block (so a PI is created even when the other instance only requested a read).
At this point, the first instance has the current block and the requesting instance also has the current block in shared mode. Therefore, the role of the resource becomes global. There can be multiple shared current (SCUR) versions of this block cached throughout the cluster database at any time.
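The mapping between access modes and buffer states, and the rule that only one XCUR copy may exist cluster-wide, can be written down as a short Python sketch. Both function names are hypothetical, and the status strings are the lowercase values that appear in the V$BH output later in this post.

```python
def buffer_state(mode, is_past_image=False):
    """Map a block access mode to the buffer state name described above."""
    if mode == "X":
        return "xcur"
    if mode == "S":
        return "scur"
    if mode == "N":
        # NULL mode covers two states: an old copy kept as a PI, or a CR copy.
        return "pi" if is_past_image else "cr"
    raise ValueError(f"unknown access mode: {mode!r}")


def single_xcur(buffers):
    """buffers: list of (instance, status) for one block across the cluster."""
    return sum(1 for _, status in buffers if status == "xcur") <= 1


observed = [(1, "cr"), (2, "cr"), (2, "xcur")]
print(buffer_state("S"), buffer_state("N"), buffer_state("N", is_past_image=True))
print(single_xcur(observed))
```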
Block transfer using Cache Fusion
Let's consider a very detailed example of how block transfer happens between different instances. For this example I assume a 3-node RAC system, and I also assume that every DML statement is followed by a commit. So when I say that a user executed an update, I mean update + commit. There is no checkpoint until the end, however.
Stage 1
In stage 1 a data block is requested by user C, who is connected to instance 3. The data block is therefore read into the buffer cache of instance 3.
SQL>select sales_rank from salesman where salesid = 10;
Assume this returns a value of 30. The block is read for the first time and is not present in any other instance, so the role of the block is LOCAL and the block is read in SHARED mode. There are also NO PAST IMAGES. We describe this stage as instance 3 holding the block in SL0 mode (SHARED, LOCAL, 0 PAST IMAGES).
Stage 2
In stage 2 user B issues the same select statement against the salesman table. Instance 2 needs the same block, so the block is shipped from instance 3 to instance 2 over the Cache Fusion interconnect. There is no disk read at this time. Both instances hold the block in SHARED mode (S) and the role is LOCAL (L). Note that even though the block is present in more than one instance, the role is still local because the block has not yet been dirtied. Had the block been dirty when another instance requested it, the role would have changed to global.
Stage 3
In stage 3 user B decides to update the row and commit on instance 2. The new sales rank is 24. At this stage, instance 2 acquires an EXCLUSIVE lock to update the data, and the SHARED lock on instance 3 is downgraded to a NULL lock.
SQL>update salesman set sales_rank = 24 where salesid = 10;
SQL>commit;
So instance 2 holds the block in mode XL0 (EXCLUSIVE, LOCAL, 0 past images) and instance 3 holds a NULL lock, which is just a placeholder. The role of the block is still LOCAL because the block has been dirtied only on instance 2 and no other instance holds a dirty copy. If another instance now tries to update the same block, the role will change to global.
Stage 4
In stage 4 user A decides to update the same row, and hence the same block, on instance 1, setting the sales rank to 40. Instance 1 finds that the block is dirty in instance 2. The data block is therefore shipped from instance 2 to instance 1; a PAST IMAGE of the data block is kept on instance 2, and the lock mode on instance 2 is downgraded to NULL with a GLOBAL role. Instance 2 now has NG1 (NULL lock, GLOBAL role, 1 PAST IMAGE). Instance 1 now has an EXCLUSIVE lock with GLOBAL role (XG0).
Stage 5
User C executes a select statement on instance 3 against the same row. The data block on instance 1 is the most recent copy (the GRD (Global Resource Directory) records which instance holds the latest copy of a data block), so it is shipped to instance 3. As a result the lock on instance 1 is converted to SHARED GLOBAL with 1 PAST IMAGE (SG1). The lock becomes SHARED rather than NULL because instance 3 asked for a shared lock (to read data) and not an exclusive lock (to update data). Had instance 3 asked for an exclusive lock, instance 1 would have been left with a NULL lock.
Instance 3 will now hold SG0 (SHARED, GLOBAL with 0 PAST IMAGES). (Reposter's note: the text says that when user C runs a select on instance 3, the data block is shipped from instance 1 to instance 3, instance 1 creates a PI, and its access mode is downgraded from X to S. My tests show that this is not correct. In my environment I first updated a data block on instance 1; that instance's buffer cache then held two buffers for the block, one in XCUR state and one in CR state. When I then selected the block on instance 2, the X mode on instance 1 was not downgraded to S and no PI was created, and the buffer on instance 2 was in CR state rather than SCUR. My guess is that Oracle simply performed a CR read of the data block for instance 2, so no dirty block was shipped and no PI was produced. The root cause may be that an ordinary SELECT does not take a shared lock. I do not know how the original text obtained a shared lock.)
Stage 6
User B issues the same select statement against the salesman table on instance 2. Instance 2 requests a consistent copy of the buffer from another instance, which happens to be the current master.
Instance 1 therefore ships the block to instance 2, where it is acquired in SG1 (SHARED, GLOBAL with 1 PAST IMAGE). So the mode on instance 2 becomes SG1.
Stage 7
User C on instance 3 updates the same row. Instance 3 therefore requires an exclusive lock, and instance 1 and instance 2 are downgraded to a NULL lock with GLOBAL role and 1 PAST IMAGE (NG1). Instance 3 holds an EXCLUSIVE lock with GLOBAL role and no PAST IMAGES (XG0).
Stage 8
A checkpoint is initiated and a “Write to Disk” takes place on instance 3. As a result the previous past images are discarded (they are no longer required for recovery) and instance 3 holds the block with an EXCLUSIVE lock and LOCAL role with no PAST IMAGES (XL0).
From then on, if any instance wants to read or write the same block, a copy will again be shipped from instance 3.
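The eight stages can be replayed with a small Python state machine. This is a sketch of the narrative above, not of Oracle's actual algorithm: the `BlockResource` class and its rules are assumptions chosen so that the printed states match the stages as described (including stage 5, which the reposter's note disputes for plain SELECT statements). The role is tracked once per block, so instances the text does not mention simply show the block-wide role.

```python
class BlockResource:
    """Toy state machine for one block across a cluster (not Oracle's algorithm)."""

    def __init__(self, instances):
        self.mode = {i: None for i in instances}   # None = never cached there
        self.pi = {i: 0 for i in instances}        # past images per instance
        self.role = "L"
        self.dirty = False

    def _x_holder(self):
        return next((i for i, m in self.mode.items() if m == "X"), None)

    def _ship_from(self, holder):
        # A dirty block leaving its exclusive holder leaves a past image behind.
        if holder is not None and self.dirty:
            self.pi[holder] += 1
            self.role = "G"

    def read(self, inst):
        holder = self._x_holder()
        if holder not in (None, inst):
            self._ship_from(holder)
            self.mode[holder] = "S"        # reader asked for S, holder keeps S
        if self.mode[inst] != "X":
            self.mode[inst] = "S"

    def write(self, inst):
        holder = self._x_holder()
        if holder != inst:
            self._ship_from(holder)
        for other, m in self.mode.items():
            if other != inst and m is not None:
                self.mode[other] = "N"     # others keep only a placeholder
        self.mode[inst] = "X"
        self.dirty = True

    def checkpoint(self):
        # The current image is written to disk; past images are discarded.
        self.pi = {i: 0 for i in self.pi}
        self.role = "L"
        self.dirty = False

    def state(self, inst):
        m = self.mode[inst]
        return "-" if m is None else f"{m}{self.role}{self.pi[inst]}"


r = BlockResource([1, 2, 3])
steps = [("read", 3), ("read", 2), ("write", 2), ("write", 1),
         ("read", 3), ("read", 2), ("write", 3)]
history = []
for n, (op, inst) in enumerate(steps, start=1):
    getattr(r, op)(inst)
    history.append({i: r.state(i) for i in (1, 2, 3)})
    print(f"stage {n}:", history[-1])
r.checkpoint()
history.append({i: r.state(i) for i in (1, 2, 3)})
print("stage 8:", history[-1])
```

A "-" means the instance has never cached the block. The PI rule used here (a PI is left only when a dirty block leaves an exclusive holder) is the simplest one that reproduces the counts in the text.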
Test on my RAC
We have a two-node RAC, and we will check block transfer using the table below.
SQL> select id1, DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) from t;
ID1 DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
---------- ------------------------------------
1 75010
2 75010
3 75010
4 75010
5 75010
6 75010
7 75010
8 75011
9 75011
10 75011
11 75011
12 75011
We use the row id1=8 which is in block 75011.
Stage 1
We issue alter system flush buffer_cache on both nodes, then issue the statement below on instance 1.
SQL> select id1, DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) from t where id1=8;
ID1 DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
---------- ------------------------------------
8 75011
Because we cleared the buffer cache first, Oracle reads this block from disk. The block is read in S mode, with the local role and no PI, so instance 1 now holds it as SL0. As noted earlier, a block access mode of S corresponds to the SCUR buffer state, in which an instance has shared access to the block and can only perform reads. So the buffer state should now be SCUR, and the queries below confirm that it is scur.
SQL> select
INST_ID,
o.object_name,
b.status,
FILE# ,
BLOCK#
from gv$bh b , dba_objects o
where b.objd = o.data_object_id and o.object_name = 'T' and block#=75011 and b.status <> 'free' order by b.status,INST_ID,BLOCK# ;
INST_ID OBJECT_NAME STATUS FILE# BLOCK#
---------- ------------------ ------- ---------- ----------
1 T scur 1 75011
SQL> select
INST_ID,
o.object_name,
decode(state,0,'free',1,'xcur',2,'scur',3,'cr', 4,'read',5,'mrec',6,'irec',7,'write',8,'pi') state,
dbarfil,
dbablk,
ba
from x$bh b , dba_objects o
where b.obj = o.data_object_id and o.object_name = 'T' and dbablk=75011 and state <> 0 order by state;
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ------------------ ----- ---------- ---------- ----------------
1 T scur 1 75011 000000039779A000
Note that we filter out free buffers to make the result clearer.
Stage 2
I issue the same select statement on instance 2.
SYS@cngracs2 > select id1, DBMS_ROWID.ROWID_BLOCK_NUMBER(rowid) from t where id1=8;
ID1 DBMS_ROWID.ROWID_BLOCK_NUMBER(ROWID)
---------- ------------------------------------
8 75011
Instance 2 requests the block in S mode, and the block is shipped to instance 2 from instance 1. Instance 1 now holds block 75011 as SL0: still no PI, and still the local role because the block was not dirty before it was shipped. Instance 2 also holds it as SL0.
V$BH query as below
select
INST_ID,
o.object_name,
b.status,
FILE# ,
BLOCK#
from gv$bh b , dba_objects o
where b.objd = o.data_object_id and o.object_name = 'T' and block#=75011 and b.status <> 'free' order by b.status,INST_ID,BLOCK# ;
INST_ID OBJECT_NAME STATUS FILE# BLOCK#
---------- ------------------ ------- ---------- ----------
1 T scur 1 75011
2 T scur 1 75011
X$BH on instance 1 is
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ------------------ ----- ---------- ---------- ----------------
1 T scur 1 75011 000000039779A000
X$BH on instance 2 is
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T scur 1 75011 000000039EC7C000
Stage 3
I update the block on instance 2 with the statement below.
SYS@cngracs2 > update t set text='good' where id1=8;
1 row updated.
SYS@cngracs2 > commit;
Commit complete.
Instance 2 requests the data block in X mode, so the block mode on instance 1 is downgraded to NULL. As noted earlier, a block access mode of NULL corresponds to the CR buffer state, in which an instance can perform a consistent read of the block if it holds an older version of the data, so the buffer state on instance 1 is CR. Instance 2 holds the block in X mode, so its buffer state is XCUR. Instance 1 now holds the block as NL0 and instance 2 holds it as XL0.
V$BH query as below
INST_ID OBJECT_NAME STATUS FILE# BLOCK#
---------- -------------------- ------- ---------- ----------
1 T cr 1 75011
2 T cr 1 75011
2 T xcur 1 75011
X$BH on instance 2 query as below
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T xcur 1 75011 0000000385724000
V$BH shows that the buffer state on instance 1 is cr, which is no surprise. But it shows that instance 2 also has a cr buffer. It looks as if the cr buffer on instance 2 was downgraded from the scur buffer of stage 2. This suggests that when the update ran on instance 2, Oracle made a new copy of the block there and updated that copy. My guess is that this is what Oracle does for an update: it creates a new buffer and changes that, never the original one. The X$BH query supports this guess: the original buffer 000000039EC7C000 is now in cr state, and the xcur buffer is a new one.
Stage 4
In this stage we test PI. When a dirty block is shipped to another instance, a PI is generated and the role changes to global. We plan to test this as follows.
First, update on instance 2 with the statement below.
update t set text='good' where id1=8; commit;
Then quickly update on instance 1 with the statement below. (You have to be quick, otherwise a checkpoint may occur and the block will no longer be dirty.)
update t set text='wonderful' where id1=8; commit;
This causes the dirty block to be shipped to instance 1, so instance 2 will have a PI.
Let's do it.
Issue the statements below on instance 2, then quickly go to the next step.
SYS@cngracs2 > update t set text='good' where id1=8;
1 row updated.
SYS@cngracs2 > commit;
Commit complete.
Then issue the X$BH query on instance 1 and quickly go to the next step.
SQL> /
no rows selected
It is fine that no rows are returned. It means the buffer on instance 1 has been reused.
Then issue the X$BH query on instance 2 and quickly go to the next step.
SYS@cngracs2 > /
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T xcur 1 75011 0000000385724000
SYS@cngracs2 > /
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T cr 1 75011 0000000385724000
2 T xcur 1 75011 000000039FAE2000
OK. The first / is the one we issued in the previous stage; the second / is the one we just issued here, so that the results can be compared. We can see that after the update on instance 2 a new buffer appears in xcur state, and the original xcur buffer 0000000385724000 has been downgraded to cr again. This supports our earlier guess: Oracle creates a new buffer for the update and never changes the original one. Let's continue.
Quickly issue the statements below on instance 1 and go to the next step.
SQL> update t set text='good' where id1=8;
1 row updated.
Elapsed: 00:00:00.01
SQL> commit;
Commit complete.
Notice that this causes block 75011 in buffer 000000039FAE2000 to be shipped to instance 1 and a PI to be generated.
Then run the X$BH query on instance 2 to check.
SYS@cngracs2 > /
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T xcur 1 75011 0000000385724000
SYS@cngracs2 > /
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T cr 1 75011 0000000385724000
2 T xcur 1 75011 000000039FAE2000
SYS@cngracs2 > /
INST_ID OBJECT_NAME STATE DBARFIL DBABLK BA
---------- ---------------------------------------- ----- ---------- ---------- ----------------
2 T cr 1 75011 000000039EC7C000
2 T cr 1 75011 0000000385724000
2 T pi 1 75011 000000039FAE2000
The last / is the one we just issued. We can see that there is now a PI, and the pi buffer address is 000000039FAE2000, the buffer that was previously in xcur state. So Oracle changes the buffer state to PI and copies the block to the other instance. Keep in mind that the PI will be discarded at the next checkpoint.
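The buffer-state changes observed on instance 2 during this test can be summarized in a short Python sketch. It only mirrors the pattern seen in the X$BH output above and makes no claim about Oracle internals: `InstanceCache` is a hypothetical class, and small integers stand in for the buffer addresses.

```python
import itertools


class InstanceCache:
    """Toy buffer cache for ONE block on one instance (mirrors the observations)."""

    def __init__(self):
        self.buffers = []                  # each entry: [buffer_id, state]
        self._ids = itertools.count(1)     # stand-in for buffer addresses

    def _new(self, state):
        self.buffers.append([next(self._ids), state])

    def select(self):
        # Cache a shared current copy unless a current copy already exists.
        if not any(state in ("scur", "xcur") for _, state in self.buffers):
            self._new("scur")

    def update(self):
        # Observed: the old current buffer stays behind as cr,
        # and the change lands in a brand-new xcur buffer.
        for buf in self.buffers:
            if buf[1] in ("scur", "xcur"):
                buf[1] = "cr"
        self._new("xcur")

    def ship_dirty_to_writer(self):
        # Observed: the xcur buffer itself is relabelled pi when the
        # dirty block is shipped to another instance for update.
        for buf in self.buffers:
            if buf[1] == "xcur":
                buf[1] = "pi"

    def states(self):
        return [state for _, state in self.buffers]


inst2 = InstanceCache()
trace = []
for action in ("select", "update", "update", "ship_dirty_to_writer"):
    getattr(inst2, action)()
    trace.append(inst2.states())
    print(f"{action}: {trace[-1]}")
```

The four printed lines correspond to stage 2, stage 3, the second update in stage 4, and the final X$BH query on instance 2.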