Some Basic Concepts of Oracle RAC

Clusterware components
Oracle Clusterware includes the following background processes: Cluster Synchronization Services (CSS),
Cluster Ready Services (CRS), and Event Manager (EVM).

CSS: This component manages the cluster configuration by controlling which nodes are members of the
cluster. When a node joins or leaves the cluster, CSS notifies the other nodes of the change in configuration.
If this process fails, then the cluster will be restarted. Under Linux, CSS is implemented
by the ocssd daemon, which runs as the root user.
(In short: CSS handles cluster configuration by managing node membership. Whenever a node joins or leaves the cluster, CSS notifies all nodes of the change in cluster configuration.)

CRS: This component manages high availability operations within the cluster. Objects managed by CRS
are known as resources and can include databases, instances, services, listeners, virtual IP addresses,
and application processes. By default, CRS manages four application process resources: Oracle Net
listeners, virtual IP addresses, the Global Services Daemon (GSD), and the Oracle Notification Service
(ONS). Configuration information about each resource is stored in the Oracle Cluster Registry
(OCR). When the status of a resource changes, CRS generates an event.
CRS monitors resources, such as instances and listeners. In the event of the failure of a resource,
CRS will attempt to automatically restart the component. By default, CRS will attempt to restart the
resource five times before giving up.
Under Linux, CRS is implemented as the crsd daemon, which runs as the root user. In the
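event of a failure, this process restarts automatically.
(In short: CRS is responsible for the high availability of the cluster. The objects it manages are called cluster resources and include databases, instances, services, listeners, VIP addresses, and application processes.
By default, CRS manages four application process resources: listeners, VIP addresses, GSD, and ONS. The configuration of these resources is stored in the OCR. Whenever the state of a resource changes, CRS generates an event.)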

EVM: The EVM component publishes events created by Oracle Clusterware. Under Linux, EVM is implemented
as the evmd daemon, which runs as the root user.
You can specify callout scripts, which will be executed by EVM when a specified event occurs.
These callouts are managed by the racgevt process.
In addition to the background processes, Oracle Clusterware also communicates with the
Oracle Notification Service (ONS), which is a publish and subscribe service that communicates FAN
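events to clients.
(In short: EVM publishes the events generated by CRS. We can specify callout scripts that EVM runs when an event occurs; these callouts are made through the racgevt process.)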

Oracle Cluster Registry (OCR): The Oracle Cluster Registry (OCR) maintains cluster and database configuration information for
RAC and Oracle Clusterware resources, including information about nodes, databases, instances,
services, applications, and listeners.
The OCR is similar in many ways to the Windows registry. Information is stored in a hierarchy
of key-value pairs within a directory tree structure. The OCR can be updated using EM, the Server
Control Utility (SRVCTL), and DBCA.
The OCR must be located on shared storage and must be accessible from all nodes. On each
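node, the location of the OCR is specified in the file /etc/oracle/ocr.loc.
(In short: the OCR maintains the configuration of the whole cluster, covering both RAC and Clusterware resources: node membership, databases, instances, services, listeners, applications, and so on.)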

Voting Disk
The voting disk, which must reside on shared storage, manages cluster membership information. It
is used by RAC to determine instances that are members of the cluster. It is also used to arbitrate
cluster ownership between the remaining instances in the event of a network failure.
In Oracle 10.2 and above, you can create multiple voting disks. Oracle recommends configuring
an odd number of voting disks (e.g., three or five). By default in Oracle 10.2, the OUI will
create three voting disks, although you can specify a single voting disk if your storage provides mirroring
at the hardware level. The voting disk is critical to the operation of Oracle Clusterware and of
the database. Therefore, if you choose to define a single voting disk or you are still using Oracle 10.1,
you should use external mirroring to provide redundancy.

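(In short: Oracle Clusterware uses the voting disk to manage node membership, and relies on what is recorded there to decide which nodes belong to the cluster.
When a split-brain occurs, it arbitrates which partition takes control of the cluster; the other partitions must be evicted from the cluster.)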

Administering Oracle Clusterware
CRSCTL: Performs various administrative operations for Oracle Clusterware
CRS_STAT: Reports the current state of resources configured in the OCR
OCRCONFIG: Performs various administrative operations on the OCR
OCRCHECK: Verifies the integrity of the OCR
OCRDUMP: Dumps the contents of the OCR to a text file

CRSCTL must be run as root.

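RAC networks
The public network serves external clients, for example for data queries.
The private network carries the RAC heartbeat and Cache Fusion traffic.
The virtual IP (VIP) is created by vipca in the final stage of the Clusterware installation. It is registered in the OCR as a CRS resource of type nodeapps, and CRS maintains its state.
The VIP is bound to the node's public NIC, so the public NIC has two addresses. To distinguish it from the VIP, the NIC's original address is called the public IP.
When a node fails, CRS moves the failed node's VIP to another node. The listener on each node listens on both addresses of the public NIC: the public IP and the VIP.
The client's tnsnames.ora is usually configured to point at the nodes' VIPs.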

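How the VIP works:
Assume a two-node RAC environment. In normal operation each node has one VIP: VIP1 on node 1 and VIP2 on node 2. Now node 2 fails, for example through an abnormal shutdown.

1. After CRS detects the failure of node 2, it triggers a Clusterware reconfiguration, which ends with node 2 evicted from the cluster. Node 1 alone forms the new cluster.
2. The VIP of node 2 moves to node 1. The public NIC of node 1 now carries three IP addresses: the two VIPs (VIP1 and VIP2) and public IP1. The NIC holding the VIPs on node 1 goes into promiscuous mode
and sends ARP packets to tell the other nodes that the MAC address for VIP2 has changed.
3. Client connection requests to VIP2 are routed to node 1 by the IP layer (whose main job is routing; it is not responsible for the integrity of data transfer and does not care whether the data is delivered successfully).
4. Because node 1 now owns the VIP2 address, the packets pass through the link, network, and transport layers without trouble.
5. However, the listener on node 1 listens only on VIP1 and public IP1, not on VIP2. At the application layer there is no program to accept the packet, and this error is detected immediately. So
although the connection request reaches node 1, node 1 immediately returns an error to the client.
6. The client receives the error right away and fails over immediately to the other VIP address; that is, it issues a new connection request to VIP1.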
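
The client-side behavior in steps 5 and 6 can be sketched in Python. This is only an illustration of the idea, not how Oracle Net is implemented: `connect_with_failover` and `fake_connect` are hypothetical helpers, and the host names `vip1` and `vip2` and port 1521 are assumptions made for the example. The point is that a refused connection fails fast, so the client can move to the next address without waiting for a TCP timeout.

```python
import socket


def connect_with_failover(addresses, connect=socket.create_connection, timeout=3.0):
    """Try each (host, port) in order; return (address, connection) for the first success.

    A refused connection models the fast error described above: the VIP
    answers at the IP level, but no listener is bound to it on the
    surviving node, so the client can move on immediately.
    """
    last_error = None
    for address in addresses:
        try:
            return address, connect(address, timeout)
        except OSError as exc:  # ConnectionRefusedError is a subclass of OSError
            last_error = exc
    raise ConnectionError(f"no address accepted the connection: {last_error}")


def fake_connect(address, timeout):
    """Hypothetical stand-in for the network after node 2 has failed.

    VIP2 now lives on node 1, where no listener is bound to it.
    """
    if address[0] == "vip2":
        raise ConnectionRefusedError("no listener on vip2")
    return f"session via {address[0]}"


# The client's address list names both VIPs; vip2 is refused, vip1 succeeds.
address, session = connect_with_failover(
    [("vip2", 1521), ("vip1", 1521)], connect=fake_connect
)
```

Passing the connect function as a parameter keeps the sketch runnable without a real cluster; with the default `socket.create_connection` the same loop would work against real listeners.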

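RAC background processes
1. LMSn: the main Cache Fusion process, responsible for transferring data blocks between instances. The corresponding service is the Global Cache Service (GCS). The process name comes from Lock Manager Service.
The number of LMSn processes is controlled by the GCS_SERVER_PROCESSES parameter; the default is 2 and the valid range is 0-9.
2. LMD: this process handles the Global Enqueue Service (GES). Specifically, it coordinates the order in which multiple instances access data blocks, ensuring consistent access to the data. Together with the GCS service of the LMSn processes and the GRD,
it makes up Cache Fusion, the core feature of RAC.
3. LCK: this process handles synchronized access to non-Cache Fusion resources. Each instance has one LCK process.
4. LMON: the LMON processes of the instances communicate periodically to check the health of each node in the cluster. When a node fails, LMON handles cluster reconfiguration, GRD recovery, and related operations. The service it provides is called Cluster Group Services (CGS).
LMON has a self-check capability, which is the CGS service it provides. The key points of this service are:
a. LMON provides node monitoring: it records the health of each node at the application layer. Node health is recorded in a bitmap stored in the GRD, one bit per node: 0 means the node is down, 1 means the node is running normally.
The LMON processes on the nodes communicate with each other to confirm that this bitmap is consistent.
b. The LMON processes on the nodes communicate periodically. This communication can go through the CM layer, or bypass it and use the network layer directly.
c. LMON can cooperate with the underlying Clusterware or work on its own. When LMON detects an instance-level split-brain, it first notifies the underlying Clusterware, but it does not wait indefinitely for the Clusterware layer to act. If the
wait times out, LMON automatically triggers IMR (instance membership reconfiguration). The IMR function provided by LMON can be regarded as the split-brain and I/O fencing mechanism that Oracle provides at the database layer.
d. LMON relies mainly on two heartbeat mechanisms for health checking.
   1. Network heartbeat between nodes: think of the nodes sending each other ping packets at regular intervals to check node state.
   2. Controlfile heartbeat through the control file on disk: the CKPT process on each node updates one data block of the control file every 3 seconds; this block is called the checkpoint progress record. The control file is shared,
   so the instances can check whether the others update it on time and judge their state from that.
5. DIAG: monitors the health of the instance and, when the instance hits a runtime error, collects diagnostic data and records it in the alert.log.
6. GSD: receives user commands from client tools and provides an interface for user management.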
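
The node-status bitmap described under LMON (point a) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Oracle's internal data structure: the function names are hypothetical, node numbers are assumed to start at 0, and the bitmap is held in a plain integer.

```python
def set_node_state(bitmap, node_id, is_up):
    """Return a new bitmap with the bit for node_id set (up) or cleared (down)."""
    mask = 1 << node_id
    return (bitmap | mask) if is_up else (bitmap & ~mask)


def is_node_up(bitmap, node_id):
    """True when the bit for node_id is 1, meaning the node is running normally."""
    return bool((bitmap >> node_id) & 1)


def bitmaps_consistent(views):
    """True when every node reports the same bitmap.

    views maps a node name to the bitmap that node currently holds.
    """
    return len(set(views.values())) <= 1


# Nodes 0 and 2 are up, node 1 is down: bits 0 and 2 are set.
bitmap = set_node_state(set_node_state(0, 0, True), 2, True)
# Node 2 then goes down and its bit is cleared.
after_failure = set_node_state(bitmap, 2, False)
```

A disagreement between the views of two nodes (for example `{"node0": 0b101, "node1": 0b001}`) is the kind of inconsistency that would lead to a reconfiguration.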

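The voting algorithm for RAC split-brain resolution

"Split-brain" is known as a surgical treatment for epilepsy. Doctors believed that epileptic seizures were caused by "abnormal electrical discharge" in the brain.
To stop the abnormal discharge from spreading to the whole brain (the left and right hemispheres), surgeons cut the nerves connecting the patient's two hemispheres,
so that during later seizures at least half of the brain would stay normal and keep control of behavior. After the operation, however, these "split-brain patients" go through a period of maladjustment,
showing divided behavior, as if two people lived in one body and frequently came into conflict.
For example, the right brain wants one hand to scratch an itch on the face, but the left brain disagrees, takes it for someone else's hand getting out of line, and has the other hand stop it,
so the two hands, although they belong to the same person, refuse to give way and end up wrestling each other.

Computers behave in a similar way.

In a cluster, the nodes learn about each other's health through the heartbeat (the counterpart of the nerves connecting the two hemispheres) so that they can work together.
If only the heartbeat fails while the nodes themselves keep running normally (that is, both hemispheres are still fine),
every node concludes that the other nodes are down and that it is the sole survivor of the whole cluster,
and therefore that it should take control of the shared disks (that is, the resources).
So the nodes, although they share the same disks, refuse to give way and end up wrestling each other.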


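During a cluster reconfiguration, all active nodes and all nodes that are joining the cluster take part in the reconfig;
nodes that do not acknowledge (ack) are left out of the new cluster membership.
A cluster reconfiguration consists of several phases:
1. Initialization phase: the reconfig manager (the node with the lowest cluster member number) signals the other nodes to start the reconfig.
2. Voting phase: each node sends the reconfig manager the membership as that node understands it.
3. Split-brain check phase: the reconfig manager checks for a split-brain.
4. Eviction phase: the reconfig manager evicts the non-member nodes.
5. Update phase: the reconfig manager sends the authoritative membership information to the member nodes.

In the split-brain check phase, the reconfig manager finds the nodes that have no network heartbeat but still have a disk heartbeat,
uses the network heartbeat (where possible) and disk heartbeat information to count the nodes in each competing subcluster,
and decides which subcluster survives based on the following two factors:
1. The subcluster with the most nodes survives.
2. If the subclusters have the same number of nodes, the subcluster containing the lowest node number survives. In a two-node RAC environment, for example, node 1 always wins.

Note: the "lowest node number" here is not the RAC node label (node 1, node 2, and so on) but the actual node number in context, as reported by olsnodes -n.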
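
The two survival rules above can be written down as a short Python sketch. It only restates the rules as given in this note; `choose_surviving_subcluster` is a hypothetical name, and each subcluster is assumed to be a list of node numbers as olsnodes -n would report them.

```python
def choose_surviving_subcluster(subclusters):
    """Pick the subcluster that survives a split-brain.

    Rule 1: the subcluster with the most nodes wins.
    Rule 2: on a tie, the subcluster containing the lowest node number wins.
    Returns the surviving subcluster as a sorted list of node numbers.
    """
    candidates = [sorted(nodes) for nodes in subclusters if nodes]
    if not candidates:
        raise ValueError("at least one non-empty subcluster is required")
    # Sort key: larger size first (negated length), then smaller lowest node number.
    return min(candidates, key=lambda nodes: (-len(nodes), nodes[0]))


# Two-node cluster split in half: sizes are equal, so node 1 wins.
two_node_winner = choose_surviving_subcluster([[2], [1]])
# Three-node cluster split 1 versus 2: the larger side wins despite node 1.
three_node_winner = choose_surviving_subcluster([[1], [2, 3]])
```

Every node outside the returned list would be evicted in the eviction phase.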

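So the vote is only a report of "health status"; it is not what decides who gets to stay.

Suppose the cluster has 3 nodes, the instance on node 1 has not been started, and only 2 nodes are active. If node 2 suffers a network failure, node 2 has the smaller member number, so it initiates the eviction of node 3 through the voting disk.

Without a voting disk, when the network heartbeat is unavailable and the cluster splits into several subclusters,
how would the subclusters know how many nodes are in the other subclusters?
How would they send the kill block that notifies the evictee of its eviction?
All of this is read from the voting disk.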


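In fact, the number of voting disks has nothing to do with the number of votes:
however many there are, each one counts as a single vote.
The voting disks work on a majority-available principle, and the redundancy only ensures that all nodes can keep working when a voting disk fails.
With 1 voting disk, if it fails, the cluster stops working.
With 2 voting disks and 1 failed, the cluster stops working, because the available 1/2 is not greater than the unavailable 1/2. Note that the test is >, not >=.
With 3 voting disks and 1 failed, the cluster stays up, because 2/3 > 1/3.
With 4 voting disks and 1 failed, the cluster stays up as well, because 3/4 > 1/4.
With 3 voting disks and 2 failed, the cluster stops working, because 1/3 < 2/3.
With 4 voting disks and 2 failed, the cluster stops working, because the available 1/2 is not greater than the unavailable 1/2.
This is why Oracle recommends an odd number of voting disks. With 3 voting disks, a node only needs to be able to access 2 of them.
With 2 voting disks, a node must be able to access all of them. Would 4 work? Of course, 4 is allowed too,
but 4 disks tolerate exactly the same failures as 3, so one disk is wasted.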
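
The majority rule in the cases above comes down to one strict comparison. A minimal Python sketch follows, with hypothetical function names; it models only the counting argument from this note, not Clusterware itself.

```python
def cluster_survives(total_disks, failed_disks):
    """True when strictly more voting disks are available than unavailable."""
    if total_disks < 1 or not 0 <= failed_disks <= total_disks:
        raise ValueError("need total_disks >= 1 and 0 <= failed_disks <= total_disks")
    available = total_disks - failed_disks
    return available > failed_disks  # strictly greater: a 50/50 split is not enough


def max_tolerated_failures(total_disks):
    """Largest number of failed voting disks that still leaves a strict majority."""
    return (total_disks - 1) // 2


# Reproduce the cases listed above as (total, failed) -> survives.
cases = {(t, f): cluster_survives(t, f) for t, f in
         [(1, 1), (2, 1), (3, 1), (4, 1), (3, 2), (4, 2)]}
```

`max_tolerated_failures` returns 1 for both 3 and 4 disks, which is the "wasted disk" point: the fourth disk adds no extra fault tolerance, while going from 3 to 5 raises the tolerated failures from 1 to 2.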
