In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

时间:2021-06-24 16:52:41

本文为In-memory Computing with SAP HANA on Lenovo X6 Systems第七章Business continuity and resiliency for SAP HANA的读书笔记。

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

Overview of business continuity options

业务连续性有不同的级别,采用何种级别取决于需求

Developing a business continuity plan highly depends on the type of business a company is doing, and it differs (among other factors) by country, regulatory requirements, and employee size.

业务连续性的目标:
* Recovery Time Objective (RTO) defines the maximum tolerated time to get a system online again.
* Recovery Point Objective (RPO) defines the maximum tolerated time span to which data must be restored. It also defines the amount of time for which data is tolerated to be lost. An RPO of zero means that the system must be designed to not lose data in any of the considered events.
* Recovery Consistency Objective (RCO) defines the level of consistency of business processes and data that is spread out over multitier environments.

HA和DR是有区别的

HA covers a hardware failure (for example, one node becomes unavailable because of a faulty processor, memory DIMM, storage, or network failure)
HA is implemented by introducing standby nodes. During normal operation, these nodes do not actively participate in processing data, but they do receive data that is replicated from the worker nodes. If a worker node fails, the standby node takes over and continues data processing.

DR covers the event when multiple nodes in a scale-out configuration fail, or a whole data center goes down because of a fire, flood, or other disaster, and a secondary site must take over the SAP HANA system.

HANA的HA/DR可以在两个层次实现:
1. 基础设施层 - 底层数据复制,例如基于General Parallel File System (GPFS)的存储复制
2. 应用层 - 两端执行相同的指令,可通过SAP HANA System Replication (SSR)实现,SSR不支持自动failover

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

GPFS based storage replication

Lenovo所有的HANA解决方案都基于GPFS。
在HA方案中,有两份数据冗余,在DR方案中,有三份数据冗余。所有的数据复制都是同步的。

SAP HANA System Replication

SSR是基于应用的复制,支持同步和异步,但日志的apply只支持异步。
如果主点失效,failover只能手工做。
支持级联复制

原理需要说明一下:

Every SAP HANA process that is running on the primary system’s worker nodes must have a corresponding process on a secondary worker node to which it replicates its activity.
The only difference between the primary and secondary system is the fact that one cannot connect to the secondary HANA installation and run queries on that database. They can also
be called active and passive systems.
Upon start of the secondary HANA system, each process establishes a connection to its primary counterpart and requests the data that is in main memory, which is called a snapshot.
After the snapshot is transferred, the primary system continuously sends the log information to the secondary system that is running in recovery mode. At the time of this writing, SSR
does not support replaying the logs immediately as they are received; therefore, the secondary site system acknowledges and persists the logs only. To avoid having to replay
hours or days of transaction logs upon a failure, SSR asynchronously transmits a new incremental data snapshot periodically.

SSR复制中,standby node可以承载非生产应用。

Special considerations for DR and long-distance HA setups

需要考虑延迟
一般不考虑同步

HA and DR for single-node SAP HANA

先解释一下single node:

High availability (HA) scenarios for SAP Business Suite with SAP HANA are supported, but are restricted to the simplest case of two servers, one being the worker node and one acting as a standby node. In this case, the database is not partitioned, but the entire database is on a single node. This configuration is sometimes also referred to as a single-node HA configuration. Because of these restrictions with regards to scalability, SAP decided to allow configurations with a higher memory per core ratio, specifically for this use case.

single node就是只有一个work node,即非scale out的情形。物理上可以有2-3个node。

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

注意到:
1. 所有的HA方案都是可以自动切换的;而所有的DR都必须手工切换
2. 所有的HA方案,standby node都不能接受工作负载。而DR方案都可以。
3. 所有的HA方案,GPFS都是一套,而DR方案是两套。
4. HA的复制是同步的,DR的复制可以是同步或异步。

High availability (by using GPFS)

单个数据中心,三个物理node,分别为worker(active), standby 和quorum node。

worker node接受所有工作负载,standby node只用于接管,不能处理工作负载。quorum node用于防止split brain。

存储使用服务器本地存储。
In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

使用同步复制,数据两份冗余。切换无需人工干预

Stretched high availability (by using GPFS)

与single node HA相比,距离更长,其它都一样。
称为stretched HA。

quorum node应放置在第三站点,如果条件不具备,就放在主站点。

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

Disaster recovery (by using GPFS)

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

数据同步复制。
quorum node应放置在第三站点,如果条件不具备,就放在主站点。

注意到这个图和前面两个非常类似,唯一不同是HANA DB只在一个worker node上,而前面两个图,HANA DB都是跨worker node和standby node。

而且由于是DR而非HA,因此不能自动切换。(所有的HA都可自动切换,所有的DR都不能自动切换)

但好处是standby node可以接受工作负载,例如开发和测试。

其实这里谈到的HA和DR的区别类似于Oracle的RAC和ADG的区别。

Disaster recovery (by using SAP HANA System Replication)

前面的方案都是一个GPFS集群,而此方案中,两个节点的融合是在应用层实现的,而非GPFS层。因此需要两个独立的GPFS集群,如下图:

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

切换需要手工做,复制可同步或异步。

HA plus DR (by using GPFS)

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

GPFS的这套方案只用一套GPFS集群。
数据有三分拷贝。HA优先实现本地保护,DR实现站点保护。

HA (by using GPFS) plus DR (by using SSR)

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

本地和远端两套GPFS集群,数据三份拷贝。本地HA的两份拷贝是通过GPFS实现的,而灾备端的第三份拷贝是SSR实现的。

SSR的复制根据距离可以是同步或异步。

HA and DR for scale-out SAP HANA

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

Scale-out SAP HANA installations can implement two levels of redundancy to keep their database instance from going offline. The first step is to add a server node to the scale-out
cluster that acts as a hot-standby node. The second step is to set up another scale-out cluster in a distinct data center that takes over operation if there is a disaster at the primary site.

复制仍通过GPFS或SSR实现

HA by using GPFS storage replication

使用的GPFS文件系统的复制(HA是总共两份数据),既然是scale-out,使用的就是GPFS FPO版本。

DR by using GPFS storage replication

DR方案中,GPFS总共有三份数据拷贝。

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

只有一套GPFS集群,用于HA的数据拷贝是同步复制的,用于DR的数据拷贝可以是异步的。

quorum node防止主点和备点直接网络中断导致的脑裂。

这种方案中,灾备点的配置有些昂贵。

切换是手工的。

灾备点可承载非生产应用,如QA或培训环境。

HA by using GPFS replication plus DR by using SAP HANA Replication

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA
单节点失效可通过主点的standby node 接管(HA), 多节点失效可通过DR切换到备点。

复制可同步或异步。主点和备点各一套GPFS集群和HANA数据库实例。

HA and DR for SAP HANA on Flex System

Flex System是一体机而已,其它概念相同,此处略。

Backup and restore

Basic operating system backup and recovery

操作系统分区一级的备份。

Basic database backup and recovery

Saving the savepoints and the database logs technically is impossible in a consistent way, and thus does not constitute a consistent backup from which it can be recovered. Therefore, a simple file-based backup of the persistency layer of SAP HANA is insufficient.

SAP HANA Studio 或 SAP HANA SQL 接口可启动备份,HANA只支持全备,不支持增量备份。

The backup files are saved to a defined staging area that might be on the internal disks, an external disk on an NFS share,8 or a directly attached SAN subsystem. In addition to the data backup files, the SAP HANA configuration files and backup catalog files must be saved to be recovered. For point-in-time recovery, the log area also must be backed up.

除数据外,配置文件也需要备份

File-based backup tool integration

Database backups by using GPFS snapshots

原理:
GPFS supports a snapshot feature with which you can take a consistent and stable view of the file system that can then be used to create a backup (which is similar to enterprise storage snapshot features). While the snapshot is active, GPFS stores any changes to files in a temporary delta area. After the snapshot is released, the delta is merged with the original data and any further changes are applied on this data.

Taking only a GPFS snapshot does not ensure that you have a consistent backup that you can use to perform a restore. SAP HANA must be instructed to flush out any pending changes to disk to ensure a consistent state of the files in the file system.

没错,存储层的快照必须与应用配合以保证数据一致性。所有的数据库备份都是一样的,类似于freeze和thaw。

Backup tool integration with Backint for SAP HANA

HANA提供API与第三方备份工具集成,即Backint,可以认为类似于Oracle DB中的RMAN。

详见http://scn.sap.com/docs/DOC-34483

目前认证的有Symentec NBU, EMC networker, IBM和Commvault等。

Tivoli Storage Manager for ERP 6.4

Symantec NetBackup 7.5 for SAP HANA

Backup and restore as a DR strategy

The use of backup and restore as a DR solution is a basic way of providing DR. Depending on the RPO, it might be a viable way to achieve DR. The basic concept is to back up the data on the primary site regularly (at least daily) to a defined staging area, which might be an external disk on an NFS share or a directly attached SAN subsystem (this subsystem does not need to be dedicated to SAP HANA). After the backup is done, it must be transferred to the secondary site, for example, by a simple file transfer (can be automated) or by using the replication function of the storage system that is used to hold the backup files.

In-memory Computing with SAP HANA读书笔记 - 第七章:Business continuity and resiliency for SAP HANA

本书的笔记到本章就结束了,Thanks for you time, enjoy reading!