Building Robust Large-Scale Distributed Systems

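In real production environments, machine failures, data-center power outages, network anomalies, and software bugs regularly cause a single machine or entire clusters to misbehave and stop providing stable service. A system can also see traffic surge far beyond its capacity because of a sudden event or an external attack. System design therefore has to account for graceful degradation, flow control, and similar mechanisms. I have read a good number of documents on the subject recently; this post consolidates them into a list of design principles for building large-scale distributed services.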

Design Target

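From the earliest design stage, scalability, availability, ease of operation and management, and cost all need to be considered. A few notes on each: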

Scalability : Resource usage increases linearly (or better!) with load
Availability : Resilience to partial failure, Graceful degradation, Recoverability from failure
Manageability : Simplicity, Maintainability, Diagnostics
Cost : Machine cost and Operation cost

System Availability

System availability is defined in terms of two metrics, MTBF and MTTR:

MTBF : Mean Time Between Failures (average time between system failures)
MTTR : Mean Time To Repair (average time to repair the system)

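Availability = MTBF / (MTBF + MTTR). To raise availability, one option is to lower the failure rate (in theory it can be reduced without limit, but in practice it never reaches zero). The more important option is to speed up repair, for example by shortening restart time.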
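The formula is easy to check numerically. The following is a minimal sketch; the function name and the MTBF/MTTR figures are made up for illustration and do not come from any of the referenced documents.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), both given in the same unit."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: one failure every 1000 hours, one hour to repair.
baseline = availability(1000.0, 1.0)
# Option 1: halve the failure rate (double MTBF).
fewer_failures = availability(2000.0, 1.0)
# Option 2: halve the repair time (for example, a faster restart).
faster_repair = availability(1000.0, 0.5)

print(f"baseline       {baseline:.6f}")
print(f"fewer failures {fewer_failures:.6f}")
print(f"faster repair  {faster_repair:.6f}")
```

With these numbers, halving MTTR yields the same availability as doubling MTBF, which is why repair speed is usually the cheaper lever.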

Design for failure

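In production, hardware aging, data-center power loss, network congestion, and the like mean that anything that can happen will happen. Every operation can fail, including file reads and writes and network reads and writes. Design and development must include exception handling for every such failure.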

Assume every operation can fail and any resource may be unavailable
Keep the failure recovery path simple and test it frequently. Never shut the service down normally; hard-fail it so the recovery path gets exercised.
Be restartable at any time, Start fast.
Decouple components, Fail fast, Isolate failures
Allow emergency human intervention
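As one possible reading of "assume every operation will fail" combined with "fail fast", the sketch below wraps an operation in a bounded retry and gives up with an explicit error instead of hanging. The names `call_with_retry` and `OperationFailed`, the choice of exponential backoff, and the caught exception types are all assumptions made for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class OperationFailed(Exception):
    """Raised when an operation still fails after all attempts."""


def call_with_retry(
    op: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op, retrying I/O-style errors a bounded number of times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return op()
        except (OSError, TimeoutError) as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff between attempts.
                sleep(base_delay * 2 ** attempt)
    # Fail fast and loudly rather than retrying forever.
    raise OperationFailed(f"gave up after {attempts} attempts") from last_error


# Simulated flaky read: fails twice, then succeeds.
calls = {"n": 0}


def flaky_read() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("simulated disk error")
    return "payload"


result = call_with_retry(flaky_read, attempts=3, sleep=lambda seconds: None)
```

The `sleep` parameter is injected only so the example can run without waiting; a caller would normally leave the default.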

Partition the service

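To build a robust distributed service, you first need to be able to split the whole service into separate components, so that failures can be isolated and recovered from and the capacity of each part can be adjusted dynamically to keep the service stable.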

Functional Segmentation : Decouple different functional areas; Scale each component independently

Horizontal Split : Split strategies (modulo, range-based); Aggregation/Routing Proxy

Micro-partitions : load balancing, failure-recovery

Selective replication : Additional replicas to spread the load of hot partitions
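The two split strategies named under Horizontal Split can be sketched as follows. This is an illustration only: the function and class names, the use of CRC32 as the hash, and the boundary keys are assumptions, not something prescribed by the referenced documents.

```python
import bisect
import zlib
from typing import Sequence


def modulo_partition(key: str, num_partitions: int) -> int:
    """Modulo split: a stable hash of the key, modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class RangeRouter:
    """Range-based split over sorted boundary keys.

    bounds[i] is the exclusive upper bound of partition i; the last
    partition takes every key at or above the final bound.
    """

    def __init__(self, bounds: Sequence[str]) -> None:
        self._bounds = sorted(bounds)

    def partition_for(self, key: str) -> int:
        return bisect.bisect_right(self._bounds, key)


router = RangeRouter(["h", "p"])  # three partitions: [..h), [h..p), [p..]
for key in ("apple", "kiwi", "zebra"):
    print(key, "range ->", router.partition_for(key),
          "modulo ->", modulo_partition(key, 8))
```

A routing proxy would sit in front of either function and forward each request to the partition it returns. Modulo spreads keys evenly but reshuffles most of them when the partition count changes, while range-based splitting keeps neighboring keys together at the cost of possible hot ranges.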

Automation

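Automation is essential in managing large-scale distributed clusters. As a cluster grows, failures and anomalies become frequent, machines are racked and decommissioned, and services are deployed and restarted all the time. A handful of people cannot do this work by hand; automated cluster-management tooling is needed to carry it out while keeping the system stable.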

Services fail more frequently as the system scales up/out
Automatic deployment and provisioning
- Failover fast, deploy new instance
- Configuration and code as a unit, keep deployment simple
Staged rollout by scale-unit
- Split scale-units at 1%, 5%, 10%, 20%, 50%, 100%
- Each rollout step covers one scale-unit
Automatic rollback when error detection reports an exception
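A staged rollout with automatic rollback might look like the sketch below. It treats the percentages in the list as cumulative targets, which is an interpretation rather than something the source spells out, and `deploy`, `rollback`, and `healthy` are hypothetical callbacks standing in for real deployment and error-detection tooling.

```python
from typing import Callable, List, Sequence

# Cumulative rollout targets in percent, taken from the list above.
STAGES = (1, 5, 10, 20, 50, 100)


def staged_rollout(
    instances: Sequence[str],
    deploy: Callable[[str], None],
    rollback: Callable[[str], None],
    healthy: Callable[[str], bool],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Deploy stage by stage; roll everything back on the first bad check."""
    updated: List[str] = []
    for percent in stages:
        target = max(1, len(instances) * percent // 100)
        for instance in instances[len(updated):target]:
            deploy(instance)
            updated.append(instance)
        # Error detection after each scale-unit decides whether to continue.
        if not all(healthy(instance) for instance in updated):
            for instance in reversed(updated):
                rollback(instance)
            return False
    return True


# Simulated fleet in which one host turns unhealthy after the update.
hosts = [f"host-{i:03d}" for i in range(200)]
deployed: List[str] = []
rolled_back: List[str] = []
ok = staged_rollout(
    hosts,
    deploy=deployed.append,
    rollback=rolled_back.append,
    healthy=lambda host: host != "host-015",
)
```

In this simulation the bad host is reached during the 10% stage, so only 20 of the 200 hosts are ever touched before the rollback.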

Monitor Everything

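Monitoring is critical: only with it can a large-scale anomaly be located and fixed quickly in its early stages, before it causes greater losses. Some of what needs to be monitored is listed below.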

Performance of all operations and of specific spans
All the input data
Logs at warn/error/fatal level
Fault tolerance mechanisms (automatic degradation actions inside the system also need monitoring, so that the current running state of the system is clear)
- Retry
- Turn-off/on advanced function
- Drop request

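To support this kind of thorough monitoring, the system design has to take tracing requirements into account from the start. See Google's Dapper paper for details; Twitter's Zipkin is an open-source implementation based on that paper.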
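To make the tracing idea concrete, here is a minimal single-threaded sketch of spans that share a trace id and record their parent and duration. It is not the Dapper or Zipkin API; the `Tracer` and `Span` classes and their fields are invented for illustration.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start: float
    duration: Optional[float] = None


class Tracer:
    """Collects finished spans; a sketch, not thread-safe."""

    def __init__(self) -> None:
        self.finished: List[Span] = []
        self._stack: List[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        current = Span(
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            name=name,
            start=time.monotonic(),
        )
        self._stack.append(current)
        try:
            yield current
        finally:
            # Record the duration even if the traced block raised.
            current.duration = time.monotonic() - current.start
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("handle_request") as root:
    with tracer.span("query_db") as child:
        pass  # the real work would happen here
```

Across process boundaries the trace id and parent span id would have to travel with the request, for example in RPC metadata, which is the part a real tracing system takes care of.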

Failure Detection

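Beyond metric monitoring, logs and similar sources should drive real-time alerting on anomalies in the system, and mining the logs for correlations helps uncover deeper causes. eBay's Lessons give an example: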

Log requests, responses, exceptions, and external resource activity
Messages are broadcast on a message bus
Listeners automate failure detection and notification
- Real-time state monitoring : exception and alert
- Historical reports
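The bus-and-listener pattern can be illustrated with an in-process stand-in. Everything here is hypothetical: a real deployment would use an actual message bus, and the event fields (`type`, `ts`), the sliding window, and the threshold are assumptions chosen for the example.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Event = Dict[str, Any]


class MessageBus:
    """In-process stand-in for a real message bus."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: Event) -> None:
        for listener in self._listeners:
            listener(event)


class ExceptionRateDetector:
    """Listener that notifies once enough exceptions fall in one window."""

    def __init__(self, window_seconds: float, threshold: int,
                 notify: Callable[[str], None]) -> None:
        self._window = window_seconds
        self._threshold = threshold
        self._notify = notify
        self._times: Deque[float] = deque()

    def __call__(self, event: Event) -> None:
        if event.get("type") != "exception":
            return
        now = float(event["ts"])
        self._times.append(now)
        # Drop exceptions that have fallen out of the sliding window.
        while self._times and self._times[0] <= now - self._window:
            self._times.popleft()
        if len(self._times) >= self._threshold:
            self._notify(
                f"{len(self._times)} exceptions in the last {self._window:g}s")
            self._times.clear()


alerts: List[str] = []
bus = MessageBus()
bus.subscribe(ExceptionRateDetector(60.0, 3, alerts.append))
bus.publish({"type": "request", "ts": 0.0})
for ts in (1.0, 2.0, 100.0, 101.0, 102.0):
    bus.publish({"type": "exception", "ts": ts})
```

The same published events could also be written to storage by a second listener, which would cover the historical-reports side.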

Quick Diagnose

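Metric monitoring and alerts quickly tell us the current running state of the system and that some anomaly has occurred, but locating the cause of the anomaly matters even more: only once the root cause is identified can the problem be fixed quickly, which shortens MTTR.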

Give detailed information for diagnosis
- Bad: "500 requests failed"
- Better: "Here is the list of failed requests and when they happened"
Chain of evidence : The path of a request from beginning to end
Debugging in production
- Support online debugging
- Take a snapshot of specific modules and ship it out of production
- Configurable logging
Maintenance and inspection interface
- RESTful interface over HTTP
- Running data output as key-value pairs
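An inspection interface of this kind could be as small as the sketch below, which serves running data as key=value lines over HTTP using only the Python standard library. The `/status` path, the stat names, and the port are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical running data; a real service would update these counters.
RUNTIME_STATS: Dict[str, object] = {
    "requests_total": 0,
    "errors_total": 0,
    "degraded": False,
}


def render_stats(stats: Dict[str, object]) -> str:
    """Render running data as one key=value pair per line, sorted by key."""
    return "".join(f"{key}={stats[key]}\n" for key in sorted(stats))


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/status":
            self.send_error(404)
            return
        body = render_stats(RUNTIME_STATS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve the inspection endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `serve()` and fetching `http://127.0.0.1:8080/status` would return the current values; plain key-value text keeps the output easy to read by eye and easy to scrape.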

Audit

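Last is auditing the system: every change that affects it, such as an automatic data update or a manual deployment change, must be recorded explicitly and be easy to review on a dashboard. Every failure has a cause, and in the vast majority of cases an operation or update whose time matches the start of the failure is a prime suspect.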

Configuration audit trail
- Audit Configuration/Binary/Data changes
- What, when, who, which instances
Dynamic data update history
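The "what, when, who, which instances" record and the time-matching heuristic can be sketched as follows. The record layout, the ten-minute lookback window, and the sample trail are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AuditRecord:
    what: str                   # configuration, binary, or data change
    when: float                 # epoch seconds
    who: str
    instances: Tuple[str, ...]  # which instances were affected


def suspects(trail: Sequence[AuditRecord], failure_start: float,
             lookback: float = 600.0) -> List[AuditRecord]:
    """Changes made shortly before the failure began, most recent first."""
    hits = [record for record in trail
            if failure_start - lookback <= record.when <= failure_start]
    return sorted(hits, key=lambda record: record.when, reverse=True)


# Hypothetical audit trail and failure time.
trail = [
    AuditRecord("binary v1.4 -> v1.5", 1000.0, "deploy-bot", ("web-1", "web-2")),
    AuditRecord("index data refresh", 4000.0, "cron", ("search-1",)),
    AuditRecord("config: timeout 5s -> 1s", 4500.0, "alice", ("web-1",)),
]
for record in suspects(trail, failure_start=4600.0):
    print(record.when, record.who, record.what, record.instances)
```

Listing the most recent change first reflects the heuristic in the text: the closer a change is to the start of the failure, the more suspicious it is.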

Alert is an art

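Alerting takes careful calibration, including alert thresholds and alert recipients. It must never miss a real problem, should raise as few false alarms as possible, and should keep unnecessary interruptions to a minimum.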

Minimal set of alert receivers
Alerts-to-trouble ticket ratio (with a goal of near one)
Number of system health issues without corresponding alerts (with a goal of near zero)
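Both metrics are simple to compute once alerts, tickets, and health issues are recorded. The sketch below uses invented function names and made-up sample data.

```python
from typing import Iterable, Set


def alerts_to_ticket_ratio(alerts: int, tickets: int) -> float:
    """Values well above one suggest noisy alerts that led to no ticket."""
    if tickets <= 0:
        raise ValueError("need at least one trouble ticket to form a ratio")
    return alerts / tickets


def issues_without_alert(health_issues: Iterable[str],
                         alerted_issues: Iterable[str]) -> Set[str]:
    """Health issues that no alert covered, i.e. missed alerts."""
    return set(health_issues) - set(alerted_issues)


# Hypothetical month of data.
ratio = alerts_to_ticket_ratio(alerts=12, tickets=10)
missed = issues_without_alert(
    health_issues={"disk-full", "high-latency", "oom"},
    alerted_issues={"disk-full", "oom"},
)
print(f"alerts per ticket: {ratio:.2f}, missed issues: {sorted(missed)}")
```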

References:

On Designing and Deploying Internet-Scale Services, Lessons from eBay, Microsoft Autopilot, Google Dapper

Reposted from: Leoncom, "Building Robust Large-Scale Distributed Systems"
