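In real production environments, machine failures, data-center power loss, network anomalies, software bugs and similar problems frequently leave a single machine or a whole cluster unable to provide stable service. Sudden events or external attacks can also cause traffic to spike far beyond what the system can carry. System design therefore has to account for graceful degradation, traffic control and related mechanisms from the start. I have recently read a number of documents on this topic; this post organizes them into the following list of design principles for building large-scale distributed services.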
Design Target
- Scalability : Resource usage increases linearly (or better!) with load
- Availability : Resilience to partial failure, Graceful degradation, Recoverability from failure
- Manageability : Simplicity, Maintainability, Diagnostics
- Cost : Machine cost and Operation cost
System Availability
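- MTBF : Mean Time Between Failures (average time the system runs between failures)
- MTTR : Mean Time To Repair (average time needed to restore the system)
System availability is Availability = MTBF / (MTBF + MTTR). Raising availability therefore means, on one hand, lowering the probability of failure (in theory it can be pushed arbitrarily low, but in practice it never reaches 0) and, more importantly, speeding up repair, for example by shortening restart time.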
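To make the formula concrete, here is a small worked sketch. The MTBF and MTTR figures are illustrative assumptions, not measurements; the point is only to compare the effect of a longer MTBF with that of a shorter MTTR.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers chosen for illustration only.
baseline = availability(mtbf_hours=1000, mttr_hours=1.0)       # 1000 / 1001
longer_mtbf = availability(mtbf_hours=2000, mttr_hours=1.0)    # 2000 / 2001
faster_repair = availability(mtbf_hours=1000, mttr_hours=0.1)  # 1000 / 1000.1

for label, value in [("baseline", baseline),
                     ("2x MTBF", longer_mtbf),
                     ("10x faster repair", faster_repair)]:
    print(f"{label:>18}: {value:.5%}")
```

With these assumed numbers, cutting repair time by a factor of ten improves availability more than doubling the time between failures.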
Design for failure
- Assume every operation will fail and any resource may be unavailable
- Keep the failure recovery path simple and test it frequently. Never shut the service down normally; hard-fail it as a test.
- Be restartable at any time, Start fast.
- Decouple components, Fail fast, Isolate failures
- Allow emergency human intervention
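One common way to get "fail fast" and "isolate failures" is a circuit breaker around calls to a dependency. The notes above do not name this pattern, so treat the following as a minimal sketch of one possible approach; the class name, thresholds, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast: do not touch the unhealthy dependency at all.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call. A single new failure
            # re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A caller would wrap each remote call, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to degrade gracefully instead of waiting on a timeout.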
Partition the service
Functional Segmentation : Decouple different functional areas; Scale each component independently
Horizontal Split : Split strategies (Modulo, Range-based); Aggregation/Routing Proxy
Micro-partitions : load balancing, failure-recovery
Selective replication : Additional replicas to spread the load of hot partitions
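The two split strategies named above can be sketched as follows. This is an illustrative sketch only: the function and class names, the use of CRC32 as the hash, and the shard labels are assumptions, and a real routing proxy would also handle resharding and replicas.

```python
import bisect
import zlib


def modulo_shard(key: str, num_shards: int) -> int:
    """Modulo strategy: hash the key, then take it modulo the shard count."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(key.encode("utf-8")) % num_shards


class RangeRouter:
    """Range-based strategy: each shard owns a contiguous key range."""

    def __init__(self, upper_bounds, shards):
        # upper_bounds are sorted, exclusive upper bounds; the last shard
        # takes every key at or above the final bound.
        if len(shards) != len(upper_bounds) + 1:
            raise ValueError("need exactly one more shard than bounds")
        self.upper_bounds = list(upper_bounds)
        self.shards = list(shards)

    def shard_for(self, key: str) -> str:
        return self.shards[bisect.bisect_right(self.upper_bounds, key)]


router = RangeRouter(upper_bounds=["h", "p"], shards=["s0", "s1", "s2"])
print(router.shard_for("apple"), router.shard_for("kiwi"), router.shard_for("zebra"))
print(modulo_shard("user-42", 4))
```

Modulo splitting spreads keys evenly but moves most keys when the shard count changes; range-based splitting keeps neighboring keys together, which helps range scans but can create hot partitions.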
- Services fail more frequently as the system scales up/out
- Automatic deployment and provisioning
- Fail over fast, deploy a new instance
- Configuration and code as a unit, keep deployment simple
- Staged rollout by scale-unit
  - Split into scale-units of 1%, 5%, 10%, 20%, 50%, 100%
  - Each rollout step covers one scale-unit
- Automatic rollback on exceptions, based on error detection
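The staged rollout with automatic rollback can be sketched as a simple loop over the percentages listed above. The `deploy`, `healthy`, and `rollback` callables are hypothetical hooks standing in for whatever deployment and error-detection tooling is actually in use.

```python
STAGES = (1, 5, 10, 20, 50, 100)  # percentages taken from the list above


def staged_rollout(deploy, healthy, rollback, stages=STAGES):
    """Deploy stage by stage; roll back and stop at the first unhealthy stage.

    Returns (completed_stages, success).
    """
    completed = []
    for percent in stages:
        deploy(percent)
        if not healthy(percent):
            rollback()
            return completed, False
        completed.append(percent)
    return completed, True


# Demo with fake hooks: the health check starts failing at the 20% stage.
log = []
completed, ok = staged_rollout(
    deploy=lambda pct: log.append(("deploy", pct)),
    healthy=lambda pct: pct < 20,
    rollback=lambda: log.append(("rollback", None)),
)
print(completed, ok)
```

In this demo the rollout stops after the 20% stage is deployed and found unhealthy, so only the first three stages count as completed.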
Monitor Everything
- Performance of all operations and specific spans
- All the input data
- Logs at warn/error/fatal level
- Fault tolerance mechanisms (automatic degradation actions inside the system also need to be monitored, so that the current running state of the system is always clear)
  - Retry
  - Turn-off/on advanced function
  - Drop request
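As a sketch of monitoring both operation performance and a fault tolerance mechanism, the decorator below times every call and counts each retry. The in-memory `counters` and `durations` stores and the metric names are assumptions for illustration; a real service would export them to its monitoring system.

```python
import functools
import time
from collections import Counter, defaultdict

counters = Counter()            # e.g. "fetch.retry", "fetch.gave_up"
durations = defaultdict(list)   # operation name -> list of elapsed seconds


def monitored_retry(name, attempts=3):
    """Retry a function up to `attempts` times, recording metrics as it goes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == attempts:
                            counters[name + ".gave_up"] += 1
                            raise
                        counters[name + ".retry"] += 1  # the retry is visible
            finally:
                # Total time across all attempts, success or failure.
                durations[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


calls = {"n": 0}


@monitored_retry("fetch", attempts=3)
def flaky_fetch():
    """Hypothetical operation that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "payload"


result = flaky_fetch()
print(result, dict(counters), len(durations["fetch"]))
```

Without the retry counter, the call above would look like a plain success and the two hidden failures would go unnoticed.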
Failure Detection
- Log requests, responses, exceptions, and external resource activity
- Broadcast messages on a message bus
- Listeners automate failure detection and notification
- Real-time state monitoring : exceptions and alerts
- Historical reports
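The bus-plus-listener idea can be sketched in-process as below. This is a toy illustration under stated assumptions: the `MessageBus` and `FailureListener` classes, the event fields `ts` and `error`, and the threshold values are all hypothetical, and a production system would use a real message bus.

```python
from collections import defaultdict, deque


class MessageBus:
    """Tiny in-process publish/subscribe bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subscribers[topic]:
            handler(event)


class FailureListener:
    """Notifies once `threshold` exceptions fall inside a sliding time window."""

    def __init__(self, threshold, window_seconds, notify):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify
        self.recent = deque()  # timestamps of recent exceptions

    def __call__(self, event):
        now = event["ts"]
        self.recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while self.recent and now - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) >= self.threshold:
            self.notify(f"{len(self.recent)} exceptions within "
                        f"{self.window_seconds}s, last: {event['error']}")
            self.recent.clear()  # avoid re-alerting on the same burst


alerts = []
bus = MessageBus()
bus.subscribe("exception", FailureListener(3, 10.0, alerts.append))
for ts in (0.0, 1.0, 2.0, 100.0):
    bus.publish("exception", {"ts": ts, "error": "timeout"})
print(alerts)
```

The three exceptions close together trigger one notification; the isolated exception at `ts=100.0` does not.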
Quick Diagnose
- Give detailed information for diagnosis
  - Bad: "500 requests failed"
  - Good: "Here is the list of requests and when they happened"
- Chain of evidence : the path of a request from beginning to end
- Debugging in production
  - Support online debugging
  - Take a snapshot of specific modules and ship it out of production
  - Configurable logging
- Maintenance and inspection interface
  - RESTful interface based on HTTP
  - Running data output as key-value pairs
- Configuration audit trail
  - Audit configuration/binary/data changes
  - What, when, who, which instances
  - Dynamic data update history
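To illustrate the difference between a bare failure count and a chain of evidence, the sketch below records each hop a request takes and reports failed requests together with their full path. The `RequestTrace` class, the component names, and the report format are assumptions made for this example, not part of any particular tracing system.

```python
from dataclasses import dataclass, field


@dataclass
class RequestTrace:
    """Records every component a request passes through, in order."""
    request_id: str
    hops: list = field(default_factory=list)  # (timestamp, component, status)

    def record(self, ts: float, component: str, status: str) -> None:
        self.hops.append((ts, component, status))

    def failed(self) -> bool:
        return any(status != "ok" for _, _, status in self.hops)


def failure_report(traces):
    """One line per failed request: its id, its path, and when each hop ran."""
    lines = []
    for trace in traces:
        if trace.failed():
            path = " -> ".join(f"{component}[{status}]@{ts:.1f}"
                               for ts, component, status in trace.hops)
            lines.append(f"{trace.request_id}: {path}")
    return lines


good = RequestTrace("req-1")
good.record(9.0, "frontend", "ok")
good.record(9.2, "storage", "ok")

bad = RequestTrace("req-2")
bad.record(10.0, "frontend", "ok")
bad.record(10.5, "storage", "timeout")

report = failure_report([good, bad])
print("\n".join(report))
```

Instead of "1 request failed", the report names the request, shows where in its path it broke, and gives the time of each step.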
Alert is an art
- Keep the set of alert recipients minimal
- Alerts-to-trouble-ticket ratio (with a goal of near one)
- Number of system health issues without corresponding alerts
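The two alert metrics above are simple to compute once alerts, tickets, and incidents are recorded somewhere. The sketch below assumes a hypothetical data shape: plain counts of alerts and tickets, plus a mapping from incident id to whether any alert fired for it.

```python
def alert_quality(alert_count, ticket_count, incidents):
    """Return (alerts-to-tickets ratio, ids of incidents that had no alert).

    `incidents` maps an incident id to True if at least one alert fired for it.
    """
    # With no tickets at all the ratio is undefined; report infinity.
    ratio = alert_count / ticket_count if ticket_count else float("inf")
    unalerted = sorted(incident_id
                       for incident_id, alerted in incidents.items()
                       if not alerted)
    return ratio, unalerted


# Illustrative numbers only.
ratio, unalerted = alert_quality(
    alert_count=12,
    ticket_count=10,
    incidents={"INC-1": True, "INC-2": False, "INC-3": True},
)
print(f"alerts per ticket: {ratio:.2f}; incidents without alerts: {unalerted}")
```

A ratio well above one suggests noisy alerts that nobody acts on, while any entry in the second list is a health issue the alerting missed.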
On Designing and Deploying Internet-Scale Services, Lessons from eBay, Microsoft Autopilot, Google Dapper