https://cs.stanford.edu/~matei/courses/2015/6.S897/
http://blog.fnil.net/blog/ac1fa10ff9b2404ed0b91bdfaf76a87d/
http://pages.cs.wisc.edu/~remzi/Classes/739/Papers/paxos.pdf
https://www8.cs.umu.se/kurser/5DV131/VT15/handouts/L7_dm.pdf
http://lamport.azurewebsites.net/pubs/lamport-paxos.pdf
Introduction to Paxos Theory (2): Multi-Paxos and Leader
Implementation principles of PhxPaxos, WeChat's in-house production-grade Paxos library
《The Chubby lock service for loosely-coupled distributed systems》
https://zhuanlan.zhihu.com/p/21438357?refer=lynncui
https://docs.google.com/viewer?url=https%3A%2F%2Fraft.github.io%2Fraft.pdf
https://ramcloud.stanford.edu/~ongaro/thesis.pdf
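For Paxos, I also strongly recommend the blog series written by the WeChat backend team, together with their open-source phxpaxos implementation. It covers essentially every issue and is easy to follow.
That said, Raft really is much easier to understand. I read "In Search of an Understandable Consensus Algorithm", and the paper lays out the steps in such detail that it is hard not to understand it. Raft does, after all, bill itself as an Understandable Consensus Algorithm. I recommend this paper from every angle.
First, it helps with some of the hard parts of Paxos; second, it explains how Raft is implemented, which deepens your understanding of systems such as Etcd. The paper also has an extended version of more than 250 pages, "CONSENSUS: BRIDGING THEORY AND PRACTICE", which teaches you to write a Raft implementation line by line.
Finally, I also read "Building Consistent Transactions with Inconsistent Replication" and watched the author's talk; the authors have also released the source code.
For commentary on TAPIR, I recommend two blog posts: Building Consistent Transactions with Inconsistent Replication and Paper review: Building Consistent Transactions with Inconsistent Replication (SOSP’15). The TAPIR source code only covers normal-case handling; recovery and similar procedures are not included. For questions about recovery, see A FEW WORDS ABOUT INCONSISTENT REPLICATION (IR).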
Introduction
I often argue that the toughest thing about distributed systems is changing the way you think. Below is a collection of material I've found useful for motivating these changes.
Thought Provokers
Ramblings that make you think about the way you design. Not everything can be solved with big servers, databases and transactions.
- Harvest, Yield and Scalable Tolerant Systems - Real world applications of CAP from Brewer et al
- On Designing and Deploying Internet Scale Services - James Hamilton
- The Perils of Good Abstractions - Building the perfect API/interface is difficult
- Chaotic Perspectives - Large scale systems are everything developers dislike - unpredictable, unordered and parallel
- Data on the Outside versus Data on the Inside - Pat Helland
- Memories, Guesses and Apologies - Pat Helland
- SOA and Newton's Universe - Pat Helland
- Building on Quicksand - Pat Helland
- Why Distributed Computing? - Jim Waldo
- A Note on Distributed Computing - Waldo, Wollrath et al
- Stevey's Google Platforms Rant - Yegge's SOA platform experience
Latency
- Latency Exists, Cope! - Commentary on coping with latency and its architectural impacts
- Latency - the new web performance bottleneck - not at all new (see Patterson), but noteworthy
- The Tail At Scale - the challenges inherent in dealing with latency in large-scale systems
Amazon
Somewhat about the technology but more interesting is the culture and organization they've created to work with it.
- A Conversation with Werner Vogels - Coverage of Amazon's transition to a service-based architecture
- Discipline and Focus - Additional coverage of Amazon's transition to a service-based architecture
- Vogels on Scalability
- SOA creates order out of chaos @ Amazon
Google
Current "rocket science" in distributed systems.
- MapReduce
- Chubby Lock Manager
- Google File System
- BigTable
- Data Management for Internet-Scale Single-Sign-On
- Dremel: Interactive Analysis of Web-Scale Datasets
- Large-scale Incremental Processing Using Distributed Transactions and Notifications
- Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Smart design for low latency Paxos implementation across datacentres.
- Spanner - Google's scalable, multi-version, globally-distributed, and synchronously-replicated database.
- Photon - Fault-tolerant and Scalable Joining of Continuous Data Streams. Joins are tough especially with time-skew, high availability and distribution.
- Mesa: Geo-Replicated, Near Real-Time, Scalable Data Warehousing - Data warehousing system that stores critical measurement data related to Google's Internet advertising business.
Consistency Models
Key to building systems that suit their environments is finding the right tradeoff between consistency and availability.
- CAP Conjecture - Consistency, Availability, Partition Tolerance cannot all be satisfied at once
- Consistency, Availability, and Convergence - Proves the upper bound for consistency possible in a typical system
- CAP Twelve Years Later: How the "Rules" Have Changed - Eric Brewer expands on the original tradeoff description
- Consistency and Availability - Vogels
- Eventual Consistency - Vogels
- Avoiding Two-Phase Commit - Two phase commit avoidance approaches
- 2PC or not 2PC, Wherefore Art Thou XA? - Two phase commit isn't a silver bullet
- Life Beyond Distributed Transactions - Helland
- If you have too much data, then 'good enough' is good enough - NoSQL, Future of data theory - Pat Helland
- Starbucks doesn't do two phase commit - Asynchronous mechanisms at work
- You Can't Sacrifice Partition Tolerance - Additional CAP commentary
- Optimistic Replication - Relaxed consistency approaches for data replication
Theory
Papers that describe various important elements of distributed systems design.
- Distributed Computing Economics - Jim Gray
- Rules of Thumb in Data Engineering - Jim Gray and Prashant Shenoy
- Fallacies of Distributed Computing - Peter Deutsch
- Impossibility of distributed consensus with one faulty process - also known as FLP [access requires an account and/or payment; a free version is also available]
- Unreliable Failure Detectors for Reliable Distributed Systems - A method for handling the challenges of FLP
- Lamport Clocks - How do you establish a global view of time when each computer's clock is independent
- The Byzantine Generals Problem
- Lazy Replication: Exploiting the Semantics of Distributed Services
- Scalable Agreement - Towards Ordering as a Service
- Scalable Eventually Consistent Counters over Unreliable Networks - Scalable counting is tough in an unreliable world
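The Lamport Clocks entry above asks how to order events when each machine's clock is independent. As a minimal sketch of the idea (the class and method names here are illustrative assumptions, not taken from the paper), each process keeps a counter that advances on every local event and jumps past the timestamp of any message it receives:

```python
class LamportClock:
    """Logical clock: a per-process counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Local event or message send: advance the counter by one.
        self.time += 1
        return self.time

    def receive(self, sender_time):
        # Message receipt: move past both our own time and the sender's stamp.
        self.time = max(self.time, sender_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
sent = a.tick()              # a sends a message stamped with its new time
b.tick()                     # b handles two unrelated local events
b.tick()
received = b.receive(sent)   # b's clock ends up later than the send stamp
```

The only guarantee is one-directional: if one event causally precedes another, its timestamp is smaller. Two events with ordered timestamps are not necessarily causally related.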
Languages and Tools
Issues of distributed systems construction with specific technologies.
- Programming Distributed Erlang Applications: Pitfalls and Recipes - Building reliable distributed applications isn't as simple as merely choosing Erlang and OTP.
Infrastructure
- Principles of Robust Timing over the Internet - Managing clocks is essential for even basics such as debugging
Storage
Paxos Consensus
Understanding this algorithm is the challenge. I would suggest reading "Paxos Made Simple" before the other papers and again afterward.
- The Part-Time Parliament - Leslie Lamport
- Paxos Made Simple - Leslie Lamport
- Paxos Made Live - An Engineering Perspective - Chandra et al
- Revisiting the Paxos Algorithm - Lynch et al
- How to build a highly available system with consensus - Butler Lampson
- Reconfiguring a State Machine - Lamport et al - changing cluster membership
- Implementing Fault-Tolerant Services Using the State Machine Approach: a Tutorial - Fred Schneider
Other Consensus Papers
- Mencius: Building Efficient Replicated State Machines for WANs - consensus algorithm for wide-area network
Gossip Protocols (Epidemic Behaviours)
- How robust are gossip-based communication protocols?
- Astrolabe: A Robust and Scalable Technology For Distributed Systems Monitoring, Management, and Data Mining
- Epidemic Computing at Cornell
- Fighting Fire With Fire: Using Randomized Gossip To Combat Stochastic Scalability Limits
- Bi-Modal Multicast
- ACM SIGOPS Operating Systems Review - Gossip-based computer networking
- SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol
P2P
- Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications
- Kademlia: A Peer-to-peer Information System Based on the XOR Metric
- Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
- PAST: A large-scale, persistent peer-to-peer storage utility - storage system atop Pastry
- SCRIBE: A large-scale and decentralised application-level multicast infrastructure - wide area messaging atop Pastry
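The Kademlia entry above names an XOR metric. As a rough sketch, assuming tiny integer node IDs for readability (Kademlia itself uses 160-bit IDs) and hypothetical helper names, the distance between two IDs is their bitwise XOR read as an integer, and a lookup prefers the known nodes closest to the target under that distance:

```python
def xor_distance(a: int, b: int) -> int:
    # Distance between two node IDs: bitwise XOR interpreted as an integer.
    return a ^ b


def closest(target: int, nodes: list[int], k: int = 2) -> list[int]:
    # Return the k known node IDs nearest to the target under the XOR metric.
    return sorted(nodes, key=lambda n: xor_distance(n, target))[:k]


nodes = [0b0001, 0b0100, 0b0111, 0b1100]   # illustrative 4-bit IDs
nearest = closest(0b0101, nodes)
```

Because XOR is symmetric and an ID is at distance zero only from itself, every node ranks its neighbours consistently, which is what lets lookups converge on the target.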
============
Distributed Systems
- General Papers
- Topics
External Papers
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications
Linearizability: A Correctness Condition for Concurrent Objects
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Hoard: A Scalable Memory Allocator for Multithreaded Applications
MillWheel: Fault-Tolerant Stream Processing at Internet Scale
Omega: flexible, scalable schedulers for large compute clusters
Orleans: Distributed Virtual Actors for Programmability and Scalability
Sinfonia: A New Paradigm for Building Scalable Distributed Systems
The Chubby Lock Service for Loosely-Coupled Distributed Systems
The Join Calculus: a Language for Distributed Mobile Programming
Transactional Client-Server Cache Consistency: Alternatives and Performance
Untraceable Electronic Mail, Return Addresses, and Digital Pseudonyms
Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems
Other Hosted Papers
A response to Cheriton and Skeen's Criticism of Causal and Totally Ordered Communication
A Universal Modular ACTOR Formalism for Artificial Intelligence
A Versatile Scheme for Routing Highly Variable Traffic in Service Overlays and IP Backbones
Beehive: O(1) Lookup Performance for Power-Law Query Distributions in Peer-to-Peer Overlays
Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services
Chain Replication for Supporting High Throughput and Availability
Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms
Copysets: Reducing the Frequency of Data Loss in Cloud Storage
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure
Distributed Snapshots: Determining Global States of Distributed Systems
Herbivore: A Scalable and Efficient Protocol for Anonymous Communication
How the Hidden Hand Shapes the Market for Software Reliability
Implementing the Omega failure detector in the crash-recovery failure model
Impossibility of Distributed Consensus with One Faulty Process
Kelips: Building an Efficient and Stable P2P DHT Through Increased Memory and Background Overhead
Large-scale Incremental Processing Using Distributed Transactions and Notifications
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Oblivious routing of highly variable traffic in service overlays and IP backbones
Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems
SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control
The Akamai Network: A Platform for High-Performance Internet Applications
The Dining Cryptographers Problem: Unconditional Sender and Recipient Untraceability
Understanding the Limitations of Causally and Totally Ordered Communication
Viewing Control Structures as Patterns of Passing Messages
ZooKeeper: Wait-free coordination for Internet-scale systems
Tiered Replication: A Cost-effective Alternative to Full Cluster Geo-replication
Topics
Datastores
Calvin: Fast Distributed Transactions for Partitioned Database Systems
Consistency Tradeoffs in Modern Distributed Database System Design
CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data
Don’t Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS
HaLoop: Efficient Iterative Data Processing on Large Clusters
Making Reliable Distributed Systems in the Presence of Software Errors
Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters
Towards a Next Generation Data Center Architecture: Scalability and Commoditization
Freenet: A Distributed Anonymous Information Storage and Retrieval System
Megastore: Providing Scalable, Highly Available Storage for Interactive Services
RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters
Physics
- “On the Electrodynamics of Moving Bodies” (1905) - Einstein
By solving the asymmetries that arise in Maxwell’s equations, Einstein’s 1905 paper set the stage for current distributed systems work by demonstrating that there is no absolute frame of reference and by providing an upper bound on the speed of communication.