What algorithms are there for failover in a distributed system?

Time: 2022-05-01 03:40:52

I'm planning on making a distributed database system using a shared-nothing architecture and multiversion concurrency control. Redundancy will be achieved through asynchronous replication (it's allowed to lose some recent changes in case of a failure, as long as the data in the system remains consistent). For each database entry, one node has the master copy (only that node has write access to it), and one or more other nodes have secondary copies of the entry for scalability and redundancy purposes (the secondary copies are read-only). When the master copy of an entry is updated, the update is timestamped and sent asynchronously to the nodes with secondary copies, so that eventually they will get the latest version of the entry. The node that holds the master copy can change at any time: if another node needs to write that entry, it asks the current owner to transfer ownership of that entry's master copy, and after receiving ownership it can write the entry (all transactions and writes are local).
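
For concreteness, here is a minimal sketch of that data model (all names are hypothetical): each entry has a single master node that performs local, timestamped writes and ships them asynchronously to read-only secondaries.

```python
# Minimal sketch of the described data model; all names are hypothetical.
import time
from dataclasses import dataclass, field

@dataclass
class EntryVersion:
    value: bytes
    timestamp: float              # set by the master at write time

@dataclass
class Entry:
    key: str
    master_node: str              # only this node may write the entry
    secondaries: set = field(default_factory=set)

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}           # key -> EntryVersion (local copies)
        self.outbox = []          # (target, key, version), shipped asynchronously

    def write(self, entry, value):
        # All transactions and writes are local; only the master may write.
        assert entry.master_node == self.node_id, "not the master for this entry"
        version = EntryVersion(value, time.time())
        self.store[entry.key] = version
        # Asynchronous replication: losing queued updates on a crash is
        # acceptable under the stated redundancy model.
        for target in entry.secondaries:
            self.outbox.append((target, entry.key, version))
```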

Lately I've been thinking about what to do when a node in the cluster goes down, that is, what strategy to use for failover. Here are some questions. I hope you know of available alternatives for at least some of them.

  • What algorithms are there for doing failover in a distributed system?
  • What algorithms are there for consensus in a distributed system?
  • How should the nodes in the cluster determine that a node is down?
  • How should the nodes determine which database entries had their master copy on the failed node at the time of failure, so that other nodes can recover those entries?
  • How to decide which node(s) have the latest secondary copy of some entry?
  • How to decide which node's secondary copy should be promoted to be the new master copy?
  • How to handle it, if a node that was thought to be down suddenly comes back as if nothing happened?
  • How to avoid split-brain scenarios, where the network is temporarily split into two and both sides think that the other side has died?

6 Answers

#1


28  

* What algorithms are there for doing failover in a distributed system?

Possibly not algorithms, so much as systems. You need to design your architecture around the questions you've asked.

* What algorithms are there for consensus in a distributed system?

You probably want to implement Paxos. Simple Paxos is not too hard to get right. If you're trying to make it bulletproof, read Google's 'Paxos Made Live' paper. If you're hoping to make it high-performance, look at Multi-Paxos.
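
To make the shape of Paxos concrete, here is a compressed single-decree sketch (synchronous and in-process, purely illustrative; every name is hypothetical, and a real implementation needs durable state, messaging, and retries):

```python
# Single-decree Paxos, compressed to its two phases. In-process and
# synchronous for illustration only; real acceptors persist their state.

class Acceptor:
    def __init__(self):
        self.promised = -1            # highest ballot promised so far
        self.accepted = (-1, None)    # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        # Phase 1b: promise to ignore lower ballots; report prior acceptance.
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, self.accepted

    def accept(self, ballot, value):
        # Phase 2b: accept unless a higher ballot was promised in the meantime.
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    # Phases 1a/2a: a value is chosen once a majority accepts it.
    majority = len(acceptors) // 2 + 1
    granted = [acc for ok, acc in (a.prepare(ballot) for a in acceptors) if ok]
    if len(granted) < majority:
        return None                   # outbid by a higher ballot; retry higher
    # If some acceptor already accepted a value, adopt the highest-ballot one.
    prior = max(granted, key=lambda acc: acc[0])
    chosen = prior[1] if prior[0] >= 0 else value
    acks = sum(a.accept(ballot, chosen) for a in acceptors)
    return chosen if acks >= majority else None

# e.g. three acceptors agreeing on a failover decision:
acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="promote node-b for entry-x"))
```

The value being agreed on here could be a failover decision, such as which node becomes the new master of an entry; Multi-Paxos amortizes phase 1 across a sequence of such decisions.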

* How should the nodes in the cluster determine that a node is down?

Depends. Heartbeats are actually a pretty good way to do this. The problem is that you get false positives, but that's kind of unavoidable, and in a cluster on the same LAN with manageable load they're accurate. The good thing about Paxos is that false positives are dealt with automatically. However, if you actually need failure information for some other purpose, then you need to make sure it's OK to detect a node as failed when it is actually just under load and taking time to respond to a heartbeat.
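
A timeout-based heartbeat detector can be as small as the sketch below (the interval, threshold, and names are illustrative assumptions). Per the caveat above, "suspected" is a hint that feeds the failover protocol, not proof of failure:

```python
# Sketch of a heartbeat-based failure detector; thresholds are illustrative.
import time

HEARTBEAT_INTERVAL = 1.0   # seconds between expected heartbeats
SUSPECT_AFTER = 3          # missed intervals before suspecting a node

class FailureDetector:
    def __init__(self):
        self.last_seen = {}    # node_id -> monotonic time of last heartbeat

    def on_heartbeat(self, node_id):
        # Called whenever any message (heartbeat or data) arrives from a peer.
        self.last_seen[node_id] = time.monotonic()

    def suspected(self):
        # False positives are possible: a slow node looks the same as a dead one.
        now = time.monotonic()
        return {n for n, t in self.last_seen.items()
                if now - t > SUSPECT_AFTER * HEARTBEAT_INTERVAL}
```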

* How should the nodes determine which database entries had their master copy on the failed node at the time of failure, so that other nodes may recover those entries?
* How to decide which node(s) have the latest secondary copy of some entry?
* How to decide which node's secondary copy should be promoted to be the new master copy?

I think you might really benefit from reading the Google File System (GFS) paper. In GFS there's a dedicated master node which keeps track of which nodes have which blocks. This scheme might work for you, but the key is to keep accesses to this master minimal.

If you don't store this information on a dedicated node, you're going to have to store it everywhere. Try tagging the data with the master holder's id.
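
Both options might look roughly like this sketch (hypothetical structures, not GFS's actual interfaces): a dedicated directory node in the first case, a per-entry owner tag scanned locally in the second:

```python
# Option 1: a dedicated directory node, GFS-style. Clients cache lookups so
# that accesses to the directory stay minimal.
class Directory:
    def __init__(self):
        self.master_of = {}    # key -> node_id currently holding the master copy

    def lookup(self, key):
        return self.master_of.get(key)

    def reassign(self, key, new_master):
        # Invoked by the failover protocol once a master is declared dead.
        self.master_of[key] = new_master

# Option 2: no directory; every replica carries the owner's id, so surviving
# nodes can scan their local stores for entries mastered by the failed node.
def entries_mastered_by(local_store, failed_node):
    # local_store: key -> record dict that includes a "master_node" tag
    return [key for key, record in local_store.items()
            if record["master_node"] == failed_node]
```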

* How to handle it, if a node that was thought to be down suddenly comes back as if nothing happened?

See above, but the basic point is that you have to be careful, because a node that is no longer the master might think that it is. One thing I don't think you've solved: how does an update get to the master, i.e. how does a client know which node to send the update to?
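
One common guard for the returning-master problem, offered as an assumption rather than something stated in this answer, is to attach an epoch (fencing) number to mastership: every ownership transfer bumps the epoch, and replicas reject writes or replication messages carrying an older epoch, so a returning node cannot keep acting as master:

```python
# Sketch of epoch-based fencing; names are hypothetical.
class EntryReplica:
    def __init__(self):
        self.epoch = 0         # bumped every time mastership of the entry moves
        self.value = None

    def apply_update(self, sender_epoch, value):
        if sender_epoch < self.epoch:
            return False       # stale master: reject; it must resync and step down
        self.epoch = sender_epoch
        self.value = value
        return True
```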

* How to avoid split-brain scenarios, where the network is temporarily split into two, and both sides think that the other side has died?

Paxos works here by preventing progress in the case of a perfect split. Otherwise, as before, you have to be very careful.
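
The rule behind that is the majority quorum: only the side of a partition that can reach more than half of the fixed membership may elect a new master, so at most one side can ever make progress. As a one-function sketch:

```python
# Majority-quorum check; with a perfect 50/50 split, neither side passes.
def may_elect_master(reachable_nodes, all_nodes):
    return len(reachable_nodes) > len(all_nodes) // 2
```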

In general, solve the question of knowing which node gets which data item as the master, and you'll be a long way towards fixing your architecture. Note that you can't just have the node receiving the update be the master - what if two updates happen concurrently? Don't rely on a synchronised global clock either - that way madness lies. You probably want to avoid running consensus on every write if you can help it, so instead perhaps have a slow master-failover protocol and a fast write path.

Feel free to shoot me a mail offline if you want to know more details. My blog http://the-paper-trail.org deals with a lot of this stuff.

cheers,

Henry

#2


10  

You are asking an absolutely massive question, and a lot of what you want to know is still in active research.

Some thoughts:

  • Distributed systems are difficult, because there are no foolproof systems to deal with failures; in an asynchronous system, there is no way to be sure whether a node is down or there is just network delay. This may sound trivial, but it really isn't.
  • Achieving consensus can be done with the Paxos family of algorithms, versions of which are used in Google's Bigtable, and in other places.

You'll want to delve into a distributed systems textbook (or several). I like Tanenbaum's Distributed Systems: Principles and Paradigms.

#3


3  

A great blog that talks a lot about distributed systems and distributed algorithms -- including implementing Paxos -- is http://the-paper-trail.org/

#4


2  

This problem was solved by DEC for VMS with the Distributed Lock Manager. Modern solutions are based on this design. Read the Wikipedia article for some current solutions. You should look at OCFS2, which is now part of the Linux kernel.

#5


0  

Tackling just a small part of your question: in the scenario you describe there's no way to decide, in the abstract, which node(s) have the latest secondary copy. At best, a node can poll and determine, after a bit of communication, which node has the most current copy among the nodes that it knows of and can see, that know of and can see it, and that can't see the old master (see the sketch after this list). But:

  • They can't find out the status of nodes they can't reach
  • They can't find out the status of nodes that can't reach them
  • They can't be sure that what they think they know about the status of a node that can see the old master (when they can't) is current: the master could have updated that shared neighbor after the neighbor reported its status.
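
A best-effort version of that poll might look like the sketch below (hypothetical names; it inherits every caveat in the list above, and with asynchronous replication a fresher copy on an unreachable node is lost by design):

```python
# Promote the reachable replica with the highest replication timestamp.
def pick_new_master(reachable_replicas):
    # reachable_replicas: [(node_id, last_replicated_timestamp or None)]
    candidates = [(ts, node) for node, ts in reachable_replicas
                  if ts is not None]
    if not candidates:
        return None            # no surviving copy among the nodes we can see
    _, node = max(candidates)  # freshest copy that is actually visible to us
    return node
```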

On the broader issues, you may want to look at how systems like memcached handle them, and especially read through their mailing lists to see what problems they've encountered when theory met practice.

#6


-1  

I don't know but when you are done I want to download your distributed database system.
