The HDFS HA feature addresses the above problems by providing the option of running two NameNodes in the
same cluster, in an Active/Passive configuration. These are referred to as the Active NameNode and the Standby
NameNode. Unlike the Secondary NameNode, the Standby NameNode is hot standby, allowing a fast failover
to a new NameNode in the case that a machine crashes, or a graceful administrator-initiated failover for the
purpose of planned maintenance. You cannot have more than two NameNodes.
2、The Active NN Responds to Client Requests
In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, one of the
NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for
all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to
provide a fast failover if necessary.
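For orientation only, here is a minimal sketch of how the two NameNodes are typically declared in hdfs-site.xml; the nameservice ID "mycluster", the NameNode IDs "nn1" and "nn2", and the hostnames are placeholders, not values taken from this document:

<!-- Logical name for this HA nameservice (placeholder value) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- The two NameNodes that make up the Active/Standby pair -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- RPC address of each NameNode (example hosts and ports) -->
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<!-- Lets HDFS clients discover which NameNode is currently Active -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>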
3、Quorum-based Storage
In order for the Standby node to keep its state synchronized with the Active node in this implementation, both
nodes communicate with a group of separate daemons called JournalNodes. When any namespace modification
is performed by the Active node, it durably logs a record of the modification to a majority of these JournalNodes.
The Standby node is capable of reading the edits from the JournalNodes, and is constantly watching them for
changes to the edit log. As the Standby Node sees the edits, it applies them to its own namespace. In the event
of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting
itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
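As a hedged illustration of how both NameNodes are pointed at the same JournalNode quorum, the shared edits directory is normally set in hdfs-site.xml along these lines (the three JournalNode hosts and the journal ID "mycluster" are placeholders):

<!-- URI of the JournalNode quorum that stores the shared edit log -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>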
In order to provide a fast failover, it is also necessary that the Standby node has up-to-date information regarding
the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of
both NameNodes, and they send block location information and heartbeats to both.
It is vital for the correct operation of an HA cluster that only one of the NameNodes be active at a time. Otherwise,
the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order
to ensure this property and prevent the so-called "split-brain scenario," the JournalNodes will only ever allow a
single NameNode to be a writer at a time. During a failover, the NameNode which is to become active will simply
take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from
continuing in the Active state, allowing the new Active NameNode to safely proceed with failover.
4、JournalNodes
There must be at least three JournalNode daemons, since edit log modifications must be written to a majority
of JournalNodes. This will allow the system to tolerate the failure of a single machine. You can also run more
than three JournalNodes, but in order to actually increase the number of failures the system can tolerate,
you should run an odd number of JournalNodes (three, five, seven, etc.). Note that when running with N
JournalNodes, the system can tolerate at most (N - 1) / 2 failures and continue to function normally. If the
requisite quorum is not available, the NameNode will not format or start, and you will see an error similar to
this:
12/10/01 17:34:18 WARN namenode.FSEditLog: Unable to determine input streams from QJM
to [10.0.1.10:8485, 10.0.1.10:8486, 10.0.1.10:8487]. Skipping.
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
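Each JournalNode daemon also needs a local directory in which to keep its copy of the edit log; a minimal sketch, with the path chosen purely as an example:

<!-- Local storage directory on each JournalNode host (example path) -->
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/data/1/dfs/jn</value>
</property>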
5、No Secondary NameNode
Note:
In an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and
thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA
cluster. In fact, to do so would be an error. If you are reconfiguring a non-HA-enabled HDFS cluster
to be HA-enabled, you can reuse the hardware which you had previously dedicated to the Secondary
NameNode.
6、Fencing Configuration
When you use Quorum-based Storage, only one NameNode will ever be allowed to write to the
JournalNodes, so there is no potential for corrupting the file system metadata in a "split-brain"
scenario. But when a failover occurs, it is still possible that the previous Active NameNode could serve
read requests to clients - and these requests may be out of date - until that NameNode shuts down
when it tries to write to the JournalNodes. For this reason, it is still desirable to configure some fencing
methods even when using Quorum-based Storage.
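As a sketch only (the private key path is a placeholder), a common way to express this in hdfs-site.xml is an sshfence method followed by a shell fallback, tried in order during a failover:

<!-- Fencing methods, one per line, tried in order until one succeeds -->
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence
shell(/bin/true)</value>
</property>
<!-- SSH key used by sshfence to log in and stop the old Active NameNode -->
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hdfs/.ssh/id_rsa</value>
</property>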
7、Automatic Failover Configuration
Apache ZooKeeper is a highly available service for maintaining small amounts of coordination data, notifying
clients of changes in that data, and monitoring clients for failures. The implementation of automatic HDFS
failover relies on ZooKeeper for the following things:
• Failure detection - each of the NameNode machines in the cluster maintains a persistent session in ZooKeeper.
If the machine crashes, the ZooKeeper session will expire, notifying the other NameNode that a failover
should be triggered.
• Active NameNode election - ZooKeeper provides a simple mechanism to exclusively elect a node as active.
If the current active NameNode crashes, another node can take a special exclusive lock in ZooKeeper indicating
that it should become the next active NameNode.
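A sketch of the two properties that typically turn this behavior on; the ZooKeeper hostnames are placeholders. The first property goes in hdfs-site.xml, the second in core-site.xml:

<!-- hdfs-site.xml: enable ZKFC-driven automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: the ZooKeeper quorum used for failure detection and election -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>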
8、ZKFC
The ZKFailoverController (ZKFC) is a new component - a ZooKeeper client which also monitors and manages
the state of the NameNode. Each of the machines which runs a NameNode also runs a ZKFC, and that ZKFC is
responsible for:
• Health monitoring - the ZKFC pings its local NameNode on a periodic basis with a health-check command.
So long as the NameNode responds promptly with a healthy status, the ZKFC considers the node healthy.
If the node has crashed, frozen, or otherwise entered an unhealthy state, the health monitor will mark it as
unhealthy.
• ZooKeeper session management - when the local NameNode is healthy, the ZKFC holds a session open in
ZooKeeper. If the local NameNode is active, it also holds a special lock znode. This lock uses ZooKeeper's
support for "ephemeral" nodes; if the session expires, the lock node will be automatically deleted.
• ZooKeeper-based election - if the local NameNode is healthy, and the ZKFC sees that no other node currently
holds the lock znode, it will itself try to acquire the lock. If it succeeds, then it has "won the election", and is
responsible for running a failover to make its local NameNode active. The failover process is similar to the
manual failover described above: first, the previous active is fenced if necessary, and then the local NameNode
transitions to active state.
9、Automatic Failover FAQ
(1)、Is it important that I start the ZKFC and NameNode daemons in any particular order?
No. On any given node you may start the ZKFC before or after its corresponding NameNode.
(2)、What additional monitoring should I put in place?
You should add monitoring on each host that runs a NameNode to ensure that the ZKFC remains running.
In some types of ZooKeeper failures, for example, the ZKFC may unexpectedly exit, and should be restarted
to ensure that the system is ready for automatic failover. Additionally, you should monitor each of the
servers in the ZooKeeper quorum. If ZooKeeper crashes, automatic failover will not function.
(3)、What happens if ZooKeeper goes down?
If the ZooKeeper cluster crashes, no automatic failovers will be triggered. However, HDFS will continue
to run without any impact. When ZooKeeper is restarted, HDFS will reconnect with no issues.
(4)、Can I designate one of my NameNodes as primary/preferred?
No. Currently, this is not supported. Whichever NameNode is started first will become active. You may
choose to start the cluster in a specific order such that your preferred node starts first.
(5)、How can I initiate a manual failover when automatic failover is configured?
Even if automatic failover is configured, you can initiate a manual failover using the hdfs haadmin -failover
command. It will perform a coordinated failover.
10、Configuring the mapred.ha.fencing.methods Parameter for JobTracker HA
This parameter takes a list of scripts or Java classes that will be used to fence the active JobTracker during failover.
Only one JobTracker should be active at any given time, but you can simply configure mapred.ha.fencing.methods
as shell(/bin/true), since the JobTrackers fence themselves: split-brain is avoided by the old active JobTracker
shutting itself down if another JobTracker takes over.
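In other words, a minimal mapred-site.xml entry along the following lines is usually enough; this is shown only as a sketch of the configuration described above:

<!-- JobTrackers fence themselves, so a no-op fencing method suffices -->
<property>
  <name>mapred.ha.fencing.methods</name>
  <value>shell(/bin/true)</value>
</property>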