HA Configuration for CDH4.2

Date: 2023-01-08 18:18:10

I. NameNode HA

1. core-site.xml

• For MRv1:
<property>
<name>fs.default.name</name>
<value>hdfs://mycluster</value>
</property>
• For YARN:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>

<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
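
The ha.zookeeper.quorum value is a comma-separated list of host:port pairs, and a stray space inside it is easy to miss. The following is a minimal sketch of a sanity check; the function name valid_quorum is hypothetical, and the check only looks at the shape of the string, not at whether the servers are reachable.

```shell
# Return 0 if the argument looks like host:port[,host:port...] with no spaces.
# valid_quorum is a hypothetical helper, not part of Hadoop or ZooKeeper.
valid_quorum() {
    case "$1" in
        "" | *" "*) return 1 ;;
    esac
    rest="$1"
    while [ -n "$rest" ]; do
        # Take the first comma-separated entry and drop it from the remainder.
        entry="${rest%%,*}"
        case "$rest" in
            *,*) rest="${rest#*,}" ;;
            *)   rest="" ;;
        esac
        host="${entry%:*}"
        port="${entry##*:}"
        # The host must be non-empty and the entry must contain a colon.
        [ -n "$host" ] && [ "$host" != "$entry" ] || return 1
        # The port must be non-empty and consist of digits only.
        case "$port" in
            "" | *[!0-9]*) return 1 ;;
        esac
    done
    return 0
}

if valid_quorum "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"; then
    echo "quorum string looks well-formed"
fi
```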


 

2. hdfs-site.xml

<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>

<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>nn1,nn2</value>
</property>

<property>
<name>dfs.namenode.rpc-address.mycluster.nn1</name>
<value>machine1.example.com:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.nn2</name>
<value>machine2.example.com:8020</value>
</property>

<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>machine1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>machine2.example.com:50070</value>
</property>

<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/mycluster</value>
</property>
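
The shared edits URI lists the JournalNodes separated by semicolons (not commas) and ends with the nameservice ID. As a rough illustration, the value above could be assembled like this; qjournal_uri and JN_PORT are hypothetical names, and the sketch assumes every JournalNode listens on the same port.

```shell
# Build a qjournal:// URI from a nameservice ID and a list of JournalNode hosts.
# qjournal_uri and JN_PORT are hypothetical; 8485 is the port used in the example above.
qjournal_uri() {
    ns="$1"; shift
    port="${JN_PORT:-8485}"
    uri=""
    for host in "$@"; do
        # Append ";" before every entry except the first.
        uri="${uri:+$uri;}$host:$port"
    done
    printf 'qjournal://%s/%s\n' "$uri" "$ns"
}

qjournal_uri mycluster node1.example.com node2.example.com node3.example.com
```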

===JournalNode===

<property>
<name>dfs.journalnode.edits.dir</name>
<value>/data/1/dfs/jn</value>
</property>

===Client Failover Configuration===

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

===Fencing Configuration===

<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>/home/exampleuser/.ssh/id_rsa</value>
</property>

===Automatic Failover Configuration===

<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
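
Several of the property names above embed the nameservice ID and the NameNode IDs, so a typo in one of them can silently leave a key unset. The sketch below lists the per-nameservice keys and asks `hdfs getconf -confKey` whether each one resolves on the local host. The helper names ha_keys and check_ha_keys and the GETCONF override are assumptions made for illustration, and the check relies on getconf returning a non-zero status for a missing key.

```shell
# Print the property names that depend on the nameservice and NameNode IDs.
ha_keys() {
    ns="$1"; shift
    echo "dfs.ha.namenodes.$ns"
    echo "dfs.client.failover.proxy.provider.$ns"
    for nn in "$@"; do
        echo "dfs.namenode.rpc-address.$ns.$nn"
        echo "dfs.namenode.http-address.$ns.$nn"
    done
}

# Report every key that the lookup command cannot resolve.
# GETCONF can be overridden to dry-run the loop without a cluster.
check_ha_keys() {
    missing=0
    for key in $(ha_keys "$@"); do
        if ! ${GETCONF:-hdfs getconf -confKey} "$key" >/dev/null 2>&1; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# List the keys for the example cluster; run check_ha_keys with the same
# arguments on a configured host to verify them.
ha_keys mycluster nn1 nn2
```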

3. On one of the NameNode hosts, initialize the HA state in ZooKeeper (the ZooKeeper ensemble must already be running):

hdfs zkfc -formatZK

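4. Format the NameNode (new clusters only). The format command is listed under step 6 (1); because the shared edits directory is a qjournal:// URI, the JournalNodes from step 5 must be running before it is executed.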

5. Install and start the JournalNodes (they must be started before the NameNodes):

 sudo yum install hadoop-hdfs-journalnode

 sudo service hadoop-hdfs-journalnode start

6. Start the NameNodes

(1) Format the primary NameNode (new clusters only), then start it:

sudo -u hdfs hadoop namenode -format

sudo service hadoop-hdfs-namenode start

(2) Bootstrap the standby NameNode from the primary, then start it:

sudo -u hdfs hdfs namenode -bootstrapStandby
sudo service hadoop-hdfs-namenode start

7. Configure automatic failover: install and run ZKFC on each NameNode host

sudo yum install hadoop-hdfs-zkfc

sudo service hadoop-hdfs-zkfc start

8. Verify automatic failover by killing the active NameNode process:

kill -9 <pid of NN>

Then check that the standby NameNode takes over as active.
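
To see which NameNode is active before and after the kill, `hdfs haadmin -getServiceState` can be queried for each ID. The following is only a sketch: active_nn and the HAADMIN override are hypothetical names, and nn1 and nn2 are the IDs from dfs.ha.namenodes.mycluster above.

```shell
# Command used to query HA state; overridable so the loop can be dry-run.
HAADMIN="${HAADMIN:-sudo -u hdfs hdfs haadmin}"

# Print the ID of the first NameNode that reports "active"; return 1 if none does.
active_nn() {
    for id in "$@"; do
        state="$($HAADMIN -getServiceState "$id" 2>/dev/null)"
        if [ "$state" = "active" ]; then
            echo "$id"
            return 0
        fi
    done
    return 1
}

# Possible usage on a NameNode host:
#   before="$(active_nn nn1 nn2)"
#   kill -9 <pid of NN>            # on the host running the active NameNode
#   sleep 10; active_nn nn1 nn2    # should now print the other ID
```

The sleep interval is a guess; how long failover takes depends on the ZooKeeper session timeout and on fencing.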

 

II. JobTracker HA

1. Install the HA JobTracker package on both nodes

sudo yum install hadoop-0.20-mapreduce-jobtrackerha

2. For automatic failover, also install ZKFC on both HA JobTracker nodes

sudo yum install hadoop-0.20-mapreduce-zkfc

3. Configure the HA JobTracker (mapred-site.xml)

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>logicaljt</value> 
<!-- host:port string is replaced with a logical name -->
</property>
<property>
<name>mapred.jobtrackers.logicaljt</name>
<value>jt1,jt2</value>
<description>Comma-separated list of JobTracker IDs.</description>
</property>
<property>
<name>mapred.jobtracker.rpc-address.logicaljt.jt1</name> 
<!-- RPC address for jt1 -->
<value>myjt1.myco.com:8021</value>
</property>
<property>
<name>mapred.jobtracker.rpc-address.logicaljt.jt2</name> 
<!-- RPC address for jt2 -->
<value>myjt2.myco.com:8022</value>
</property>
<property>
<name>mapred.job.tracker.http.address.logicaljt.jt1</name> 
<!-- HTTP bind address for jt1 -->
<value>0.0.0.0:50030</value>
</property>
<property>
<name>mapred.job.tracker.http.address.logicaljt.jt2</name> 
<!-- HTTP bind address for jt2 -->
<value>0.0.0.0:50031</value>
</property>
<property>
<name>mapred.ha.jobtracker.rpc-address.logicaljt.jt1</name> 
<!-- RPC address for jt1 HA daemon -->
<value>myjt1.myco.com:8023</value>
</property>
<property>
<name>mapred.ha.jobtracker.rpc-address.logicaljt.jt2</name> 
<!-- RPC address for jt2 HA daemon -->
<value>myjt2.myco.com:8024</value>
</property>
<property>
<name>mapred.ha.jobtracker.http-redirect-address.logicaljt.jt1</name> 
<!-- HTTP redirect address for jt1 -->
<value>myjt1.myco.com:50030</value>
</property>
<property>
<name>mapred.ha.jobtracker.http-redirect-address.logicaljt.jt2</name> 
<!-- HTTP redirect address for jt2 -->
<value>myjt2.myco.com:50031</value>
</property>
<property>
<name>mapred.jobtracker.restart.recover</name>
<value>true</value>
</property>

<property>
<name>mapred.job.tracker.persist.jobstatus.active</name>
<value>true</value>
</property>
<property>
<name>mapred.job.tracker.persist.jobstatus.hours</name>
<value>1</value>
</property>
<property>
<name>mapred.job.tracker.persist.jobstatus.dir</name>
<value>/jobtracker/jobsInfo</value>
</property>
<property>
<name>mapred.client.failover.proxy.provider.logicaljt</name>
<value>org.apache.hadoop.mapred.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>mapred.client.failover.max.attempts</name>
<value>15</value>
</property>
<property>
<name>mapred.client.failover.sleep.base.millis</name>
<value>500</value>
</property>
<property>
<name>mapred.client.failover.sleep.max.millis</name>
<value>1500</value> 
</property>
<property>
<name>mapred.client.failover.connection.retries</name>
<value>0</value> 
</property>
<property>
<name>mapred.client.failover.connection.retries.on.timeouts</name>
<value>0</value> 
</property>
<property>
<name>mapred.ha.fencing.methods</name>
<value>shell(/bin/true)</value>
</property>
</configuration>

4. Start the HA JobTracker on both HA JobTracker nodes

sudo service hadoop-0.20-mapreduce-jobtrackerha start

If automatic failover is not configured, both JobTrackers start in the standby state.

 

To query the state of a JobTracker, run:

sudo -u mapred hadoop mrhaadmin -getServiceState <id>

<id> is one of the JobTracker IDs listed in the value of mapred.jobtrackers.<name>, for example jt1 or jt2 in the configuration above.

To transition one JobTracker to the active state:

sudo -u mapred hadoop mrhaadmin -transitionToActive <id>
sudo -u mapred hadoop mrhaadmin -getServiceState <id>

5. Verify failover (manual failover)

sudo -u mapred hadoop mrhaadmin -failover <id_of_active_JobTracker> <id_of_inactive_JobTracker>

For example, to fail over from the faulty active jt1 to jt2, after which jt2 becomes active:

sudo -u mapred hadoop mrhaadmin -failover jt1 jt2 

If the failover succeeds, jt2 is now active, which can be confirmed with:

sudo -u mapred hadoop mrhaadmin -getServiceState jt2

6. Configure automatic failover

(1) Install and configure a ZooKeeper ensemble (the one used for HDFS HA can be shared)

(2) Configure the manual failover parameters: use the same mapred-site.xml as in step 3 above (the JobTracker IDs, RPC and HTTP addresses, client failover settings, and fencing method are identical).

(3) Configure the failover controller parameters

 mapred-site.xml:

<property>
<name>mapred.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>mapred.ha.zkfc.port</name>
<value>8018</value> 
<!-- Pick a different port for each failover controller when several run on one machine -->
</property>

 

core-site.xml:

<property>
<name>ha.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
<!-- ZK ensemble addresses -->
</property>

(4) Initialize the HA state in ZooKeeper (run this on one JobTracker only; the ZooKeeper ensemble must already be running)

sudo service hadoop-0.20-mapreduce-zkfc init

OR

sudo -u mapred hadoop mrzkfc -formatZK

(5) Start automatic failover

Start ZKFC and the JobTracker on every JobTracker node:

sudo service hadoop-0.20-mapreduce-zkfc start
sudo service hadoop-0.20-mapreduce-jobtrackerha start

(6) Verify automatic failover

First, find out which JobTracker is in the active state:

sudo -u mapred hadoop mrhaadmin -getServiceState <id>

Then kill the corresponding JVM process:

kill -9 <pid of JobTracker>

Finally, check whether the active state has moved to the other node.
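
Because the switch is not instantaneous, it can help to poll the surviving JobTracker rather than checking once. Below is a minimal sketch built on the mrhaadmin -getServiceState command shown above; wait_for_active, the MRHAADMIN override, and the default attempt count and delay are all assumptions, not recommended values.

```shell
# Command used to query JobTracker HA state; overridable so the loop can be dry-run.
MRHAADMIN="${MRHAADMIN:-sudo -u mapred hadoop mrhaadmin}"

# Poll one JobTracker ID until it reports "active" or the attempts run out.
# Arguments: <id> [attempts, default 30] [delay in seconds, default 2]
wait_for_active() {
    id="$1"; attempts="${2:-30}"; delay="${3:-2}"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ "$($MRHAADMIN -getServiceState "$id" 2>/dev/null)" = "active" ]; then
            echo "$id is active after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "$id did not become active" >&2
    return 1
}

# Possible usage after killing the active jt1:
#   wait_for_active jt2
```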
