1. Hardware:
PCs (one per cluster node)
2. Software:
Apache Hadoop 2.7.2
Ubuntu 16.04
JDK 1.8.0_111
3. Preparation:
1. Set the hostname and edit the hosts file
2. Set up passwordless SSH login
#!/bin/bash
# Hadoop SSH setup script (intended to run on each slave node)
master_username="guozihao"
master_hostname="gzh_master"
echo "${master_username},${master_hostname}"
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cd ~/.ssh
touch authorized_keys
cat id_rsa.pub >> authorized_keys
echo "Appended this host's own public key to authorized_keys"
chmod 600 authorized_keys
# This config entry targets the master. Without it, when the local and
# remote usernames differ, passwordless login only works as
#   ssh remote_username@remote_hostname
# and not as
#   ssh remote_hostname
touch config
printf 'Host %s\n  User %s\n' "${master_hostname}" "${master_username}" > config
echo "Updated .ssh/config"
scp ${master_username}@${master_hostname}:~/.ssh/id_rsa.pub ./master_rsa.pub
cat master_rsa.pub >> authorized_keys
echo "Appended the master's public key to authorized_keys"
cd ~/Applications/hadoop-2.7.3/etc/hadoop
scp ${master_username}@${master_hostname}:~/Applications/hadoop-2.7.3/etc/hadoop/* ~/Applications/hadoop-2.7.3/etc/hadoop/
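After running the script, the local SSH state can be sanity-checked without contacting the master. A minimal sketch, using a temporary directory as a stand-in for `~/.ssh` (on a real node, point `sshdir` at `~/.ssh` instead; the hostname and username are the example values from the script above):

```shell
# Recreate the config/authorized_keys artifacts in a temp dir and verify
# the Host -> User mapping and permissions. ssh silently ignores key
# material with loose permissions, so the chmod matters.
sshdir="$(mktemp -d)"
printf 'Host %s\n  User %s\n' "gzh_master" "guozihao" > "${sshdir}/config"
touch "${sshdir}/authorized_keys"
chmod 600 "${sshdir}/config" "${sshdir}/authorized_keys"
perms="$(stat -c '%a' "${sshdir}/authorized_keys")"
echo "authorized_keys perms: ${perms}"
```

A full end-to-end check would be `ssh gzh_master true`, which requires the master to be online.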
3. Install the JDK
4. Unpack Hadoop
4. Edit the configuration files:
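Unpacking the release is a single tar invocation. A minimal sketch, using a fabricated stand-in tarball in a temp directory (the tarball name and target directory are examples; use the release you actually downloaded):

```shell
# Sketch of the unpack step. A dummy hadoop-2.7.2 tree stands in for the
# real release tarball so the commands can be exercised anywhere.
work="$(mktemp -d)"
cd "${work}"
mkdir -p hadoop-2.7.2/bin
tar -czf hadoop-2.7.2.tar.gz hadoop-2.7.2      # stand-in for the downloaded tarball
mkdir -p "${work}/Applications"
tar -xzf hadoop-2.7.2.tar.gz -C "${work}/Applications"
ls "${work}/Applications"
```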
- hadoop-env.sh
- core-site.xml
- mapred-site.xml
- hdfs-site.xml
- yarn-site.xml
- slaves
- fairscheduler.xml
Edit ./etc/hadoop/hadoop-env.sh under the Hadoop directory.
Change: set JAVA_HOME to your own JDK installation path.
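The JAVA_HOME change in hadoop-env.sh can be scripted with sed. A minimal sketch, using a temp file as a stand-in for the real hadoop-env.sh; the JDK path is an example, substitute your own:

```shell
# Point JAVA_HOME at the local JDK in hadoop-env.sh.
env_file="$(mktemp)"                               # stand-in for etc/hadoop/hadoop-env.sh
echo 'export JAVA_HOME=${JAVA_HOME}' > "${env_file}"   # the stock placeholder line
sed -i 's|^export JAVA_HOME=.*|export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_111|' "${env_file}"
cat "${env_file}"
```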
Edit ./etc/hadoop/core-site.xml under the Hadoop directory. Changes:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://gzh_master:8020</value><!-- hostname and port of the active NameNode -->
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
Edit ./etc/hadoop/mapred-site.xml under the Hadoop directory.
Changes:
<!-- MR YARN Application properties -->
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs.
Can be one of local, classic or yarn.
</description>
</property>
<!-- jobhistory properties -->
<property>
<name>mapreduce.jobhistory.address</name>
<value>lys_sbmaster:10020</value><!-- hostname of the standby NameNode; 10020 is the conventional JobHistory IPC port -->
<description>MapReduce JobHistory Server IPC host:port</description>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>lys_sbmaster:19888</value><!-- hostname of the standby NameNode; 19888 is the conventional JobHistory web UI port -->
<description>MapReduce JobHistory Server Web UI host:port</description>
</property>
</configuration>
Edit ./etc/hadoop/hdfs-site.xml under the Hadoop directory.
Changes:
<configuration>
<property>
<name>dfs.nameservices</name>
<value>hadoop-test</value><!-- logical name of the nameservice; may be customized -->
<description>
Comma-separated list of nameservices.
</description>
</property>
<property>
<name>dfs.ha.namenodes.hadoop-test</name>
<value>nn1,nn2</value><!-- logical IDs for the two NameNodes; may be customized -->
<description>
The prefix for a given nameservice, contains a comma-separated
list of namenodes for a given nameservice (eg EXAMPLENAMESERVICE).
</description>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoop-test.nn1</name>
<!-- derived from the nameservice name and NameNode IDs defined above -->
<value>gzh_master:8020</value>
<description>
RPC address for namenode1 of hadoop-test
</description>
</property>
<property>
<name>dfs.namenode.rpc-address.hadoop-test.nn2</name>
<value>lys_sbmaster:8020</value>
<description>
RPC address for namenode2 of hadoop-test
</description>
</property>
<property>
<name>dfs.namenode.http-address.hadoop-test.nn1</name>
<value>gzh_master:50070</value>
<description>
The address and the base port where the dfs namenode1 web ui will listen on.
</description>
</property>
<property>
<name>dfs.namenode.http-address.hadoop-test.nn2</name>
<value>lys_sbmaster:50070</value>
<description>
The address and the base port where the dfs namenode2 web ui will listen on.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/guozihao/Applications/hadoop/hdfs/name</value><!-- local paths where the NameNode stores its metadata; multiple comma-separated paths may be listed; the standby NameNode should use its own path -->
<description>Determines where on the local filesystem the DFS name node
should store the name table(fsimage). If this is a comma-delimited list
of directories then the name table is replicated in all of the
directories, for redundancy. </description>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://lys_sbmaster:8485;yk_slave:8485;wss_slave:8485/hadoop-test</value><!-- list every JournalNode host; the default port is 8485 -->
<description>A directory on shared storage between the multiple namenodes
in an HA cluster. This directory will be written by the active and read
by the standby in order to keep the namespaces synchronized. This directory
does not need to be listed in dfs.namenode.edits.dir above. It should be
left empty in a non-HA cluster.
</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/guozihao/Applications/hadoop/hdfs/data</value><!-- each DataNode must set this to its own local storage path -->
<description>Determines where on the local filesystem an DFS data node
should store its blocks. If this is a comma-delimited
list of directories, then data will be stored in all named
directories, typically on different devices.
Directories that do not exist are ignored.
</description>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>false</value>
<description>
Whether automatic failover is enabled. See the HDFS High
Availability documentation for details on automatic HA
configuration.
</description>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/home/guozihao/Applications/hadoop/hdfs/journal/</value>
</property>
</configuration>
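The local directories referenced above (the name, data, and journal dirs) must exist and be writable on the respective nodes before the cluster is started. A minimal sketch, with a temp directory standing in for the real base path `/home/guozihao/Applications/hadoop`:

```shell
# Create the HDFS metadata/data/journal directories referenced in
# hdfs-site.xml. ${base} stands in for the real per-node base path.
base="$(mktemp -d)"
mkdir -p "${base}/hdfs/name" "${base}/hdfs/data" "${base}/hdfs/journal"
ls "${base}/hdfs"
```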
Edit ./etc/hadoop/yarn-site.xml under the Hadoop directory.
Changes:
<configuration>
<!-- Resource Manager Configs -->
<property>
<description>The hostname of the RM.</description>
<name>yarn.resourcemanager.hostname</name>
<value>gzh_master</value> <!-- hostname of the ResourceManager (here, the active NameNode's host) -->
</property>
<property>
<description>The address of the applications manager interface in the RM.</description>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<description>The address of the scheduler interface.</description>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<description>The http address of the RM web application.</description>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<description>The https address of the RM web application.</description>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>${yarn.resourcemanager.hostname}:8090</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<description>The address of the RM admin interface.</description>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8033</value>
</property>
<property>
<description>The class to use as the resource scheduler.</description>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<property>
<description>fair-scheduler conf location</description>
<name>yarn.scheduler.fair.allocation.file</name>
<value>${yarn.home.dir}/etc/hadoop/fairscheduler.xml</value>
</property>
<property>
<description>List of directories to store localized files in. An
application's localized file directory will be found in:
${yarn.nodemanager.local-dirs}/usercache/${user}/appcache/application_${appid}.
Individual containers' work directories, called container_${contid}, will
be subdirectories of this.
</description>
<name>yarn.nodemanager.local-dirs</name>
<value>/home/guozihao/Applications/hadoop/yarn/local</value>
</property>
<property>
<description>Whether to enable log aggregation</description>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<description>Where to aggregate logs to.</description>
<name>yarn.nodemanager.remote-app-log-dir</name>
<value>/tmp/logs</value>
</property>
<property>
<description>Amount of physical memory, in MB, that can be allocated
for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>20480</value> <!-- physical memory in MB on each node that YARN may allocate to containers -->
</property>
<property>
<description>Number of CPU cores that can be allocated
for containers.</description>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>12</value> <!-- number of CPU cores on each node that YARN may allocate to containers -->
</property>
<property>
<description>the valid service name should only contain a-zA-Z0-9_ and can not start with numbers</description>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Edit ./etc/hadoop/slaves under the Hadoop directory.
Changes (list the hostnames of all DataNodes, one per line):
lys_sbmaster
yk_slave
wss_slave
Edit ./etc/hadoop/fairscheduler.xml under the Hadoop directory.
Changes:
<?xml version="1.0"?>
<allocations>
<!-- divides cluster resources among queues -->
<queue name="infrastructure">
<minResources>102400 mb, 50 vcores </minResources>
<maxResources>153600 mb, 100 vcores </maxResources>
<maxRunningApps>200</maxRunningApps>
<minSharePreemptionTimeout>300</minSharePreemptionTimeout>
<weight>1.0</weight>
<aclSubmitApps>root,yarn,search,hdfs</aclSubmitApps>
</queue>
<queue name="tool">
<minResources>102400 mb, 30 vcores</minResources>
<maxResources>153600 mb, 50 vcores</maxResources>
</queue>
<queue name="sentiment">
<minResources>102400 mb, 30 vcores</minResources>
<maxResources>153600 mb, 50 vcores</maxResources>
</queue>
</allocations>
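One sanity check worth doing on the allocations file: the total memory YARN can hand out is the node count times yarn.nodemanager.resource.memory-mb, and a queue's minResources cannot actually be guaranteed beyond that. With the three slave nodes and the values configured above, the arithmetic is:

```shell
# Cluster memory vs. queue minResources, using the values from the configs
# above (3 slaves, 20480 MB per NodeManager, infrastructure queue min).
nodes=3                  # lys_sbmaster, yk_slave, wss_slave
per_node_mb=20480        # yarn.nodemanager.resource.memory-mb
total_mb=$((nodes * per_node_mb))
queue_min_mb=102400      # minResources of the "infrastructure" queue
echo "cluster: ${total_mb} MB, queue min: ${queue_min_mb} MB"
```

Here the configured minResources (102400 MB) exceed the 61440 MB the cluster can actually offer, so the minimum shares cannot all be met; scale the queue values to your real cluster size.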
5. Starting the Hadoop system
Note: run all commands from the Hadoop deployment directory.
Starting the cluster:
Step 1:
On each JournalNode host, start the journalnode service:
sbin/hadoop-daemon.sh start journalnode
Step 2:
On [nn1], format HDFS and start the NameNode:
bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
Step 3:
On [nn2], sync nn1's metadata:
bin/hdfs namenode -bootstrapStandby
Step 4:
Start [nn2]:
sbin/hadoop-daemon.sh start namenode
After these four steps, both nn1 and nn2 are in standby state.
Step 5:
Switch [nn1] to active:
bin/hdfs haadmin -transitionToActive nn1
Step 6:
On [nn1], start all DataNodes:
sbin/hadoop-daemons.sh start datanode
Stopping the cluster:
On [nn1], run:
sbin/stop-dfs.sh
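The start-up steps above can be collected into a small helper script. This is a hedged sketch, not part of the original guide: by default it only prints each command (dry run); set RUN=1 on a configured cluster, on the appropriate host for each step, to actually execute them.

```shell
# Dry-run wrapper for the HA start-up sequence described above.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "would run: $*"; fi
}
run sbin/hadoop-daemon.sh start journalnode    # Step 1, on each JournalNode
run bin/hdfs namenode -format                  # Step 2, on nn1 (first start only)
run sbin/hadoop-daemon.sh start namenode       # Step 2, on nn1
run bin/hdfs namenode -bootstrapStandby        # Step 3, on nn2
run sbin/hadoop-daemon.sh start namenode       # Step 4, on nn2
run bin/hdfs haadmin -transitionToActive nn1   # Step 5, on nn1
run sbin/hadoop-daemons.sh start datanode      # Step 6, on nn1
```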