Setting Up a Hadoop Cluster on Linux

Date: 2022-11-29 03:32:11

I. Environment Overview

IP Address       Hostname    Role        OS
192.168.92.11    hserver1    namenode    Ubuntu 16.04
192.168.92.12    hserver2    datanode    Ubuntu 16.04
192.168.92.13    hserver3    datanode    Ubuntu 16.04

II. Environment Initialization

1. Disable the Firewall

If you build the cluster on CentOS, the firewall must be disabled first. This article uses Ubuntu, so this step can be skipped.
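For reference, on CentOS the firewall would typically be stopped and disabled like this (not needed on the Ubuntu hosts used in this article):

systemctl stop firewalld
systemctl disable firewalld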

2. Configure Hostnames

Set the hostnames of the three machines to hserver1, hserver2, and hserver3 respectively:

hostnamectl set-hostname hserver1
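Run the corresponding command on the other two machines:

root@hserver2:~# hostnamectl set-hostname hserver2
root@hserver3:~# hostnamectl set-hostname hserver3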

After setting the hostnames, add the hostname-to-IP mappings to the hosts file on each machine:

root@hserver1:~# cat >> /etc/hosts << EOF
192.168.92.11 hserver1
192.168.92.12 hserver2
192.168.92.13 hserver3
EOF

3. Generate SSH Keys and Configure Passwordless Authentication

First, generate an SSH key pair on each of the three machines:

root@hserver1:~# ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''

Then configure passwordless login to all three machines:

root@hserver1:~# apt-get install sshpass -y
root@hserver1:~# for host in 192.168.92.{11..13} hserver{1..3}; do ssh-keyscan $host >>~/.ssh/known_hosts 2>/dev/null; done
root@hserver1:~# for host in 192.168.92.{11..13}; do sshpass -p'123456' ssh-copy-id root@$host &>/dev/null; done
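To verify that passwordless login works, the following should print each remote hostname without prompting for a password:

root@hserver1:~# for host in hserver{1..3}; do ssh root@$host hostname; done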

III. Installing JDK and Hadoop

1. Install the JDK

Install openjdk-8-jdk-headless on each of the three machines:

root@hserver1:~# apt-get install openjdk-8-jdk-headless -y

Check the Java version:

root@hserver1:~# java -version
openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)

Configure the environment variables in the /etc/profile file:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=$CLASSPATH:$JAVA_HOME/lib/
export PATH=$PATH:$JAVA_HOME/bin
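Reload the profile and confirm that JAVA_HOME resolves as expected:

root@hserver1:~# source /etc/profile
root@hserver1:~# echo $JAVA_HOME
/usr/lib/jvm/java-8-openjdk-amd64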

2. Download Hadoop

[Note]: All of the following operations must be performed on all three machines.

Hadoop download link: https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz

Version 2.9.2 (binary package) is used here. Upload the downloaded archive to the servers and extract it to the /opt/hadoop directory:

root@hserver1:~# mkdir /opt/hadoop
root@hserver1:~# tar zxf hadoop-2.9.2.tar.gz -C /opt/hadoop/
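If the servers have direct internet access, the archive can also be downloaded on each machine instead of being uploaded manually (a sketch using the Apache archive mirror; adjust the URL if a different mirror is preferred):

root@hserver1:~# wget https://archive.apache.org/dist/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz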

After extraction, create the following directories on each server:

mkdir /usr/local/hadoop
mkdir /usr/local/hadoop/tmp
mkdir /usr/local/hadoop/var
mkdir /usr/local/hadoop/dfs
mkdir /usr/local/hadoop/dfs/name
mkdir /usr/local/hadoop/dfs/data
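Equivalently, the whole directory tree can be created with a single command:

mkdir -p /usr/local/hadoop/{tmp,var,dfs/name,dfs/data}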

3. Configure Hadoop

A number of files under the /opt/hadoop/hadoop-2.9.2/etc/hadoop directory need to be modified:

  1. Edit the core-site.xml file and add the following properties inside the <configuration> block:
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://hserver1:9000</value>
</property>
  2. Edit the hdfs-site.xml file and add the following properties inside the <configuration> block:
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/dfs/name</value>
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/dfs/data</value>
<description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
<description>Disable permission checking.</description>
</property>

[Note] Setting dfs.permissions.enabled to false allows files to be created on HDFS without any permission checks. To guard against accidental deletion, set it to true instead, or simply remove this property, since the default is already true.

  3. Make a copy of mapred-site.xml.template named mapred-site.xml, then add the following properties inside the <configuration> block:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>hserver1:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>hserver1:19888</value>
</property>
  4. Edit the slaves file: remove the localhost entry and add the following lines:
hserver2
hserver3
  5. Edit the yarn-site.xml file and add the following properties inside the <configuration> block:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hserver1</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
<description>The address of the applications manager interface in the RM.</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
<description>The address of the scheduler interface.</description>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hserver1:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hserver1:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
<description>The http address of the RM web application.</description>
</property>
<property>
<description>The http address of the RM web application.</description>
<name>yarn.resourcemanager.webapp.https.address</name>
<value>${yarn.resourcemanager.hostname}:8090</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1024</value>
<description>The maximum allocation for every container request at the RM, in MB; the default is 8192 MB.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>

[Note] yarn.nodemanager.vmem-check-enabled controls whether virtual memory usage is checked. Setting it to false (skipping the check) is very useful when the cluster runs on virtual machines and avoids problems in later steps. On physical machines with plenty of memory, this property can be removed.

  6. Edit the hadoop-env.sh script and set the JAVA_HOME variable as follows:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
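Since the configuration must be identical on all three machines, one option is to edit the files once on hserver1 and then copy the configuration directory to the other nodes (a sketch, assuming Hadoop was extracted to the same path on every host):

root@hserver1:~# for host in hserver2 hserver3; do scp -r /opt/hadoop/hadoop-2.9.2/etc/hadoop/* root@$host:/opt/hadoop/hadoop-2.9.2/etc/hadoop/; done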

4. Initialize Hadoop

Because hserver1 is the namenode and hserver2 and hserver3 are datanodes, only hserver1 needs to be initialized, i.e. HDFS needs to be formatted. On hserver1, change into the /opt/hadoop/hadoop-2.9.2/bin directory and run the following command:

root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hadoop namenode -format
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
20/05/06 16:19:26 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hserver1/192.168.92.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.9.2
STARTUP_MSG: classpath = /opt/hadoop/hadoop-2.9.2/etc/hadoop:/opt/hadoop/hadoop-
...
...
STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 826afbeae31ca687bc2f8471dc841b66ed2c6704; compiled by 'ajisaka' on 2018-11-13T12:42Z
STARTUP_MSG: java = 1.8.0_252
************************************************************/
20/05/06 16:19:26 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
20/05/06 16:19:26 INFO namenode.NameNode: createNameNode [-format]
20/05/06 16:19:27 WARN common.Util: Path /usr/local/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
20/05/06 16:19:27 WARN common.Util: Path /usr/local/hadoop/dfs/name should be specified as a URI in configuration files. Please update hdfs configuration.
Formatting using clusterid: CID-18e78322-4eac-4cf8-8b79-737a015623ca
20/05/06 16:19:27 INFO namenode.FSEditLog: Edit logging is async:true
20/05/06 16:19:27 INFO namenode.FSNamesystem: KeyProvider: null
20/05/06 16:19:27 INFO namenode.FSNamesystem: fsLock is fair: true
20/05/06 16:19:27 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
20/05/06 16:19:27 INFO namenode.FSNamesystem: fsOwner = root (auth:SIMPLE)
20/05/06 16:19:27 INFO namenode.FSNamesystem: supergroup = supergroup
20/05/06 16:19:27 INFO namenode.FSNamesystem: isPermissionEnabled = false
20/05/06 16:19:27 INFO namenode.FSNamesystem: HA Enabled: false
20/05/06 16:19:27 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
20/05/06 16:19:27 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
20/05/06 16:19:27 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
20/05/06 16:19:27 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
20/05/06 16:19:27 INFO blockmanagement.BlockManager: The block deletion will start around 2020 May 06 16:19:27
20/05/06 16:19:27 INFO util.GSet: Computing capacity for map BlocksMap
20/05/06 16:19:27 INFO util.GSet: VM type = 64-bit
20/05/06 16:19:27 INFO util.GSet: 2.0% max memory 966.7 MB = 19.3 MB
20/05/06 16:19:27 INFO util.GSet: capacity = 2^21 = 2097152 entries
20/05/06 16:19:27 INFO blockmanagement.BlockManager: dfs.block.access.token.enable=false
20/05/06 16:19:27 WARN conf.Configuration: No unit for dfs.heartbeat.interval(3) assuming SECONDS
20/05/06 16:19:27 WARN conf.Configuration: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
20/05/06 16:19:27 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
20/05/06 16:19:27 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
20/05/06 16:19:27 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
20/05/06 16:19:27 INFO blockmanagement.BlockManager: defaultReplication = 2
20/05/06 16:19:27 INFO blockmanagement.BlockManager: maxReplication = 512
20/05/06 16:19:27 INFO blockmanagement.BlockManager: minReplication = 1
20/05/06 16:19:27 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
20/05/06 16:19:27 INFO blockmanagement.BlockManager: replicationRecheckInterval = 3000
20/05/06 16:19:27 INFO blockmanagement.BlockManager: encryptDataTransfer = false
20/05/06 16:19:27 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
20/05/06 16:19:27 INFO namenode.FSNamesystem: Append Enabled: true
20/05/06 16:19:28 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
20/05/06 16:19:28 INFO util.GSet: Computing capacity for map INodeMap
20/05/06 16:19:28 INFO util.GSet: VM type = 64-bit
20/05/06 16:19:28 INFO util.GSet: 1.0% max memory 966.7 MB = 9.7 MB
20/05/06 16:19:28 INFO util.GSet: capacity = 2^20 = 1048576 entries
20/05/06 16:19:28 INFO namenode.FSDirectory: ACLs enabled? false
20/05/06 16:19:28 INFO namenode.FSDirectory: XAttrs enabled? true
20/05/06 16:19:28 INFO namenode.NameNode: Caching file names occurring more than 10 times
20/05/06 16:19:28 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: falseskipCaptureAccessTimeOnlyChange: false
20/05/06 16:19:28 INFO util.GSet: Computing capacity for map cachedBlocks
20/05/06 16:19:28 INFO util.GSet: VM type = 64-bit
20/05/06 16:19:28 INFO util.GSet: 0.25% max memory 966.7 MB = 2.4 MB
20/05/06 16:19:28 INFO util.GSet: capacity = 2^18 = 262144 entries
20/05/06 16:19:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
20/05/06 16:19:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
20/05/06 16:19:28 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
20/05/06 16:19:28 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
20/05/06 16:19:28 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
20/05/06 16:19:28 INFO util.GSet: Computing capacity for map NameNodeRetryCache
20/05/06 16:19:28 INFO util.GSet: VM type = 64-bit
20/05/06 16:19:28 INFO util.GSet: 0.029999999329447746% max memory 966.7 MB = 297.0 KB
20/05/06 16:19:28 INFO util.GSet: capacity = 2^15 = 32768 entries
20/05/06 16:19:28 INFO namenode.FSImage: Allocated new BlockPoolId: BP-2038544107-192.168.92.11-1588753168340
20/05/06 16:19:28 INFO common.Storage: Storage directory /usr/local/hadoop/dfs/name has been successfully formatted.
20/05/06 16:19:28 INFO namenode.FSImageFormatProtobuf: Saving image file /usr/local/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
20/05/06 16:19:28 INFO namenode.FSImageFormatProtobuf: Image file /usr/local/hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 322 bytes saved in 0 seconds .
20/05/06 16:19:28 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
20/05/06 16:19:28 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hserver1/192.168.92.11
************************************************************/

If no errors appear in the output above, the initialization succeeded. After it completes, a new current directory and several files can be found under /usr/local/hadoop/dfs/name/:

root@hserver1:/usr/local/hadoop/dfs/name# tree /usr/local/hadoop/dfs/name/
/usr/local/hadoop/dfs/name/
└── current
    ├── fsimage_0000000000000000000
    ├── fsimage_0000000000000000000.md5
    ├── seen_txid
    └── VERSION

1 directory, 4 files

5. Start Hadoop

Because hserver1 is the namenode and hserver2 and hserver3 are datanodes, the start command only needs to be run on hserver1:

root@hserver1:/usr/local/hadoop/dfs/name# cd /opt/hadoop/hadoop-2.9.2/sbin/
root@hserver1:/opt/hadoop/hadoop-2.9.2/sbin# ./start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [hserver1]
hserver1: starting namenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-namenode-hserver1.out
hserver2: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-hserver2.out
hserver3: starting datanode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-datanode-hserver3.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:EluzQS5IRZaQAqRlc2O+h1rOS7jfaBSNlmgKqeknA6c.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-hserver1.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-resourcemanager-hserver1.out
hserver3: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-hserver3.out
hserver2: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.2/logs/yarn-root-nodemanager-hserver2.out

The first time this command is run it prompts for confirmation; enter yes to continue.
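To confirm the daemons actually started, jps can be run on each node (an illustrative check; the process IDs will differ):

root@hserver1:~# jps    # expect NameNode, SecondaryNameNode and ResourceManager (plus Jps)
root@hserver2:~# jps    # expect DataNode and NodeManager (plus Jps)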

6. Test Hadoop

After starting Hadoop, verify that it started successfully.

Open the namenode's address 192.168.92.11:50070 in a browser to reach the HDFS overview page.

Open 192.168.92.11:8088 to reach the cluster page.
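The cluster status can also be checked from the command line; the report should list the two live datanodes and their capacity (a quick verification sketch):

root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfsadmin -report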

IV. Uploading Local Files to HDFS

  1. First, create a directory to hold the uploaded files:
root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfs -mkdir /upload

# Use the -p option to create nested directories
  2. The newly created directory and its details can now be seen on the namenode overview page under Utilities → Browse the file system.

  3. Upload the local file /home/test.log to the HDFS file system:

# The first path is the file's location on the server; the second is the destination path in HDFS
root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfs -put /home/test.log /upload
root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfs -ls /upload
Found 1 items
-rw-r--r-- 2 root supergroup 24 2020-05-07 14:47 /upload/test.log
  4. After the upload completes, the file's information can be viewed in the browser under the corresponding directory.
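The upload can also be confirmed from the command line by printing the file or copying it back out of HDFS (a short sketch; /tmp/test.log is just an example local destination):

root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfs -cat /upload/test.log
root@hserver1:/opt/hadoop/hadoop-2.9.2/bin# ./hdfs dfs -get /upload/test.log /tmp/test.log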