Big Data Learning: Flume Introduction and Installation

Date: 2022-01-11 18:17:31

Flume


Experiment environment (shiyanlou):
- CentOS 6.6 (64-bit)
- JDK 1.7.0_55 (64-bit)
- Hadoop 1.1.2

Flume Introduction

Flume is a log collection system from Cloudera. It supports custom data senders plugged into the logging pipeline for gathering data, and it can apply simple processing to that data before writing it to a variety of (customizable) data receivers.
In short, Flume is a distributed, reliable, and highly available system for collecting, aggregating, and transporting large volumes of log data.

Flume Features

  • Reliability: data reliability, with three delivery guarantees: end-to-end, store on failure, and best effort
  • Scalability: all three major components (collector, master, and storage tier) scale out
  • Manageability: ZooKeeper and a gossip protocol keep configuration data consistent and highly available, and multiple masters can run at once
  • Extensibility: Flume is written in Java, so users can add new functionality to it

Flume Architecture

[Figure: Flume architecture diagram]

The most important abstraction is the data flow, which describes the path data follows from production through transport and processing to its final destination.
The solid lines in the figure above are data flows.
Agents collect data: an agent is where a data flow originates in Flume, and it forwards the flow it produces to a collector. The collector, correspondingly, aggregates incoming data, often producing one larger stream.

Flume can collect data from sources such as console, RPC (Thrift-RPC), text (flat files), tail (UNIX tail), syslog (the syslog logging system, in both TCP and UDP modes), and exec (command execution).
Its data receivers can likewise be console, text (flat files), dfs (HDFS files), RPC (Thrift-RPC), syslogTCP (TCP syslog), and so on.

Data collection has two main working modes:
- Push sources: the external system actively pushes data into Flume, e.g. RPC and syslog.
- Polling sources: Flume pulls data from the external system, generally by polling, e.g. text and exec.
Note that in Flume, agent pairs with collector, while source pairs with sink.
Source and sink capture the characteristics of the sending and receiving ends (data format, encoding, and so on), whereas agent and collector describe function.
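
As a concrete illustration, here is a minimal sketch of one source declaration per mode, in Flume NG properties syntax; the agent name a1, the port, and the tailed file are assumed values, and the channel wiring is omitted for brevity:

# Push source: Flume listens, and external systems send syslog records to it
a1.sources.push1.type = syslogtcp
a1.sources.push1.host = localhost
a1.sources.push1.port = 5140

# Polling source: Flume runs a command and keeps reading its output
a1.sources.poll1.type = exec
a1.sources.poll1.command = tail -F /var/log/messages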

The Flume master manages the data-flow configuration; multiple masters synchronize their state over a gossip protocol.
(Note that the collector/master architecture described above is that of classic Flume OG. The Flume NG 1.x release installed below drops the master and collector entirely: each agent is a self-contained pipeline of sources, channels, and sinks, which is the model the configurations below use.)

Installing and Deploying Flume

Download

http://flume.apache.org/download.html
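
Release 1.5.2 is an older version, so it may no longer appear on the download page; past releases are normally kept in the Apache archive, for example (URL assumed to follow the standard archive layout):

wget http://archive.apache.org/dist/flume/1.5.2/apache-flume-1.5.2-bin.tar.gz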

cd /home/shiyanlou/install-pack
tar -xzf flume-1.5.2-bin.tar.gz
mv apache-flume-1.5.2-bin /app/flume-1.5.2
# append the following lines to /etc/profile:
sudo vi /etc/profile
export FLUME_HOME=/app/flume-1.5.2
export FLUME_CONF_DIR=$FLUME_HOME/conf
export PATH=$PATH:$FLUME_HOME/bin
# reload the profile and confirm the change:
source /etc/profile
echo $PATH
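
With the profile reloaded, the installation can be sanity-checked from any directory; flume-ng version prints the release found on the PATH:

flume-ng version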

cd /app/flume-1.5.2/conf
cp flume-env.sh.template flume-env.sh
sudo vi flume-env.sh
# in flume-env.sh, point JAVA_HOME at the JDK and set the JVM options:
JAVA_HOME=/app/lib/jdk1.7.0_55
JAVA_OPTS="-Xms100m -Xmx200m -Dcom.sun.management.jmxremote"
# create the agent configuration from the template:
cp flume-conf.properties.template flume-conf.properties
sudo vi flume-conf.properties
# The configuration file needs to define the sources, the channels and the sinks.
# Sources, channels and sinks are defined per agent, in this case called 'a1'
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# For each one of the sources, the type is defined
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# The channel can be defined as follows.
a1.sources.r1.channels = c1
# Each sink's type must be defined
a1.sinks.k1.type = logger

#Specify the channel the sink should use
a1.sinks.k1.channel = c1

# Each channel's type is defined.
a1.channels.c1.type = memory
# Other config values specific to each type of channel (sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
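
In this wiring the netcat source feeds events into the memory channel and the logger sink drains them. For the memory channel, capacity is the maximum number of events held in memory and transactionCapacity is the maximum number of events moved in or out per transaction; since events are buffered only in memory, anything still in the channel is lost if the agent dies, which is the trade-off for its speed.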
cd /app/flume-1.5.2
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name a1 -Dflume.root.logger=INFO,console
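
In this command, --conf points at the configuration directory (so flume-env.sh is picked up), --conf-file names the properties file to load, --name must match the agent name used inside that file (a1), and -Dflume.root.logger=INFO,console sends the agent's log output to the console, which makes the test easy to observe.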

The following test cannot be run in the shiyanlou environment:
Open a second terminal

sudo yum install telnet    # if telnet is not already installed
telnet localhost 44444
hello world

Back in the original terminal, the agent logs the messages sent from telnet.
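
If telnet cannot be installed, any client that opens a TCP connection to port 44444 works; for instance netcat, assuming it is available in the image, can send a line the same way:

echo "hello world" | nc localhost 44444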

cd /app/flume-1.5.2/conf
# second configuration: tail the namenode log with an exec source and write to HDFS
cp flume-conf.properties.template flume-conf2.properties
sudo vi flume-conf2.properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /app/hadoop-1.1.2/logs/hadoop-shiyanlou-namenode-b393a04554e1.log
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/class12/out_flume
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.rollSize = 4000000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.batchSize = 10
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
cd /app/flume-1.5.2
./bin/flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf2.properties --name a1 -Dflume.root.logger=INFO,console

The agent now continuously collects lines appended to the namenode log and writes them into HDFS.
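
A note on the HDFS sink settings above: rollSize = 4000000 closes a file once roughly 4 MB has been written, and rollCount = 0 disables rolling by event count. The hdfs.round* options only take effect when the path contains time escape sequences; a time-bucketed layout would look like the sketch below (hdfs.useLocalTimeStamp is needed here because the exec source does not add a timestamp header to events):

a1.sinks.k1.hdfs.path = hdfs://hadoop:9000/class12/out_flume/%y-%m-%d/%H%M
a1.sinks.k1.hdfs.useLocalTimeStamp = true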

View the files under /class12/out_flume on HDFS:

hadoop fs -ls /class12/out_flume
hadoop fs -cat /class12/out_flume/events-.1433921305493
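
The numeric suffix after the events- prefix is a millisecond timestamp that the HDFS sink uses as a file counter; a file that is still being written additionally carries a .tmp suffix until it is rolled.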