Flume Installation and Configuration
Contents
Flume
CentOS: https://blog.****.net/qq_39160721/article/details/80255194
Linux: https://blog.****.net/u011254180/article/details/80000763
Documentation: http://flume.apache.org/FlumeUserGuide.html
Honeypot system: Linux
The project uses Flume 1.8
Download details: https://blog.****.net/qq_41910230/article/details/80920873
https://yq.aliyun.com/ask/236859
Channel configuration details: http://www.cnblogs.com/gongxijun/p/5661037.html
Reference for downloading directly under Linux (Linuxidc): https://www.linuxidc.com/Linux/2016-12/138722.htm
Installation packages and jar files used for this setup: https://download.****.net/download/lagoon_lala/10949262
How Flume Works
Flume's data flow is built around events (Event) from start to finish. An event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are produced by the Source from data arriving outside the Agent; when the Source captures an event it applies a specific format and then pushes the event into one or more Channels. A Channel can be thought of as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source. The core concepts are:
(1) Event: a unit of data with an optional message header; it can be a log record, an Avro object, etc.
(2) Agent: an independent Flume process in a JVM, containing the Source, Channel, and Sink components.
(3) Client: runs in an independent thread; produces data and sends it to the Agent.
(4) Source: consumes the Events delivered to it, collecting data from the Client and passing it to the Channel.
(5) Channel: temporary storage that relays Events; it holds the Events handed over by the Source and essentially connects the Source and the Sink, somewhat like a message queue.
(6) Sink: collects data from the Channel; runs in an independent thread.
Flume's smallest independently running unit is the Agent; one Agent is one JVM. A single Agent is made up of the three components Source, Sink, and Channel (a minimal configuration sketch follows below).
Note that Flume ships with a large number of built-in Source, Channel, and Sink types, and different types can be combined freely. The combinations are driven by a user-written configuration file, which makes Flume very flexible. For example, a Channel can keep events in memory or persist them to local disk, and a Sink can write logs to HDFS, HBase, Elasticsearch, or even another Source. Flume also lets users build multi-level flows, i.e. multiple Agents can work together.
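As an illustration only, here is a minimal agent definition in Flume's properties format showing how a Source, Channel, and Sink are named and wired together. The agent name demo, the component names, and the port are made up for this sketch; a concrete, tested configuration appears in the netcat example below.
# one agent called "demo" with one source, one channel, one sink
demo.sources = s1
demo.channels = c1
demo.sinks = k1
# the source reads lines from a TCP port and puts each line on channel c1 as an event
demo.sources.s1.type = netcat
demo.sources.s1.bind = localhost
demo.sources.s1.port = 55555
demo.sources.s1.channels = c1
# the channel buffers events in memory until the sink has taken them
demo.channels.c1.type = memory
# the sink drains events from c1 and writes them to Flume's own log
demo.sinks.k1.type = logger
demo.sinks.k1.channel = c1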
Flume Installation
Download and Extract
Download command: wget
Only the Flume binary package (bin) needs to be downloaded
Official download: http://mirrors.hust.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
$ wget http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.1.tar.gz
Output:
$ wget http://mirrors.hust.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
--2018-12-24 15:36:52--  apache-flume-1.8.0-bi 100%[=======================>] 55.97M 5.21MB/s in 12s
2018-12-24 15:37:04 (4.83 MB/s) - ‘apache-flume-1.8.0-bin.tar.gz’ saved [58688757/58688757]
$ ls now shows apache-flume-1.8.0-bin.tar.gz
$ tar -xvf flume-ng-1.6.0-cdh5.7.1.tar.gz
Output:
$ tar -xvf apache-flume-1.8.0-bin.tar.gz
$ ls now shows apache-flume-1.8.0-bin  apache-flume-1.8.0-bin.tar.gz
$ rm flume-ng-1.6.0-cdh5.7.1.tar.gz
$ mv apache-flume-1.6.0-cdh5.7.1-bin flume-1.6.0-cdh5.7.1
(The delete and rename steps were not actually performed.)
Configure Environment Variables
$ cd /home/hadoop
$ vim .bash_profile   (this file was not found; the system probably uses .profile, but creating .bash_profile also works)
export FLUME_HOME=/home/hadoop/app/cdh/flume-1.6.0-cdh5.7.1
export PATH=$PATH:$FLUME_HOME/bin
Commands run:
$ cd ~
$ vim .bash_profile
export FLUME_HOME=~/software/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin
$ source .bash_profile
Output:
-bash: export: `/home/user/software/apache-flume-1.8.0-bin': not a valid identifier
The error went away after removing the space following the equals sign in the FLUME_HOME line.
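For reference, a minimal sketch of the broken and the corrected assignment (same path as above); in shell syntax there must be no space around the = in an assignment:
# wrong: the space after = makes the shell pass the path to export as a variable name
export FLUME_HOME= ~/software/apache-flume-1.8.0-bin
# right
export FLUME_HOME=~/software/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin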
Configure flume-env.sh
Edit flume-env.sh under conf and set JAVA_HOME in it
$ cd app/cdh/flume-1.6.0-cdh5.7.1/conf/
$ cp flume-env.sh.template flume-env.sh
$ vim flume-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
export HADOOP_HOME=/home/hadoop/app/cdh/hadoop-2.6.0-cdh5.7.1
Commands run (JDK location: /home/user/jdk1.8.0_171, Hadoop location: /home/user/hadoop):
~/software/apache-flume-1.8.0-bin/conf$ cp flume-env.sh.template flume-env.sh
$ vim flume-env.sh
export JAVA_HOME=/home/user/jdk1.8.0_171
export HADOOP_HOME=/home/user/hadoop
Original comments in the file:
# If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced during Flume startup.
# Enviroment variables can be set here.
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
# Let Flume write raw event data and configuration information to its log files for debugging purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "
# Note that the Flume conf directory is always included in the classpath.
# FLUME_CLASSPATH=""
Verify the Version
$ flume-ng version
Output:
-bash: flume: command not found
The version check failed; copy .bash_profile to the home directory
Commands run:
~/hadoop$ cp .bash_profile ~/
~$ source .bash_profile
The version check now succeeds
Output:
Flume 1.8.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 99f591994468633fc6f8701c5fc53e0214b6da4f
Compiled by denes on Fri Sep 15 14:58:00 CEST 2017
From source with checksum fbb44c8c8fb63a49be0a59e27316833d
Flume Deployment Examples
Avro
Flume can listen on a port and capture the data sent to it. A concrete example (using the netcat source) follows:
// Create a Flume configuration file
$ cd app/cdh/flume-1.6.0-cdh5.7.1
$ mkdir example
$ cp conf/flume-conf.properties.template example/netcat.conf
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ mkdir example
$ cp conf/flume-conf.properties.template example/netcat.conf
Check:
~/software/apache-flume-1.8.0-bin/example$ ls
netcat.conf
// Configure netcat.conf to capture, in real time, data typed in another terminal
$ vim example/netcat.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel that buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Commands run:
$ vim netcat.conf
The template file originally contains:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'agent'
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = seq
# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.loggerSink.type = logger
# Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source) can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
Replace this template content with the configuration shown above.
// Run the Flume agent, listening on port 44444 of the local machine
$ flume-ng agent -c conf -f example/netcat.conf -n a1 -Dflume.root.logger=INFO,console
Command run (absolute path; replace with your own):
$ flume-ng agent -c conf -f ~/software/apache-flume-1.8.0-bin/example/netcat.conf -n a1 -Dflume.root.logger=INFO,console
// Open another terminal, connect to port 44444 on localhost with telnet, and type some test data
$ telnet localhost 44444
Still the same after installing telnet; tried writing to the VM's port from Windows instead
Switched to nc to connect to port 44444 on localhost and type test data:
nc -v localhost 44444
// Check the data collected by Flume
Spool
The spooldir source monitors a configured directory for new files and reads the data out of them. Two points to note: files copied into the spool directory must not be opened and edited afterwards, and the spool directory must not contain subdirectories. A concrete example follows:
// Create two Flume configuration files
$ cd app/cdh/flume-1.6.0-cdh5.7.1
$ cp conf/flume-conf.properties.template example/spool1.conf
$ cp conf/flume-conf.properties.template example/spool2.conf
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ cp conf/flume-conf.properties.template example/spool1.conf
$ cp conf/flume-conf.properties.template example/spool2.conf
// Configure spool1.conf to monitor the files in the avro_data directory and send their contents to local port 60000
// The monitored directory needs to be changed to your own
Commands run:
$ vim example/spool1.conf
# Name the components
local1.sources = r1
local1.sinks = k1
local1.channels = c1
# Source
local1.sources.r1.type = spooldir
local1.sources.r1.spoolDir = ~/avro_data
# Sink
local1.sinks.k1.type = avro
local1.sinks.k1.hostname = localhost
local1.sinks.k1.port = 60000
# Channel
local1.channels.c1.type = memory
# Bind the source and sink to the channel
local1.sources.r1.channels = c1
local1.sinks.k1.channel = c1
(Note: the spooldir source does not expand ~, so spoolDir should be an absolute path such as /home/user/avro_data.)
// Configure spool2.conf to receive data from local port 60000 and write it to HDFS
Create an HDFS test directory first (see below)
$ vim example/spool2.conf
# Name the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = localhost
a1.sources.r1.port = 60000
# Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/user/wcbdd/flumeData
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
// Open two terminals and run the following commands to start the two Flume agents
$ flume-ng agent -c conf -f example/spool2.conf -n a1
$ flume-ng agent -c conf -f example/spool1.conf -n local1
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ flume-ng agent -c conf -f example/spool2.conf -n a1
$ flume-ng agent -c conf -f example/spool1.conf -n local1
// Check the monitored avro_data directory in the local file system (the files did not exist; create the folder locally and the folder in HDFS)
$ cd avro_data
$ cat avro_data.txt
Output:
-bash: cd: avro_data/: No such file or directory
cat: avro_data.txt: No such file or directory
Commands run:
~$ mkdir avro_data
~/avro_data$ touch avro_data.txt
Create the HDFS folder
Original path: hdfs://localhost:9000/user/wcbdd/flumeData
Change the watched directory and the write path in the spool configuration files
Look up the command for creating a folder in HDFS; a sketch is shown below.
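A hedged sketch, assuming HDFS is running at hdfs://localhost:9000 as configured above and keeping the /user/wcbdd/flumeData path from the configuration (adjust it to your own user):
$ hdfs dfs -mkdir -p /user/wcbdd/flumeData    # create the target directory, including parent directories
$ hdfs dfs -ls /user/wcbdd                    # confirm that flumeData now exists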
Sink configuration details:
https://blog.****.net/xiaolong_4_2/article/details/81945204
Writing to MongoDB
http://www.cnblogs.com/cswuyg/p/4498804.html
https://java-my-life.iteye.com/blog/2238085
https://blog.****.net/tinico/article/details/41079825?utm_source=blogkpcl14
Flume + MongoDB streaming log collection:
https://wenku.baidu.com/view/66f1e436ba68a98271fe910ef12d2af90242a81b.html
Download the source of the MongoDB plugin, mongosink (to be packaged as a jar), and the MongoDB Java driver
mongosink download: https://github.com/leonlee/flume-ng-mongodb-sink
Steps from the repository README:
Clone the repository
Install latest Maven and build source by 'mvn package'
Generate classpath by 'mvn dependency:build-classpath'
Append classpath in $FLUME_HOME/conf/flume-env.sh
Add the sink definition according to Configuration
That is: build with Maven, let it download the dependencies, append the generated classpath to flume-env.sh, configure the sink definition according to the sink's documentation on the project page, and package the jar. A sketch of these commands is shown below.
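A hedged sketch of these build steps on the Linux side, assuming git and Maven are installed and FLUME_HOME points at the Flume installation; the name of the jar produced under target/ depends on the project's pom.xml:
$ git clone https://github.com/leonlee/flume-ng-mongodb-sink.git
$ cd flume-ng-mongodb-sink
$ mvn package                        # builds the sink jar under target/
$ mvn dependency:build-classpath     # prints the dependency classpath
# append the printed classpath to FLUME_CLASSPATH in $FLUME_HOME/conf/flume-env.sh,
# and copy the built jar plus the MongoDB Java driver jar into $FLUME_HOME/lib/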
Packaging mongosink into a jar
Project error: The method configure(Context) of type MongoSink must override a superclass method
https://blog.****.net/kwuwei/article/details/38365839
On inspection, the compiler compliance level was already 1.8
The cause was that the Build Path had not been updated
Packaging method:
https://blog.****.net/ssbb1995/article/details/78983915
cd to the directory containing pom.xml, then run mvn clean package
Error:
[ERROR] Unknown lifecycle phase "?clean?package". You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-site, site, post-site, site-deploy, pre-clean, clean, post-clean. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/LifecyclePhaseNotFoundException
The "?clean?package" in the message suggests the command was pasted with non-ASCII whitespace between the words, so Maven saw it as a single unknown phase. First tried building a jar with Eclipse's built-in mvn install instead.
Tried a Maven build in Eclipse: https://blog.****.net/qq_28553681/article/details/80988190
Entered clean package in the Goals field; the packaging succeeded, showing:
[INFO] Building jar: E:\studyMaterial\work\eclipse\flume-ng-mongodb-sink\target\flume-ng-mongodb-sink-1.0.0.jar
Flume configuration reference: https://www.cnblogs.com/ywjy/p/5255161.html (which cites https://blog.****.net/tinico/article/details/41079825). A sketch of a MongoDB sink configuration is given below.
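A hedged sketch of an agent that receives Avro events and writes them to MongoDB with the sink built above. The class name org.riderzen.flume.sink.MongoSink and the property names (host, port, model, db, collection, batch) are assumptions taken from the flume-ng-mongodb-sink README, so verify them against the version you built; the host, database, and collection values are placeholders, and the port matches the mongod command below:
# agent a1: avro source -> memory channel -> MongoDB sink
a1.sources = r1
a1.channels = c1
a1.sinks = m1
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 60000
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# assumed sink class and property names -- check the plugin README
a1.sinks.m1.type = org.riderzen.flume.sink.MongoSink
# replace with the IP of the machine running mongod
a1.sinks.m1.host = localhost
a1.sinks.m1.port = 65521
a1.sinks.m1.model = single
a1.sinks.m1.db = flume_test
a1.sinks.m1.collection = log
a1.sinks.m1.batch = 10
a1.sinks.m1.channel = c1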
Start MongoDB
D:\Program Files\MongoDB\Server\3.4\bin>mongod.exe --port 65521 --dbpath "D:\MongoDB\DBData"
or
mongod --dbpath="D:\MongoDB\DBData"
Start Flume
Error:
2019-02-03 19:47:47,926 (New I/O worker #1) [WARN - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.exceptionCaught(NettyServer.java:201)] Unexpected exception from downstream.
org.apache.avro.AvroRuntimeException: Excessively large list allocation request detected: 1863125608 items! Connection closed.
    at org.apache.avro.ipc.NettyTransportCodec$NettyFrameDecoder.decodePackHeader(NettyTransportCodec.java:167)
    at org.apache.avro.ipc.NettyTransportCodec$NettyFrameDecoder.decode(NettyTransportCodec.java:139)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Changing the MongoDB port in the Flume configuration did not help
Consulted https://blog.****.net/ma0903/article/details/48209681?utm_source=blogxgwz1
It may be that the protocol the Flume side is receiving does not match the protocol the client is using to send data.
For example: Flume is listening with an Avro source, but the client is sending plain TCP. A sketch of sending data in Avro format is shown below.
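As a hedged check, data can be sent in Avro format with the flume-ng avro-client, assuming the agent's Avro source is listening on localhost:60000 as in the spool example; /tmp/test.txt is a placeholder for any existing file to send:
$ flume-ng avro-client -H localhost -p 60000 -F /tmp/test.txt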
Configuring Flume on Windows
1. Download apache-flume-1.8.0-bin.tar.gz from the Apache Flume website (http://flume.apache.org/download.html)
(http://www.apache.org/dist/flume/1.8.0/)
Open index.html in the docs folder of the extracted archive to read the documentation locally
2. Extract to a directory, e.g. D:\software\apache-flume-1.8.0-bin
3. Create a FLUME_HOME variable set to the Flume installation directory D:\software\apache-flume-1.8.0-bin
4. Edit the system Path variable and append %FLUME_HOME%\conf and %FLUME_HOME%\bin
5. Copy the three files under flume\conf, renaming them to drop the .template suffix
Press Win+R, type cmd to open a command window, and run
flume-ng version works, which shows the environment is OK. (A sketch of setting the variables from the command line follows.)
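As an alternative to the GUI steps above, a hedged sketch of setting the variables from a command prompt; the path is the example directory from step 2, and setx only takes effect in newly opened windows:
setx FLUME_HOME "D:\software\apache-flume-1.8.0-bin"
setx Path "%Path%;D:\software\apache-flume-1.8.0-bin\conf;D:\software\apache-flume-1.8.0-bin\bin"
:: open a new cmd window, then verify
flume-ng version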
Used 1.9 directly without configuring environment variables, added an example.conf file, and got an error when running:
flume-ng agent --conf ../conf --conf-file ../conf/example.conf --name a1 -property flume.root.logger=INFO,console
D:\Program Files (x86)\flume\win-apache-flume-1.9.0-bin\apache-flume-1.9.0-bin\bin>powershell.exe -NoProfile -InputFormat none -ExecutionPolicy unrestricted -File D:\Program Files (x86)\flume\win-apache-flume-1.9.0-bin\apache-flume-1.9.0-bin\bin\flume-ng.ps1 agent --conf ../conf --conf-file ../conf/example.conf --name a1 -property flume.root.logger=INFO,console
Processing -File "D:\Program" failed because the file does not have a '.ps1' extension. Specify a valid Windows PowerShell script file name, and then try again.
Cause: the launcher calls a .bat file that cannot handle spaces in the path, so change the installation path. https://blog.****.net/yanhuatangtang/article/details/80404097
After re-extracting to a path without spaces: flume-ng version works in the bin directory but not from the default directory (even though the environment variables were already configured)
Checking the version from the bin directory succeeds but shows the following warnings (note: come back and fix these if problems appear later):
WARN: Config directory not set. Defaulting to D:\Programs\flume\apache-flume-1.8.0-bin\conf
Sourcing environment configuration script D:\Programs\flume\apache-flume-1.8.0-bin\conf\flume-env.ps1
WARN: Did not find D:\Programs\flume\apache-flume-1.8.0-bin\conf\flume-env.ps1
WARN: HADOOP_PREFIX or HADOOP_HOME not found
WARN: HADOOP_PREFIX not set. Unable to include Hadoop's classpath & java.library.path
WARN: HBASE_HOME not found
WARN: HIVE_HOME not found
Tested using the following guide: https://blog.****.net/ycf921244819/article/details/80341502
The earlier configuration was all fine, but when running telnet from a second window it showed:
Connecting to localhost... Could not open a connection to the host, on port 50000: Connect failed
CentOS needs the network to be connected manually (top-right corner) after every boot
192.168.43.156
Configure the Flume environment variables
Change JAVA_HOME in flume-env.sh under flume/conf
Use java -verbose to find the JDK location
(Locally it is C:\Program Files\Java\jdk1.8.0_131)