Flume Installation and Configuration
Contents
Flume
CentOS: https://blog.****.net/qq_39160721/article/details/80255194
Linux: https://blog.****.net/u011254180/article/details/80000763
Documentation: http://flume.apache.org/FlumeUserGuide.html
Honeypot system: Linux
The project uses Flume 1.8
Download details: https://blog.****.net/qq_41910230/article/details/80920873
https://yq.aliyun.com/ask/236859
Channel configuration details: http://www.cnblogs.com/gongxijun/p/5661037.html
Reference for downloading directly under Linux (Linuxidc): https://www.linuxidc.com/Linux/2016-12/138722.htm
Installation packages and jar files used for this setup: https://download.****.net/download/lagoon_lala/10949262
How Flume Works
Flume's data flow is built around events (Event) from start to finish. An event is Flume's basic unit of data: it carries the log data (as a byte array) together with header information. Events are produced by the Source from data arriving outside the Agent; when the Source captures an event it applies a specific format and then pushes the event into one or more Channels. A Channel can be thought of as a buffer that holds the event until a Sink has finished processing it. The Sink is responsible for persisting the log or pushing the event on to another Source. The core concepts are:
(1) Event: a unit of data with an optional message header; it can be a log record, an Avro object, etc.
(2) Agent: an independent Flume process in a JVM, containing the Source, Channel, and Sink components.
(3) Client: runs in an independent thread; produces data and sends it to the Agent.
(4) Source: consumes the Events delivered to it, collecting data from the Client and passing it to the Channel.
(5) Channel: temporary storage that relays Events; it holds the Events handed over by the Source and essentially connects the Source and the Sink, somewhat like a message queue.
(6) Sink: collects data from the Channel; runs in an independent thread.
Flume's smallest independently running unit is the Agent; one Agent is one JVM. A single Agent is made up of the three components Source, Sink, and Channel (a minimal configuration sketch follows below).
Note that Flume ships with a large number of built-in Source, Channel, and Sink types, and different types can be combined freely. The combinations are driven by a user-written configuration file, which makes Flume very flexible. For example, a Channel can keep events in memory or persist them to local disk, and a Sink can write logs to HDFS, HBase, Elasticsearch, or even another Source. Flume also lets users build multi-level flows, i.e. multiple Agents can work together.
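As an illustration only, here is a minimal agent definition in Flume's properties format showing how a Source, Channel, and Sink are named and wired together. The agent name demo, the component names, and the port are made up for this sketch; a concrete, tested configuration appears in the netcat example below.
# one agent called "demo" with one source, one channel, one sink
demo.sources = s1
demo.channels = c1
demo.sinks = k1
# the source reads lines from a TCP port and puts each line on channel c1 as an event
demo.sources.s1.type = netcat
demo.sources.s1.bind = localhost
demo.sources.s1.port = 55555
demo.sources.s1.channels = c1
# the channel buffers events in memory until the sink has taken them
demo.channels.c1.type = memory
# the sink drains events from c1 and writes them to Flume's own log
demo.sinks.k1.type = logger
demo.sinks.k1.channel = c1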
Flume Installation
Download and Extract
Download command: wget
Only the Flume binary package (bin) needs to be downloaded
Official download: http://mirrors.hust.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
$ wget http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.6.0-cdh5.7.1.tar.gz
Output:
$ wget http://mirrors.hust.edu.cn/apache/flume/1.8.0/apache-flume-1.8.0-bin.tar.gz
--2018-12-24 15:36:52--  apache-flume-1.8.0-bi 100%[=======================>] 55.97M 5.21MB/s in 12s
2018-12-24 15:37:04 (4.83 MB/s) - ‘apache-flume-1.8.0-bin.tar.gz’ saved [58688757/58688757]
$ ls now shows apache-flume-1.8.0-bin.tar.gz
$ tar -xvf flume-ng-1.6.0-cdh5.7.1.tar.gz
Output:
$ tar -xvf apache-flume-1.8.0-bin.tar.gz
$ ls now shows apache-flume-1.8.0-bin  apache-flume-1.8.0-bin.tar.gz
$ rm flume-ng-1.6.0-cdh5.7.1.tar.gz
$ mv apache-flume-1.6.0-cdh5.7.1-bin flume-1.6.0-cdh5.7.1
(The delete and rename steps were not actually performed.)
Configure Environment Variables
$ cd /home/hadoop
$ vim .bash_profile   (this file was not found; the system probably uses .profile, but creating .bash_profile also works)
export FLUME_HOME=/home/hadoop/app/cdh/flume-1.6.0-cdh5.7.1
export PATH=$PATH:$FLUME_HOME/bin
Commands run:
$ cd ~
$ vim .bash_profile
export FLUME_HOME=~/software/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin
$ source .bash_profile
Output:
-bash: export: `/home/user/software/apache-flume-1.8.0-bin': not a valid identifier
The error went away after removing the space following the equals sign in the FLUME_HOME line.
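For reference, a minimal sketch of the broken and the corrected assignment (same path as above); in shell syntax there must be no space around the = in an assignment:
# wrong: the space after = makes the shell pass the path to export as a variable name
export FLUME_HOME= ~/software/apache-flume-1.8.0-bin
# right
export FLUME_HOME=~/software/apache-flume-1.8.0-bin
export PATH=$PATH:$FLUME_HOME/bin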
Configure flume-env.sh
Edit flume-env.sh under conf and set JAVA_HOME in it
$ cd app/cdh/flume-1.6.0-cdh5.7.1/conf/
$ cp flume-env.sh.template flume-env.sh
$ vim flume-env.sh
export JAVA_HOME=/home/hadoop/app/jdk1.7.0_79
export HADOOP_HOME=/home/hadoop/app/cdh/hadoop-2.6.0-cdh5.7.1
Commands run (JDK location: /home/user/jdk1.8.0_171, Hadoop location: /home/user/hadoop):
~/software/apache-flume-1.8.0-bin/conf$ cp flume-env.sh.template flume-env.sh
$ vim flume-env.sh
export JAVA_HOME=/home/user/jdk1.8.0_171
export HADOOP_HOME=/home/user/hadoop
Original comments in the file:
# If this file is placed at FLUME_CONF_DIR/flume-env.sh, it will be sourced during Flume startup.
# Enviroment variables can be set here.
# export JAVA_HOME=/usr/lib/jvm/java-8-oracle
# Give Flume more memory and pre-allocate, enable remote monitoring via JMX
# export JAVA_OPTS="-Xms100m -Xmx2000m -Dcom.sun.management.jmxremote"
# Let Flume write raw event data and configuration information to its log files for debugging purposes. Enabling these flags is not recommended in production,
# as it may result in logging sensitive user information or encryption secrets.
# export JAVA_OPTS="$JAVA_OPTS -Dorg.apache.flume.log.rawdata=true -Dorg.apache.flume.log.printconfig=true "
# Note that the Flume conf directory is always included in the classpath.
# FLUME_CLASSPATH=""
Verify the Version
$ flume-ng version
Output:
-bash: flume: command not found
The version check failed; copy .bash_profile to the home directory
Commands run:
~/hadoop$ cp .bash_profile ~/
~$ source .bash_profile
The version check now succeeds
Output:
Flume 1.8.0
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: 99f591994468633fc6f8701c5fc53e0214b6da4f
Compiled by denes on Fri Sep 15 14:58:00 CEST 2017
From source with checksum fbb44c8c8fb63a49be0a59e27316833d
Flume Deployment Examples
Avro
Flume can listen on a port and capture the data sent to it. A concrete example (using the netcat source) follows:
// Create a Flume configuration file
$ cd app/cdh/flume-1.6.0-cdh5.7.1
$ mkdir example
$ cp conf/flume-conf.properties.template example/netcat.conf
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ mkdir example
$ cp conf/flume-conf.properties.template example/netcat.conf
Check:
~/software/apache-flume-1.8.0-bin/example$ ls
netcat.conf
// Configure netcat.conf to capture, in real time, data typed in another terminal
$ vim example/netcat.conf
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink
a1.sinks.k1.type = logger
# Use a channel that buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Commands run:
$ vim netcat.conf
The template file originally contains:
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'agent'
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink
# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = seq
# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.loggerSink.type = logger
# Specify the channel the sink should use
agent.sinks.loggerSink.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel (sink or source) can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
Replace this template content with the configuration shown above.
// Run the Flume agent, listening on port 44444 of the local machine
$ flume-ng agent -c conf -f example/netcat.conf -n a1 -Dflume.root.logger=INFO,console
Command run (absolute path; replace with your own):
$ flume-ng agent -c conf -f ~/software/apache-flume-1.8.0-bin/example/netcat.conf -n a1 -Dflume.root.logger=INFO,console
// Open another terminal, connect to port 44444 on localhost with telnet, and type some test data
$ telnet localhost 44444
Still the same after installing telnet; tried writing to the VM's port from Windows instead
Switched to nc to connect to port 44444 on localhost and type test data:
nc -v localhost 44444
// Check the data collected by Flume
Spool
The spooldir source monitors a configured directory for new files and reads the data out of them. Two points to note: files copied into the spool directory must not be opened and edited afterwards, and the spool directory must not contain subdirectories. A concrete example follows:
// Create two Flume configuration files
$ cd app/cdh/flume-1.6.0-cdh5.7.1
$ cp conf/flume-conf.properties.template example/spool1.conf
$ cp conf/flume-conf.properties.template example/spool2.conf
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ cp conf/flume-conf.properties.template example/spool1.conf
$ cp conf/flume-conf.properties.template example/spool2.conf
// Configure spool1.conf to monitor the files in the avro_data directory and send their contents to local port 60000
// The monitored directory needs to be changed to your own
Commands run:
$ vim example/spool1.conf
# Name the components
local1.sources = r1
local1.sinks = k1
local1.channels = c1
# Source
local1.sources.r1.type = spooldir
local1.sources.r1.spoolDir = ~/avro_data
# Sink
local1.sinks.k1.type = avro
local1.sinks.k1.hostname = localhost
local1.sinks.k1.port = 60000
# Channel
local1.channels.c1.type = memory
# Bind the source and sink to the channel
local1.sources.r1.channels = c1
local1.sinks.k1.channel = c1
(Note: the spooldir source does not expand ~, so spoolDir should be an absolute path such as /home/user/avro_data.)
// Configure spool2.conf to receive data from local port 60000 and write it to HDFS
Create an HDFS test directory first (see below)
$ vim example/spool2.conf
# Name the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = localhost
a1.sources.r1.port = 60000
# Sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:9000/user/wcbdd/flumeData
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.fileType = DataStream
# Channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
// Open two terminals and run the following commands to start the two Flume agents
$ flume-ng agent -c conf -f example/spool2.conf -n a1
$ flume-ng agent -c conf -f example/spool1.conf -n local1
Commands run:
$ cd ~/software/apache-flume-1.8.0-bin
$ flume-ng agent -c conf -f example/spool2.conf -n a1
$ flume-ng agent -c conf -f example/spool1.conf -n local1
// Check the monitored avro_data directory in the local file system (the files did not exist; create the folder locally and the folder in HDFS)
$ cd avro_data
$ cat avro_data.txt
Output:
-bash: cd: avro_data/: No such file or directory
cat: avro_data.txt: No such file or directory
Commands run:
~$ mkdir avro_data
~/avro_data$ touch avro_data.txt
Create the HDFS folder
Original path: hdfs://localhost:9000/user/wcbdd/flumeData
Change the watched directory and the write path in the spool configuration files
Look up the command for creating a folder in HDFS; a sketch is shown below.
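A hedged sketch, assuming HDFS is running at hdfs://localhost:9000 as configured above and keeping the /user/wcbdd/flumeData path from the configuration (adjust it to your own user):
$ hdfs dfs -mkdir -p /user/wcbdd/flumeData    # create the target directory, including parent directories
$ hdfs dfs -ls /user/wcbdd                    # confirm that flumeData now exists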
Sink configuration details:
https://blog.****.net/xiaolong_4_2/article/details/81945204
Writing to MongoDB
http://www.cnblogs.com/cswuyg/p/4498804.html
https://java-my-life.iteye.com/blog/2238085
https://blog.****.net/tinico/article/details/41079825?utm_source=blogkpcl14
Flume + MongoDB streaming log collection:
https://wenku.baidu.com/view/66f1e436ba68a98271fe910ef12d2af90242a81b.html
Download the source of the MongoDB plugin, mongosink (to be packaged as a jar), and the MongoDB Java driver
mongosink download: https://github.com/leonlee/flume-ng-mongodb-sink
Steps from the repository README:
Clone the repository
Install latest Maven and build source by 'mvn package'
Generate classpath by 'mvn dependency:build-classpath'
Append classpath in $FLUME_HOME/conf/flume-env.sh
Add the sink definition according to Configuration
That is: build with Maven, let it download the dependencies, append the generated classpath to flume-env.sh, configure the sink definition according to the sink's documentation on the project page, and package the jar. A sketch of these commands is shown below.
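A hedged sketch of these build steps on the Linux side, assuming git and Maven are installed and FLUME_HOME points at the Flume installation; the name of the jar produced under target/ depends on the project's pom.xml:
$ git clone https://github.com/leonlee/flume-ng-mongodb-sink.git
$ cd flume-ng-mongodb-sink
$ mvn package                        # builds the sink jar under target/
$ mvn dependency:build-classpath     # prints the dependency classpath
# append the printed classpath to FLUME_CLASSPATH in $FLUME_HOME/conf/flume-env.sh,
# and copy the built jar plus the MongoDB Java driver jar into $FLUME_HOME/lib/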
Packaging mongosink into a jar
Project error: The method configure(Context) of type MongoSink must override a superclass method
https://blog.****.net/kwuwei/article/details/38365839
On inspection, the compiler compliance level was already 1.8
The cause was that the Build Path had not been updated
Packaging method:
https://blog.****.net/ssbb1995/article/details/78983915
cd to the directory containing pom.xml, then run mvn clean package
Error:
[ERROR] Unknown lifecycle phase "?clean?package". You must specify a valid lifecycle phase or a goal in the format <plugin-prefix>:<goal> or <plugin-group-id>:<plugin-artifact-id>[:<plugin-version>]:<goal>. Available lifecycle phases are: validate, initialize, generate-sources, process-sources, generate-resources, process-resources, compile, process-classes, generate-test-sources, process-test-sources, generate-test-resources, process-test-resources, test-compile, process-test-classes, test, prepare-package, package, pre-integration-test, integration-test, post-integration-test, verify, install, deploy, pre-site, site, post-site, site-deploy, pre-clean, clean, post-clean. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/LifecyclePhaseNotFoundException
The "?clean?package" in the message suggests the command was pasted with non-ASCII whitespace between the words, so Maven saw it as a single unknown phase. First tried building a jar with Eclipse's built-in mvn install instead.
Tried a Maven build in Eclipse: https://blog.****.net/qq_28553681/article/details/80988190
Entered clean package in the Goals field; the packaging succeeded, showing:
[INFO] Building jar: E:\studyMaterial\work\eclipse\flume-ng-mongodb-sink\target\flume-ng-mongodb-sink-1.0.0.jar
Flume configuration reference: https://www.cnblogs.com/ywjy/p/5255161.html (which cites https://blog.****.net/tinico/article/details/41079825). A sketch of a MongoDB sink configuration is given below.
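A hedged sketch of an agent that receives Avro events and writes them to MongoDB with the sink built above. The class name org.riderzen.flume.sink.MongoSink and the property names (host, port, model, db, collection, batch) are assumptions taken from the flume-ng-mongodb-sink README, so verify them against the version you built; the host, database, and collection values are placeholders, and the port matches the mongod command below:
# agent a1: avro source -> memory channel -> MongoDB sink
a1.sources = r1
a1.channels = c1
a1.sinks = m1
a1.sources.r1.type = avro
a1.sources.r1.bind = localhost
a1.sources.r1.port = 60000
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
# assumed sink class and property names -- check the plugin README
a1.sinks.m1.type = org.riderzen.flume.sink.MongoSink
# replace with the IP of the machine running mongod
a1.sinks.m1.host = localhost
a1.sinks.m1.port = 65521
a1.sinks.m1.model = single
a1.sinks.m1.db = flume_test
a1.sinks.m1.collection = log
a1.sinks.m1.batch = 10
a1.sinks.m1.channel = c1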
Start MongoDB
D:\Program Files\MongoDB\Server\3.4\bin>mongod.exe --port 65521 --dbpath "D:\MongoDB\DBData"
or
mongod --dbpath="D:\MongoDB\DBData"
Start Flume
Error:
2019-02-03 19:47:47,926 (New I/O worker #1) [WARN - org.apache.avro.ipc.NettyServer$NettyServerAvroHandler.exceptionCaught(NettyServer.java:201)] Unexpected exception from downstream.
org.apache.avro.AvroRuntimeException: Excessively large list allocation request detected: 1863125608 items! Connection closed.
    at org.apache.avro.ipc.NettyTransportCodec$NettyFrameDecoder.decodePackHeader(NettyTransportCodec.java:167)
    at org.apache.avro.ipc.NettyTransportCodec$NettyFrameDecoder.decode(NettyTransportCodec.java:139)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Changing the MongoDB port in the Flume configuration did not help
Consulted https://blog.****.net/ma0903/article/details/48209681?utm_source=blogxgwz1
It may be that the protocol the Flume side is receiving does not match the protocol the client is using to send data.
For example: Flume is listening with an Avro source, but the client is sending plain TCP. A sketch of sending data in Avro format is shown below.
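As a hedged check, data can be sent in Avro format with the flume-ng avro-client, assuming the agent's Avro source is listening on localhost:60000 as in the spool example; /tmp/test.txt is a placeholder for any existing file to send:
$ flume-ng avro-client -H localhost -p 60000 -F /tmp/test.txt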
Configuring Flume on Windows
1. Download apache-flume-1.8.0-bin.tar.gz from the Apache Flume website (http://flume.apache.org/download.html)
(http://www.apache.org/dist/flume/1.8.0/)
Open index.html in the docs folder of the extracted archive to read the documentation locally
2. Extract to a directory, e.g. D:\software\apache-flume-1.8.0-bin
3. Create a FLUME_HOME variable set to the Flume installation directory D:\software\apache-flume-1.8.0-bin
4. Edit the system Path variable and append %FLUME_HOME%\conf and %FLUME_HOME%\bin
5. Copy the three files under flume\conf, renaming them to drop the .template suffix
Press Win+R, type cmd to open a command window, and run
flume-ng version works, which shows the environment is OK. (A sketch of setting the variables from the command line follows.)
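As an alternative to the GUI steps above, a hedged sketch of setting the variables from a command prompt; the path is the example directory from step 2, and setx only takes effect in newly opened windows:
setx FLUME_HOME "D:\software\apache-flume-1.8.0-bin"
setx Path "%Path%;D:\software\apache-flume-1.8.0-bin\conf;D:\software\apache-flume-1.8.0-bin\bin"
:: open a new cmd window, then verify
flume-ng version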
Used 1.9 directly without configuring environment variables, added an example.conf file, and got an error when running:
flume-ng agent --conf ../conf --conf-file ../conf/example.conf --name a1 -property flume.root.logger=INFO,console
D:\Program Files (x86)\flume\win-apache-flume-1.9.0-bin\apache-flume-1.9.0-bin\bin>powershell.exe -NoProfile -InputFormat none -ExecutionPolicy unrestricted -File D:\Program Files (x86)\flume\win-apache-flume-1.9.0-bin\apache-flume-1.9.0-bin\bin\flume-ng.ps1 agent --conf ../conf --conf-file ../conf/example.conf --name a1 -property flume.root.logger=INFO,console
Processing -File "D:\Program" failed because the file does not have a '.ps1' extension. Specify a valid Windows PowerShell script file name, and then try again.
Cause: the launcher calls a .bat file that cannot handle spaces in the path, so change the installation path. https://blog.****.net/yanhuatangtang/article/details/80404097
After re-extracting to a path without spaces: flume-ng version works in the bin directory but not from the default directory (even though the environment variables were already configured)
Checking the version from the bin directory succeeds but shows the following warnings (note: come back and fix these if problems appear later):
WARN: Config directory not set. Defaulting to D:\Programs\flume\apache-flume-1.8.0-bin\conf
Sourcing environment configuration script D:\Programs\flume\apache-flume-1.8.0-bin\conf\flume-env.ps1
WARN: Did not find D:\Programs\flume\apache-flume-1.8.0-bin\conf\flume-env.ps1
WARN: HADOOP_PREFIX or HADOOP_HOME not found
WARN: HADOOP_PREFIX not set. Unable to include Hadoop's classpath & java.library.path
WARN: HBASE_HOME not found
WARN: HIVE_HOME not found
Tested using the following guide: https://blog.****.net/ycf921244819/article/details/80341502
The earlier configuration was all fine, but when running telnet from a second window it showed:
Connecting to localhost... Could not open a connection to the host, on port 50000: Connect failed
CentOS needs the network to be connected manually (top-right corner) after every boot
192.168.43.156
Configure the Flume environment variables
Change JAVA_HOME in flume-env.sh under flume/conf
Use java -verbose to find the JDK location
(Locally it is C:\Program Files\Java\jdk1.8.0_131)