Flume Agent Configuration


As promised in the previous post, this post looks in detail at the properties in the Flume agent configuration file. For ease of understanding, we will use the same flume.conf file created in that post.

The Flume agent configuration file flume.conf follows the Java properties file format, with hierarchical property settings. The filename flume.conf is not fixed: we can give the file any name, as long as we pass the same name as <conf-file> when starting the agent with the flume-ng command.

Flume Agent Configuration:

We will describe the properties in our flume.conf file section by section.

First section:

Agent1.sources = netcat-source
Agent1.channels = memory-channel
Agent1.sinks = logger-sink

These first three lines name the agent and define the sources, sinks, and channels associated with it.

The first qualifier in the three lines above is the agent name. We can give the agent any name, but it should start with a letter, not with a digit or special character.

The second qualifier identifies the component type: sources, channels or sinks. These keywords are fixed and cannot be replaced with other names to refer to the same components.

The right-hand side values are simply the names given to the agent's three components. They can be any strings without spaces. Descriptive names are optional but preferable, because they make log messages easier to debug. To specify multiple values on one line, separate them with spaces.

For example, netcat-source is a single value, but if we write it as netcat source, it is treated as two sources (netcat and source) for the same agent.
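
As a minimal sketch of the space-separated form, the lines below declare two sources and two sinks for one agent. The names avro-source and hdfs-sink are hypothetical and are not part of the flume.conf used in this post; each declared component would still need its own type and channel settings before the agent could start.

```properties
# Hypothetical example: two sources and two sinks on the same agent.
# avro-source and hdfs-sink are illustrative names only.
Agent1.sources = netcat-source avro-source
Agent1.channels = memory-channel
Agent1.sinks = logger-sink hdfs-sink
```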

Second section:

Agent1.sources.netcat-source.type = netcat
Agent1.sources.netcat-source.bind = localhost
Agent1.sources.netcat-source.port = 44444

These lines configure the source. The first qualifier is the agent name, the same as in the first section. The second qualifier is the reserved keyword sources. The third qualifier is the source name declared in the first section. The fourth qualifier is the name of a property of that source, and the right-hand side is the value for that property. We can set as many properties as the source type supports.
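
As a reading aid, the four-qualifier pattern can be written as a commented template. The angle-bracket placeholders are labels made up for this illustration; they are not Flume syntax.

```properties
# <agent name>.<component keyword>.<component name>.<property> = <value>
Agent1.sources.netcat-source.port = 44444
```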

Netcat Source:

Since we are using the Netcat source, the configuration values specify how it should bind to the network.

The netcat source behaves like the netcat utility: it opens the specified port and listens for data. It expects the supplied data to be newline-separated text. Each line of text is turned into a Flume event and sent on through the connected channel.

Below are some of the properties that can be set on the netcat source. Required properties are marked in the Required column.

Property Name     Required   Default       Description
type              yes        -             The component type name, needs to be netcat
bind              yes        -             Host name or IP address to bind to
port              yes        -             Port # to bind to
max-line-length   no         512           Max line length per event body (in bytes)
ack-every-event   no         true          Respond with an "OK" for every event received
selector.type     no         replicating   replicating or multiplexing
selector.*        no         -             Depends on the selector.type value
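
As a small sketch, two of the optional properties from the table could be set for our source as follows. The values 1024 and false are chosen only for illustration and are not part of the flume.conf used in this post.

```properties
# Illustrative values only: allow longer lines and suppress the "OK" reply
Agent1.sources.netcat-source.max-line-length = 1024
Agent1.sources.netcat-source.ack-every-event = false
```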

Third section:

Agent1.sinks.logger-sink.type = logger

The qualifiers and values follow the same pattern as in the second section. The line above specifies that the sink is a logger sink, which is further configured through the command line or the log4j properties file.

Logger Sink:

The logger sink is typically useful for testing and debugging. If the sink type is logger and the remaining logging properties are set in the log4j.properties file in FLUME_CONF_DIR, all events are written to the log file named by flume.log.file in the directory named by flume.log.dir.

Because flume.log.dir is set to ./logs in that file, whenever we start the Flume agent it creates a logs folder in the present working directory (pwd) and writes its log messages to the flume.log file in that folder.
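
For reference, the relevant settings in log4j.properties look roughly like the fragment below. This is a sketch that assumes the default file shipped in the conf directory of Flume 1.x releases that use log4j 1, combined with the values mentioned in this post (./logs and flume.log). Check your own copy, since it may differ.

```properties
# Fragment of log4j.properties (assumed defaults; verify against your installation)
flume.root.logger=INFO,LOGFILE
flume.log.dir=./logs
flume.log.file=flume.log
```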

In the previous post, we passed a Java option (-Dflume.root.logger=INFO,console) when starting the agent with the flume-ng command, to force Flume to log to the console.

That is why we were able to see the event messages on the console itself.

At the bottom of this post, we will start the same agent, Agent1, without that Java option (-Dflume.root.logger=INFO,console). The agent will then pick up the properties from the log4j.properties file automatically, and the log messages will be written to the log file.

Fourth Section:

Agent1.channels.memory-channel.type = memory
Agent1.channels.memory-channel.capacity = 1000
Agent1.channels.memory-channel.transactionCapacity = 100

These lines specify the channel type and then add the type-specific configuration values. Here we use the memory channel and set its capacity. The memory channel is non-persistent: events are held only in memory, with no external storage mechanism behind them.

Memory Channel:

Events are stored in an in-memory queue with a configurable maximum size. The memory channel is ideal for flows that need higher throughput and can tolerate losing the staged data if the agent fails.

Below are some of the properties for the memory channel. Required properties are marked in the Required column.

Property Name         Required   Default   Description
type                  yes        -         The component type name, needs to be memory
capacity              no         100       The maximum number of events stored in the channel
transactionCapacity   no         100       The maximum number of events the channel will take from a source or give to a sink per transaction
keep-alive            no         3         Timeout in seconds for adding or removing an event
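
As a sketch of how these properties fit together, the fragment below sizes a larger channel and lengthens the timeout. The numbers are illustrative only; the one constraint to keep in mind is that transactionCapacity should not exceed capacity.

```properties
# Illustrative sizing: transactionCapacity must not be larger than capacity
Agent1.channels.memory-channel.type = memory
Agent1.channels.memory-channel.capacity = 10000
Agent1.channels.memory-channel.transactionCapacity = 1000
Agent1.channels.memory-channel.keep-alive = 5
```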

Fifth Section:

Agent1.sources.netcat-source.channels = memory-channel
Agent1.sinks.logger-sink.channel = memory-channel

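These last lines connect the source and the sink to the channel. Note that the source property is channels (plural), because a source can write to several channels, while the sink property is channel (singular), because a sink reads from exactly one channel.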

Note: The <conf-dir> directory typically contains the shell script flume-env.sh and, optionally, a log4j properties file.

The sections named in this post are not actual sections of the Flume configuration file. The properties are divided into sections only for ease of explanation.
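
Putting the five groups of lines together gives the complete flume.conf discussed in this post. The property lines are exactly the ones quoted above; only the comments are added here as labels.

```properties
# Name the components of the agent
Agent1.sources = netcat-source
Agent1.channels = memory-channel
Agent1.sinks = logger-sink

# Configure the source
Agent1.sources.netcat-source.type = netcat
Agent1.sources.netcat-source.bind = localhost
Agent1.sources.netcat-source.port = 44444

# Configure the sink
Agent1.sinks.logger-sink.type = logger

# Configure the channel
Agent1.channels.memory-channel.type = memory
Agent1.channels.memory-channel.capacity = 1000
Agent1.channels.memory-channel.transactionCapacity = 100

# Bind the source and the sink to the channel
Agent1.sources.netcat-source.channels = memory-channel
Agent1.sinks.logger-sink.channel = memory-channel
```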

Logging Messages into Log File Instead of Console:

Start the agent with the command shown in the screenshot below.

[Screenshot: Flume Agent 1]

Open another terminal, connect to the configured port (44444) with the curl utility, type some messages and press Enter after each one. Finally, close the curl connection with Ctrl+C.

[Screenshot: Flume Curl 1]

Now open the flume.log file in the logs folder of the current working directory.

The screenshot below shows the message events written to the flume.log file.

[Screenshot: flume.log]

So, the network traffic events are successfully routed to the log file.


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

