Flume Agent – Collect Data From Command to a Flat File


In this post, we will discuss Flume agent configuration and setup for collecting the output of a command-line tool into a flat file.

We will use an Exec source, a File channel and a File Roll sink in our agent’s configuration. Let’s name our agent Agent2. We cover agent configuration and deployment first, and discuss each component and its additional properties toward the end of the post.

Flume Agent – Exec Source, File Roll Sink and File Channel:

Let’s create an agent named Agent2 in the flume.conf properties file under the Flume_Home/conf directory. We can either append the new agent’s properties to the existing flume.conf file or create a new file that contains only our agent.

Add the agent’s properties to the flume.conf file.
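
A minimal sketch of what such an Agent2 configuration could look like follows, written out with a shell here-document. The sink name file-sink and the channel name file-channel follow the property names used in this post. The source name exec-source, the tail -F flag, the output file name agent2-flume.conf and all directory paths are assumptions for illustration. Property names follow the Flume user guide, where the file channel's data directory property is the plural dataDirs.

```shell
# Write an illustrative Agent2 configuration to a standalone file.
# CONF_FILE, FLUME_DATA, the source name "exec-source" and the tail flags
# are assumptions; adjust them to your own setup.
CONF_FILE="${CONF_FILE:-./agent2-flume.conf}"
FLUME_DATA="${FLUME_DATA:-/tmp/flume/agent2}"

cat > "$CONF_FILE" <<EOF
# Name the components of Agent2
Agent2.sources = exec-source
Agent2.channels = file-channel
Agent2.sinks = file-sink

# Exec source: run a command and turn each output line into an event
Agent2.sources.exec-source.type = exec
Agent2.sources.exec-source.command = tail -F /var/log/syslog
Agent2.sources.exec-source.channels = file-channel

# File channel: buffer events on disk
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = $FLUME_DATA/checkpoint
Agent2.channels.file-channel.dataDirs = $FLUME_DATA/data

# File Roll sink: write events to files in a local directory
Agent2.sinks.file-sink.type = file_roll
Agent2.sinks.file-sink.sink.directory = $FLUME_DATA/output
# 0 disables time-based rolling, so all events go to a single file
Agent2.sinks.file-sink.sink.rollInterval = 0
Agent2.sinks.file-sink.channel = file-channel
EOF
```

Note that a source takes a channels list while a sink takes a single channel, which is why the two wiring properties are spelled differently.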

Before starting the agent, make sure of the following:

  • The directory given in the Agent2.sinks.file-sink.sink.directory property must already exist, and the flume user must have write access to it. Flume will not create this directory on the fly, even if the flume user has permission to create it.
  • The flume user needs write access to the Agent2.channels.file-channel.checkpointDir and Agent2.channels.file-channel.dataDirs locations. These directories can be created before starting the agent; if they are missing and the flume user has write access to the given path, the Flume JVM process creates them on the fly.
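
The two checks above can be sketched as a short shell snippet. The base directory /tmp/flume/agent2 and its subdirectory names are assumptions for illustration; use the paths from your own flume.conf, and run the snippet as the user that will run Flume.

```shell
# Base directory is an assumption; use the paths from your own flume.conf.
FLUME_DATA="${FLUME_DATA:-/tmp/flume/agent2}"

# The File Roll sink directory must exist before the agent starts.
mkdir -p "$FLUME_DATA/output"

# Optional: Flume can create these itself if it has write access to the path.
mkdir -p "$FLUME_DATA/checkpoint" "$FLUME_DATA/data"

# Check that the current user (assumed to be the one running Flume) can write.
for dir in output checkpoint data; do
  if [ -w "$FLUME_DATA/$dir" ]; then
    echo "writable: $FLUME_DATA/$dir"
  else
    echo "NOT writable: $FLUME_DATA/$dir" >&2
  fi
done
```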
Start Flume Agent:

Now start the Flume agent from a terminal.
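
A typical invocation of the flume-ng launcher might look like the sketch below. The installation path in FLUME_HOME, the configuration file location and the INFO,console logger setting are assumptions; the agent name passed to --name has to match the Agent2 prefix used in the properties file.

```shell
# FLUME_HOME and CONF_FILE are assumptions; point them at your installation.
FLUME_HOME="${FLUME_HOME:-/opt/flume}"
CONF_FILE="${CONF_FILE:-$FLUME_HOME/conf/flume.conf}"

start_agent2() {
  # --conf:      directory holding flume-env.sh and the logging settings
  # --conf-file: the properties file that defines Agent2
  # --name:      must match the agent name used as the property prefix
  "$FLUME_HOME/bin/flume-ng" agent \
    --conf "$FLUME_HOME/conf" \
    --conf-file "$CONF_FILE" \
    --name Agent2 \
    -Dflume.root.logger=INFO,console
}

# Start the agent in the foreground only if the launcher is present.
if [ -x "$FLUME_HOME/bin/flume-ng" ]; then
  start_agent2
else
  echo "flume-ng not found under $FLUME_HOME/bin" >&2
fi
```

Logging to the console keeps the agent's messages visible in the terminal, which is convenient for a first test run.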

The screenshot below shows the started agent:

[Screenshot: Flume Agent2]

After letting the agent run for some time, stop it by pressing Ctrl+C.

Now open the output directory in another terminal; new files have been created under the target directory. The screenshot below shows the new files and their contents.

[Screenshot: Flume Agent2 Output]

In the screenshot above, the log messages from /var/log/syslog have been copied into the 1411*-1 file. The Flume agent keeps this file open for writing and closes it only when the agent is stopped with Ctrl+C.

We can also see the files created under the File channel’s checkpoint and data directories.

[Screenshot: Flume channel op]

So we have successfully configured Agent2 with an Exec source, a File Roll sink and a File channel. Now we can look at each component used in this agent in more detail.

Details of Components:

Exec Source:

The Exec source runs a given Unix command and captures its output as input to the Flume agent. The process is expected to keep producing events for the agent continuously. If the process exits for any reason, the source also exits and produces no further data. This source is best suited to commands that produce a continuous stream of data.

In the example above, we used the Unix tail command on the /var/log/syslog file.

The tail command displays the contents of a file starting from the end. By default, it displays the last 10 lines of the file.
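
A few common ways of calling tail are sketched below. The sample file is generated only to make the examples reproducible and is an assumption, as is the choice of options shown; the follow modes are commented out because they run until interrupted.

```shell
# Build a small sample file with the numbers 1 to 25, one per line.
SAMPLE="$(mktemp)"
seq 1 25 > "$SAMPLE"

tail "$SAMPLE"            # default: the last 10 lines (16 through 25)
tail -n 3 "$SAMPLE"       # the last 3 lines (23, 24 and 25)
tail -c 6 "$SAMPLE"       # the last 6 bytes of the file

# Follow modes keep running and print new lines as they are appended:
# tail -f "$SAMPLE"         # follow the open file
# tail -F /var/log/syslog   # follow by name, reopening the file after log rotation
```

The -F form is the one usually paired with an Exec source on log files, since it keeps working when the log is rotated.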

We can also set additional properties on the Exec source. The properties it accepts are listed below; required properties are marked.

Property Name        Default   Description
type (required)      -         The component type name; must be exec
command (required)   -         The command to execute
shell                -         The shell in which to run the command, e.g. /bin/bash
restartThrottle      10000     Time (in milliseconds) to wait before attempting a restart
restart              false     Whether the executed command should be restarted if it dies
logStdErr            false     Whether the command’s stderr should be logged
batchSize            20        The maximum number of lines to read and send to the channel at a time

The ‘shell’ property is used to invoke the ‘command’ through a command shell (such as Bash or the C shell).
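
As an illustration of these optional properties, the sketch below writes an Exec source fragment that runs its command through Bash and restarts it if it dies. The source name exec-source, the file name agent2-exec-source.conf and the grep filter are assumptions; the property names are the ones listed in the table above.

```shell
# EXEC_SNIPPET is a hypothetical file name used only for this example.
EXEC_SNIPPET="${EXEC_SNIPPET:-./agent2-exec-source.conf}"

cat > "$EXEC_SNIPPET" <<'EOF'
Agent2.sources.exec-source.type = exec
# With 'shell' set, the command is passed to the shell and may use pipes
Agent2.sources.exec-source.shell = /bin/bash -c
Agent2.sources.exec-source.command = tail -F /var/log/syslog | grep --line-buffered -v DEBUG
# Restart the command if it dies, waiting 10 seconds between attempts
Agent2.sources.exec-source.restart = true
Agent2.sources.exec-source.restartThrottle = 10000
# Also log what the command writes to stderr
Agent2.sources.exec-source.logStdErr = true
Agent2.sources.exec-source.batchSize = 20
EOF
```

Without the shell property the command is executed directly, so shell features such as the pipe in this example would not be interpreted.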

Warning:

The Exec source cannot guarantee delivery: if it fails to put an event into the channel, the data is lost. Take the tail -F [log file] use case, where an application writes to a log file on disk and Flume tails the file, sending each line as an event. If the channel fills up and Flume cannot send an event, Flume has no way of telling the application writing the log file that it needs to retain the log. There is no guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling Directory source or direct integration with Flume via the SDK.
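
For comparison, a Spooling Directory source for the same agent could be declared roughly as in the sketch below. The source name spool-source, the spool path /var/log/app/spool and the file name agent2-spool-source.conf are assumptions; spooldir and spoolDir are the type and property names from the Flume user guide.

```shell
# SPOOL_SNIPPET is a hypothetical file name used only for this example.
SPOOL_SNIPPET="${SPOOL_SNIPPET:-./agent2-spool-source.conf}"

cat > "$SPOOL_SNIPPET" <<'EOF'
Agent2.sources = spool-source
# Read completed files dropped into a directory instead of tailing a live file
Agent2.sources.spool-source.type = spooldir
Agent2.sources.spool-source.spoolDir = /var/log/app/spool
Agent2.sources.spool-source.channels = file-channel
EOF
```

This source expects files to be complete and unchanged once they are placed in the spool directory, so it suits rotated log files more than a file that is still being written.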

File Roll Sink:

The output of the agent is written to files on the local file system, in the directory specified in the configuration file. By default, Flume rotates (rolls) to a new file every 30 seconds. In our setup, we have disabled this feature so that we can follow what is going on in a single file.

Required properties are in bold.