Flume Avro Client – Collecting a Remote File into Local File

In this post, we will discuss the setup of a Flume agent using an Avro client, an Avro source, a JDBC channel, and a File Roll sink.

First, we will create Agent3 in the flume.conf file under the FLUME_HOME/conf directory.

Flume Agent – Avro Source, JDBC Channel and File Roll Sink:

  • Add the below configuration properties to the flume.conf file to create Agent3.
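A minimal sketch of the Agent3 configuration, reconstructed to be consistent with the port and output directory used later in this post (the component names avro-source, jdbc-channel, and file-sink are illustrative):

  Agent3.sources = avro-source
  Agent3.channels = jdbc-channel
  Agent3.sinks = file-sink

  Agent3.sources.avro-source.type = avro
  Agent3.sources.avro-source.bind = localhost
  Agent3.sources.avro-source.port = 11111
  Agent3.sources.avro-source.channels = jdbc-channel

  Agent3.channels.jdbc-channel.type = jdbc

  Agent3.sinks.file-sink.type = file_roll
  Agent3.sinks.file-sink.sink.directory = /usr/lib/flume/agent/files
  Agent3.sinks.file-sink.channel = jdbc-channel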

  • Make sure the /usr/lib/flume/agent/files/ directory is created and the Flume user has write permissions to this location.
  • Create a sample file to give as input to the Avro client. Let's create AvroClientInput.txt in the home directory itself.
  • Now start the agent with the below command.
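A sketch of both steps, assuming user siva's home directory as used later in this post (the sample file content is illustrative):

  $ echo "sample avro client input" > /home/siva/AvroClientInput.txt
  $ flume-ng agent --conf $FLUME_HOME/conf --conf-file $FLUME_HOME/conf/flume.conf --name Agent3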

Below is the screen shot of starting the agent from the terminal.

[Screenshot: Flume Agent3 startup]

  • Now, from another terminal, open a connection to the host through the Avro client and send the test file to the specified port with the below command.
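Reconstructed from the host, port, and file path described below, the command would look like:

  $ flume-ng avro-client --conf $FLUME_HOME/conf -H localhost -p 11111 -F /home/siva/AvroClientInput.txt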

An Avro client included in the Flume distribution can send a given file to a Flume Avro source with the above command. The above command sends the contents of the /home/siva/AvroClientInput.txt file to the Flume source listening on localhost:11111.

Below is the screen shot of the other terminal. We created the test input file first, and then avro-client is used to send the file to localhost:11111.

[Screenshot: Flume Avro Client]

  • Validate the output in the destination directory /usr/lib/flume/agent/files/.
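For example (the File Roll sink writes events into timestamp-named files, so the exact file name will vary):

  $ ls /usr/lib/flume/agent/files/
  $ cat /usr/lib/flume/agent/files/*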

[Screenshot: Flume Avro Client output]

So, we have successfully configured an agent with an Avro source and a JDBC channel feeding into a File Roll sink.

Below are the in-depth details about the Avro source and the JDBC channel.

Avro Source:

Avro is a data serialization framework that manages the packaging and transport of data from one point to another across the network. An Avro source listens on a configured port and receives events from external Avro client streams. When paired with the built-in Avro sink on another Flume agent, it can create chained-agent topologies.

By default, the Flume distribution provides both an Avro source and a standalone Avro client. The Avro client reads a file and sends it to an Avro source anywhere on the network; it need not be on the same local machine as in our example. If it is outside the local machine, the Avro client requires the explicit hostname and port of the Avro source to which it should send the file.

Finally, the Avro source collects the events from the file sent by the Avro client and passes them through the channel to the File Roll sink.

Below are the properties related to the Avro source. Required properties are marked with an asterisk (*).

Property Name      Default  Description
type*              –        The component type name; needs to be avro
bind*              –        Hostname or IP address to listen on
port*              –        Port number to bind to
threads            –        Maximum number of worker threads to spawn
compression-type   none     This can be "none" or "deflate"; the compression-type must match the compression-type of the matching AvroSource
ssl                false    Set this to true to enable SSL encryption
JDBC Channel:

Similar to the File channel, the JDBC channel also provides persistent storage of events to prevent event loss in case of agent failure. By default, the JDBC channel uses an embedded Derby database to store events. This is a durable channel that's ideal for flows where recoverability is important.

In the below properties table, required properties are marked with an asterisk (*).

Property Name               Default                               Description
type*                       –                                     The component type name; needs to be jdbc
db.type                     DERBY                                 Database vendor; needs to be DERBY
driver.class                org.apache.derby.jdbc.EmbeddedDriver  Class for the vendor's JDBC driver
driver.url                  (constructed from other properties)   JDBC connection URL
db.username                 "sa"                                  User id for the db connection
db.password                 –                                     Password for the db connection
connection.properties.file  –                                     JDBC connection property file path
create.schema               true                                  If true, creates the db schema if not there
create.index                true                                  Create indexes to speed up lookups
create.foreignkey           true                                  Create foreign keys
transaction.isolation       "READ_COMMITTED"                      Isolation level for db sessions: READ_UNCOMMITTED, READ_COMMITTED, SERIALIZABLE, REPEATABLE_READ
maximum.connections         –                                     Max connections allowed to the db
maximum.capacity            –                                     Max number of events in the channel
sysprop.*                   –                                     DB vendor specific properties
sysprop.user.home           –                                     Home path to store the embedded Derby database
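As an aside, a hedged sketch of overriding a couple of these optional properties for our Agent3 channel (the capacity value and Derby home path are illustrative, not from the original setup):

  Agent3.channels.jdbc-channel.type = jdbc
  Agent3.channels.jdbc-channel.maximum.capacity = 100000
  Agent3.channels.jdbc-channel.sysprop.user.home = /usr/lib/flume/jdbc-home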

Hive CLI Commands

In our previous posts, we covered Hive Overview and Hive Architecture, and now we will discuss the default service in Hive, the Hive Command Line Interface, and Hive CLI commands.

Ways to Interact with Hive

  • CLI, the command-line interface
  • Karmasphere (http://karmasphere.com), a commercial product
  • Cloudera's open source Hue (https://github.com/cloudera/hue)
  • A new "Hive-as-a-service" offering from Qubole (http://qubole.com)
  • A simple web interface called the Hive Web Interface (HWI)
  • Programmatic access through JDBC, ODBC, and a Thrift server

Hive CLI Commands

Hive CLI (Command Line Interface), which is nothing but the Hive shell, is the default service in Hive and the most common way of interacting with it. We can run both batch and interactive shell commands via the CLI service, which we will cover in the following sections.

We can get the list of commands/options allowed on the Hive CLI with the $ hive --service cli --help command from the terminal. Below are the options/arguments allowed for the CLI service. Examples for these options are provided below the table.

 Argument                     Description
 -d,--define <key=value>      Define new variables for the Hive session
 --database <databasename>    Specify the database to use in the Hive session
 -e <quoted-query-string>     Run a Hive query from the command line
 -f <filename>                Execute Hive queries from a file
 -h <hostname>                Connect to the Hive Server on a remote host
 -p <port>                    Connect to the Hive Server on a port number
 --hiveconf <property=value>  Set a configuration property for the current Hive session
 --hivevar <key=value>        Same as the --define argument
 -i <filename>                Initialize the Hive session from an SQL properties file
 -S,--silent                  Silent mode in interactive shell; suppresses log messages
 -v,--verbose                 Verbose mode (prints executed SQL to the console)

We provide these options or arguments at the time of starting the Hive CLI service itself.

For example, $ hive -d flag=N

Types of Hive Variables

Here we define variables or properties with the --define, --hivevar, and --hiveconf arguments. In Hive there can be four types of variables or properties. They are:

  • env : environment variables defined by the shell; these are read-only in a Hive session
  • hivevar : user-defined custom variables; these can be overridden with the set command in the Hive session/shell
  • hiveconf : Hive-specific configuration properties
  • system : configuration properties defined by Java
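To inspect a variable from each namespace, the set command can be prefixed with the namespace name. A quick sketch, assuming the session was started with -d flag=N as in the earlier example:

  hive> set env:HOME;
  hive> set hivevar:flag;
  hive> set hiveconf:hive.cli.print.current.db;
  hive> set system:user.name;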

Examples of using these options

As shown in the table, the --define and --hivevar arguments are the same and are used for defining Hive variables.

Scenario 1: --define or --hivevar Options

Let's define a Hive variable address with the value 'country, state, city, post', so that whenever we need to pull all these columns from a table we can simply use the address variable in our HiveQL queries instead of writing out all the columns, as shown below.
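A minimal sketch of this scenario (the table name customer_tab is hypothetical):

  $ hive --hivevar address='country, state, city, post'
  hive> select ${hivevar:address} from customer_tab;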

If, later in the session, we need to know the values of the variables defined for that Hive session, we can use the set command in the Hive shell. We will discuss more about this in the following sections.

Scenario 2: --database Option

When we have multiple databases in Hive and need to use a particular one for the current session, we can set the database with this argument as shown below, or we can use the use <database> command in the Hive shell.
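A sketch, assuming a database named retail_db exists (the name is hypothetical):

  $ hive --database retail_db

or, from within the Hive shell:

  hive> use retail_db;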

Scenario 3: -S, -e Options, Environment variables & Redirecting Output to File

Let's run HiveQL commands in batch mode (single-shot commands), make use of Hive variables, and redirect the output to a file on the local FS. Here -S (silent) suppresses log messages (the OK, Time taken ... lines) from the output.

Let's create a VAR1 environment variable and try to access its value in a Hive query.
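A sketch of the full flow (the table sales_tab and the output path are hypothetical; single quotes keep the shell from expanding ${env:VAR1} before Hive sees it):

  $ export VAR1=2014
  $ hive -S -e 'select * from sales_tab where year = ${env:VAR1}' > /home/siva/HiveOutput.txt
  $ cat /home/siva/HiveOutput.txt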

[Screenshot: Hive single-shot commands in silent mode]

We can see the cat command output to verify the results.

Scenario 4: Connecting to a Remote Hive Server

By using the -h and -p options to the hive command, we can log in to a remote Hive server as shown below. This gives more flexibility to work on a production Hive server from a Hive client in a development environment, instead of logging directly into the Hive server machine to access production Hive tables.
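For example (the host name is illustrative; 10000 is the usual Hive server port):

  $ hive -h prod-hive-server.example.com -p 10000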

Scenario 5: Running Queries from a File

When we have our HiveQL written in a separate file, we can perform batch execution of the HiveQL commands. The file can be present on the local file system or on HDFS. Below are examples of running queries from a file.

Local FS:
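For example (the file name sample.hql is hypothetical):

  $ hive -f /home/siva/sample.hql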

We can run queries from a file even when we are already in the Hive shell, with the below command.
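Using the same hypothetical file:

  hive> source /home/siva/sample.hql;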

HDFS:
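A sketch, assuming a NameNode at localhost:9000 (the URI and path are illustrative):

  $ hive -f hdfs://localhost:9000/user/siva/sample.hql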

[Screenshot: Hive query execution from an HDFS file]

Scenario 6: .hiverc File Initialization Script

We can initialize a Hive session before entering interactive mode with the -i option. If the CLI is invoked without the -i option, Hive will attempt to load $HIVE_HOME/bin/.hiverc and $HOME/.hiverc as initialization files.

Typical properties that can be placed in a .hiverc file are shown below.

Example:
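A sketch of a typical .hiverc; these properties show the current database in the prompt, print column headers in query results, and allow small jobs to run locally (the selection is illustrative):

  set hive.cli.print.current.db=true;
  set hive.cli.print.header=true;
  set hive.exec.mode.local.auto=true;

To run a custom initialization file instead (the file name is hypothetical):

  $ hive -i /home/siva/custom-init.hql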

Hive Batch Mode Commands

As discussed in the above sections, Hive supports the below two types of batch mode commands.

  • hive -e "<query-string>" – executes the quoted query string
  • hive -f <filepath> – executes one or more SQL queries from a file

Comments in Hive Scripts

While writing batch scripts for Hive, we can embed comments in the script file by prefixing comment lines with -- as shown below.
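A short sketch of a script file (the table name sales_tab is hypothetical):

  -- This line is a comment and is ignored by Hive
  select * from sales_tab;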

Refer to the next post for interactive shell commands.

References

Apache Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual