Monthly Archives: August 2014


Flume Data Collection into HBase

We will discuss collecting data into HBase directly through a Flume agent. In our previous posts under the Flume category, we covered the setup of Flume agents for the file roll, logger and HDFS sink types. In this post, we explore the details of the HBase sink and its setup with a live example. As we have already covered the File channel, Memory channel and JDBC channel, we will try to make […]


Flume Data Collection into HDFS

In this post, we will discuss the setup of an agent for Flume data collection into HDFS. We will set up an agent with a Sequence Generator source, HDFS sink and Memory channel, start that agent and verify its functionality. Flume Data Collection into HDFS. Flume Agent – Sequence Generator Source, HDFS Sink and Memory Channel: Add the configuration properties below to the flume.conf file to create Agent4 with Sequence source, memory […]
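As a rough illustration of the agent described in this excerpt, here is a minimal sketch of an Agent4 definition. It is not the configuration from the full post: the component names (seq-source, mem-channel, hdfs-sink), the channel capacity and the HDFS path are all assumptions chosen for the example.

```properties
# Sketch only: component names, capacity and HDFS path are assumptions.
Agent4.sources = seq-source
Agent4.channels = mem-channel
Agent4.sinks = hdfs-sink

# Sequence Generator source: emits events carrying an incrementing counter
Agent4.sources.seq-source.type = seq
Agent4.sources.seq-source.channels = mem-channel

# Memory channel: buffers events in RAM between source and sink
Agent4.channels.mem-channel.type = memory
Agent4.channels.mem-channel.capacity = 1000

# HDFS sink: writes events as plain text under a hypothetical path
Agent4.sinks.hdfs-sink.type = hdfs
Agent4.sinks.hdfs-sink.hdfs.path = /user/flume/seq-events
Agent4.sinks.hdfs-sink.hdfs.fileType = DataStream
Agent4.sinks.hdfs-sink.channel = mem-channel
```

Note that a source lists its channels with the plural key `channels`, while a sink reads from exactly one channel through the singular key `channel`.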


Flume Avro Client – Collecting a Remote File into Local File

In this post, we will discuss the setup of a Flume agent using an Avro client, Avro source, JDBC channel and File Roll sink. First we will create Agent3 in the flume.conf file under the FLUME_HOME/conf directory. Flume Agent – Avro Source, JDBC Channel and File Roll Sink: Add the configuration properties below to the flume.conf file to create Agent3.

 Make sure the /usr/lib/flume/agent/files/ directory is created and the Flume user has write permissions to this location. […]
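For orientation, an Agent3 definition along these lines could be sketched as follows. This is not the configuration from the full post: the component names, bind address and port are hypothetical, and using /usr/lib/flume/agent/files as the File Roll sink directory is an assumption based on the note above.

```properties
# Sketch only: component names, bind address and port are assumptions.
Agent3.sources = avro-source
Agent3.channels = jdbc-channel
Agent3.sinks = file-sink

# Avro source: receives events sent by an Avro client
Agent3.sources.avro-source.type = avro
Agent3.sources.avro-source.bind = localhost
Agent3.sources.avro-source.port = 41414
Agent3.sources.avro-source.channels = jdbc-channel

# JDBC channel: persists events in a database (embedded Derby by default)
Agent3.channels.jdbc-channel.type = jdbc

# File Roll sink: writes events to files in a local directory
# (assumed here to be the directory mentioned in the note above)
Agent3.sinks.file-sink.type = file_roll
Agent3.sinks.file-sink.sink.directory = /usr/lib/flume/agent/files
Agent3.sinks.file-sink.channel = jdbc-channel
```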


Hive CLI Commands

In our previous posts, we covered the Hive overview and Hive architecture; now we will discuss the default service in Hive, the Hive Command Line Interface, and Hive CLI commands. Ways to interact with Hive: the CLI (command-line interface); Karmasphere (http://karmasphere.com), a commercial product; Cloudera’s open source Hue (https://github.com/cloudera/hue); a new “Hive-as-a-service” offering from Qubole (http://qubole.com); a simple web interface called Hive Web Interface (HWI); and programmatic […]


org.apache.flume.EventDeliveryException: Failed to open file

Error Scenario: org.apache.flume.EventDeliveryException: Failed to open file. We receive this error message when a Flume agent is started and tries to start a FILE_ROLL sink with a given target sink directory. Below is the sequence of error messages from the ~/logs/flume.log file.

Root Cause: The Flume agent was started with the FILE_ROLL sink type without the <sink.directory> folder having been created beforehand, or the Flume user doesn’t have enough […]


Flume Agent – Collect Data From Command to a Flat File

In this post, we will discuss Flume agent configuration and setup for collecting data from the output of a command-line tool into a flat file. We will use the Exec source type, File channel and File Roll sink type in the configuration of our agent. Let's name our agent Agent2. We will discuss more about each component and its additional properties at the bottom of this post but we […]
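To make the combination of components concrete, here is a minimal sketch of how an Agent2 with these three component types might be declared. The component names, the command being run and every directory path are assumptions for illustration, not values from the full post.

```properties
# Sketch only: component names, command and all paths are assumptions.
Agent2.sources = exec-source
Agent2.channels = file-channel
Agent2.sinks = file-sink

# Exec source: runs a command and turns each output line into an event
Agent2.sources.exec-source.type = exec
Agent2.sources.exec-source.command = tail -F /var/log/syslog
Agent2.sources.exec-source.channels = file-channel

# File channel: buffers events on disk so they survive an agent restart
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = /tmp/flume/checkpoint
Agent2.channels.file-channel.dataDirs = /tmp/flume/data

# File Roll sink: writes events to flat files, rolling every 30 seconds
Agent2.sinks.file-sink.type = file_roll
Agent2.sinks.file-sink.sink.directory = /tmp/flume/out
Agent2.sinks.file-sink.sink.rollInterval = 30
Agent2.sinks.file-sink.channel = file-channel
```

The sink directory has to exist and be writable by the user running the agent before the agent is started.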


Flume Agent Configuration

As mentioned in the previous post, we will now discuss in detail the properties in a Flume agent configuration. For ease of understanding, we will consider the same flume.conf file created in our previous post. The Flume agent configuration file flume.conf resembles the Java properties file format with hierarchical property settings. The filename flume.conf is not fixed; we can give the file any name, but we need to use the same name […]


Flume Agent Setup – Netcat Source, Console Sink

In this post, we will discuss setting up a simple Flume agent using Netcat as the source and the console as the sink. In this example of a single-node Flume deployment, we create a Netcat source that listens on a port (localhost:44444) for network connections and a logger sink that logs network traffic to the console. To send network traffic, we can use either the curl utility or the traditional telnet tool. We prefer using curl in this […]
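The single-node setup described here can be sketched roughly as below. The agent name Agent1, the component names, and the use of a memory channel with these capacities are assumptions; only the Netcat source on localhost:44444 and the logger sink come from the excerpt.

```properties
# Sketch only: agent name, component names and the channel are assumptions.
Agent1.sources = netcat-source
Agent1.channels = mem-channel
Agent1.sinks = logger-sink

# Netcat source: listens on localhost:44444 and turns each line into an event
Agent1.sources.netcat-source.type = netcat
Agent1.sources.netcat-source.bind = localhost
Agent1.sources.netcat-source.port = 44444
Agent1.sources.netcat-source.channels = mem-channel

# Memory channel connecting the source to the sink
Agent1.channels.mem-channel.type = memory
Agent1.channels.mem-channel.capacity = 1000
Agent1.channels.mem-channel.transactionCapacity = 100

# Logger sink: logs each event at INFO level
Agent1.sinks.logger-sink.type = logger
Agent1.sinks.logger-sink.channel = mem-channel
```

With a file like this, the agent would typically be started with `flume-ng agent --conf conf --conf-file conf/flume.conf --name Agent1 -Dflume.root.logger=INFO,console`, where the value of `--name` must match the agent prefix used in the file.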


Apache Flume Installation

In this post, we briefly discuss Apache Flume installation and configuration on an Ubuntu machine. The current version of Apache Flume is called Flume NG (Next Generation), and its old version has been renamed Flume OG (Old Generation). In this post, we will discuss Flume NG only. Prerequisites: JDK 1.6 or a later version of Java installed on our Ubuntu machine. Memory – sufficient memory for configurations used by sources, channels or […]


Flume Architecture

This post gives a basic overview of Apache Flume and illustrates its architecture. What is Flume?: Flume is a highly reliable, distributed and configurable streaming data collection tool. Flume can transport log files across a large number of hosts into HDFS. Need for Flume: These days, most new data is contained in high-throughput streams: application logs, social media updates, web server logs, network logs and website click streams create fast-moving streams […]


Java vs Hive

In this post we will discuss the differences between Java and Hive with the help of a word count example. We will examine the word count algorithm first using the Java MapReduce API and then using Hive. The following Java implementation is included in the Apache Hadoop distribution.

To implement the word count algorithm, we need to write 63 lines of Java code, compile it and build a Jar […]


Hive vs RDBMS

In this post we will discuss the differences between Hive and RDBMS (traditional relational databases). A few examples of traditional relational databases are MySQL, PostgreSQL, Oracle 11g, MS SQL Server, etc. Below are the key features of Hive that differ from an RDBMS. Hive resembles a traditional database in supporting an SQL interface, but it is not a full database. Hive is better described as a data warehouse than as a database. Hive enforces […]