Monthly Archives: October 2014


Creating Custom UDF in Hive – Auto Increment Column in Hive

In this post we will describe the process of creating a custom UDF in Hive. Though Hive provides many generic UDFs (user-defined functions), we sometimes need to write our own custom UDFs to meet our requirements. In this post, we will discuss a common requirement among clients migrating from a traditional RDBMS to Hive: they expect an Auto Increment Column in […]


Cloudera Manager Installation on Amazon EC2

In this post, we will discuss Hadoop installation on cloud infrastructure. Though there are a number of posts available across the internet on this topic, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, along with our practical views on the installation and some tips and hints to avoid running into issues. This post also gives a basic introduction to using Amazon AWS cloud services. Creation of […]


HBase Interview Questions and Answers Part – 1

Below are a few important Hadoop HBase interview questions and answers that are suitable for Hadoop freshers as well as experienced developers. 1. What is HBase? HBase is a column-oriented, open-source, multidimensional, distributed database. It runs on top of HDFS. 2. Why do we use HBase? HBase provides random reads and writes and can perform thousands of operations per second on large data sets. HBase supports record-level operations on database […]


Hive Connectivity With Hunk (Splunk)

In this post we will discuss the configuration required for Hive connectivity with Hunk, the Hadoop flavor of Splunk, the well-known visualization tool. Splunk Overview: Splunk captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, dashboards and visualizations. Splunk released a product called Hunk: Splunk Analytics for Hadoop, which supports accessing, searching, and reporting on external data sets located in Hadoop from […]


Hive on Tez – Hive Integration with Tez

In this post, we will discuss Hive integration with the Tez framework, i.e. enabling Tez for Hive queries. We will also run sample Hive queries on both the MapReduce and Tez frameworks and evaluate the performance difference between them. Tez Advantages: Tez offers a customizable execution architecture that allows us to express complex computations as data flow graphs and allows for dynamic performance optimizations based […]


Apache Tez – Successor of Mapreduce Framework

Apache Tez Overview What is Apache Tez? Apache Tez is an execution framework project from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework. Apache Tez Features: Tez provides performance gains over MapReduce while remaining backward compatible with the MapReduce framework, optimal resource management, plan reconfiguration at run time, and dynamic physical data flow decisions. Tez is […]


Merging Small Files Into Avro File

This post is a continuation of the previous post on working with the small files issue. In the previous post we merged a huge number of small files in an HDFS directory into a sequence file, and in this post we will merge a huge number of small files on the local file system into an Avro file in an HDFS output directory. We will store the file names as keys and the file contents as values in the Avro file. We will […]


Cannot create an instance of InputFormat

Error Scenario: java.io.IOException: Cannot create an instance of InputFormat class. We get this error message when we try to execute simple hadoop fs commands or run any Hive queries. Below is the complete error message.

Root Cause: This error message is received when there are spaces or spelling mistakes in any of the *-site.xml configuration files. Suppose in the core-site.xml file below the value for io.compression.codecs contains spaces, which […]
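As a minimal sketch of the root cause described above (a hypothetical core-site.xml fragment; the property name io.compression.codecs is a standard Hadoop configuration key, but the codec list shown here is illustrative), note how a stray space after the comma in the comma-separated value is enough to break class loading, because Hadoop tries to instantiate a class whose name begins with a space:

```
<!-- core-site.xml: the space before the second codec class name
     causes "Cannot create an instance of InputFormat class" -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Removing the space (or any typo in the class names) so that each entry is an exact fully qualified class name resolves the error.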


Reading and Writing SequenceFile Example

This post is a continuation of the previous post on Hadoop sequence files. In this post we will discuss reading and writing SequenceFile examples using the Apache Hadoop 2 API. Writing Sequence File Example: As discussed in the previous post, we use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance, and we use the append(key, value) method to insert each record into the sequence file. In the example program below, we read the contents […]


Hadoop Sequence Files example

In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post we will discuss the basic details and format of Hadoop sequence files with examples. Hadoop Sequence Files: Advantages: As binary files, these are more compact than text files, and they provide optional support for compression at different […]


Avro MapReduce 2 API Example

Avro provides support for both the old MapReduce package API (org.apache.hadoop.mapred) and the new MapReduce package API (org.apache.hadoop.mapreduce). Avro data can be used as both the input and the output of a MapReduce job, as well as the intermediate format. In this post we will provide an example run of the Avro MapReduce 2 API. This post can be treated as a continuation of the previous post on the Avro MapReduce API. In this post, we will create […]