Monthly Archives: May 2014


HBase Installation in Pseudo-Distributed Mode

This post describes the procedure for installing HBase on an Ubuntu machine in pseudo-distributed mode on top of HDFS. Prerequisites: Java is the main prerequisite; JDK 1.6 or a later version is required to run HBase. Hadoop 1 or Hadoop 2 must be installed on a pseudo-distributed or fully distributed cluster. HBase Installation Procedure: Follow the steps below, in order, to complete the HBase installation on an Ubuntu machine. […]
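To give a flavour of the configuration the installation steps lead to, a minimal hbase-site.xml for pseudo-distributed mode on top of a local HDFS might look like the following; the HDFS host/port (localhost:9000) is an assumption and must match the fs.defaultFS of your own Hadoop setup:

```xml
<configuration>
  <!-- Store HBase data in HDFS; host/port must match fs.defaultFS (assumed localhost:9000) -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <!-- Run the HBase daemons as separate processes (pseudo-distributed) -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
</configuration>
```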


HBase Overview

HBase is Hadoop’s database, and below is a high-level HBase overview. What is HBase? HBase is a scalable, distributed, column-oriented database built on top of Hadoop and HDFS. Apache HBase is an open-source, non-relational database modeled on Google’s Bigtable, a distributed storage system for structured data. HBase provides random, real-time read/write access to Big Data. Need for HBase: Although most […]


Mapreduce Program to Calculate Missing Count

Use Case Description: This post describes an approach to a use case scenario where an input file contains some columns and their corresponding values as records. Some of these columns may have blanks/nulls instead of actual values, i.e. data is missing for some columns, and the developer needs to write a Mapreduce program to calculate the missing count and percentage for each input column. For example, consider the text files below as […]
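The core logic of such a job can be sketched outside Hadoop. The snippet below simulates the map and reduce phases in plain Python on a tiny invented dataset (the column names and records are illustrative, not taken from the post):

```python
from collections import defaultdict

# Illustrative records: column -> value, where "" means the value is missing.
records = [
    {"name": "alice", "city": "",       "zip": "500001"},
    {"name": "",      "city": "delhi",  "zip": ""},
    {"name": "carol", "city": "mumbai", "zip": ""},
]

def mapper(record):
    # Emit (column, 1) for a missing value and (column, 0) otherwise,
    # so the reducer can derive both the missing count and the total.
    for column, value in record.items():
        yield column, (1 if value == "" else 0)

def reducer(pairs):
    missing = defaultdict(int)
    total = defaultdict(int)
    for column, flag in pairs:
        missing[column] += flag
        total[column] += 1
    # Per column: (missing count, missing percentage)
    return {c: (missing[c], 100.0 * missing[c] / total[c]) for c in total}

result = reducer(pair for r in records for pair in mapper(r))
print(result)
```

In a real Hadoop job the same idea would be split across a Mapper and a Reducer class, with the framework doing the shuffle between them.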


Hadoop Output Formats

Hadoop Output Formats: We discussed the input formats supported by Hadoop in a previous post. In this post, we will give an overview of the Hadoop output formats and their usage. Hadoop provides an output format corresponding to each input format. All Hadoop output formats must implement the interface org.apache.hadoop.mapreduce.OutputFormat. OutputFormat describes the output specification for a Mapreduce job. Based on the output specification, the Mapreduce job checks that the output directory doesn’t already […]


Mapreduce Interview Questions and Answers for Experienced Part – 3

Below are a few more Hadoop Mapreduce interview questions and answers for experienced and fresher Hadoop developers. Hadoop Mapreduce Interview Questions and Answers for Experienced: 1.  After a restart of the namenode, Mapreduce jobs that worked fine before the restart started failing. What could be wrong? The cluster could be in safe mode after the restart of the namenode. The administrator needs to wait for the namenode to exit the safe […]


Mapreduce Interview Questions and Answers for Experienced Part – 2

Below are a few more Hadoop Mapreduce interview questions and answers for experienced and fresher Hadoop developers. Hadoop Mapreduce Interview Questions and Answers for Experienced: 1.  What is side data distribution in the Mapreduce framework? The extra read-only data needed by a Mapreduce job to process the main dataset is called side data. There are two ways to make side data available to all the map or reduce […]


Hadoop Interview Questions and Answers Part – 4

Below are a few more Hadoop interview questions and answers for fresher and experienced Hadoop developers. Hadoop Interview Questions and Answers 1.  What is the default block size in HDFS? As of the Hadoop 2 releases, the default block size in HDFS is 128 MB; prior to that it was 64 MB. 2.  What is the benefit of a large block size in HDFS? The main benefit of a large block size […]
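The block-size question can be made concrete with a little arithmetic: the namenode keeps one metadata object in memory per block, so for a fixed amount of data, doubling the block size halves the number of blocks it must track. A quick illustration (the file size is chosen arbitrarily):

```python
import math

file_size = 10 * 1024**3              # a 10 GB file, chosen purely for illustration

# Number of HDFS blocks (and hence namenode metadata objects) per block size.
blocks_needed = {
    mb: math.ceil(file_size / (mb * 1024**2))
    for mb in (64, 128)               # older vs. newer HDFS defaults
}
print(blocks_needed)   # fewer, larger blocks mean less namenode metadata
```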


Eclipse Mapreduce Example

Running a Sample Mapreduce Word Count Program in Eclipse: This post is an extension of the previous post on configuring Eclipse for Hadoop. Once that configuration is done successfully, we can run sample Mapreduce programs in the Eclipse IDE. In this Eclipse Mapreduce Example post, we will develop a sample Word Count Mapreduce program from scratch, execute its jar file on a Hadoop cluster, and verify the results. 1.  Start Eclipse […]


Eclipse Configuration for Hadoop

Eclipse is a powerful IDE for Java development. Since Hadoop and Mapreduce programming is done in Java, it is best to work in a well-featured Integrated Development Environment (IDE). So, in this post, we will learn how to install Eclipse on an Ubuntu machine and configure it for Hadoop and Mapreduce programming. Let’s start by downloading and installing Eclipse on the Ubuntu machine. 1. Install Eclipse: Download the latest […]


Incompatible clusterIDs

Error Scenario: Incompatible clusterIDs. When we receive error messages in the datanode logs similar to those below, the problem belongs to this error scenario.

Root Cause: This error message is received when the cluster ID of the namenode and the cluster ID of the datanode are different. We can see the namenode’s cluster ID in the <dfs.namenode.name.dir>/current/VERSION file and the datanode’s cluster ID in the <dfs.datanode.data.dir>/current/VERSION file. These files look like […]
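The comparison described above can be scripted. The sketch below pulls the clusterID line out of two VERSION files and compares them; the file contents here are invented samples in the usual key=value layout, and the real files live under your configured dfs.namenode.name.dir and dfs.datanode.data.dir:

```python
# Invented sample contents of the two VERSION files (key=value layout).
namenode_version = """\
namespaceID=1234567890
clusterID=CID-aaaa1111-bbbb-2222-cccc-333344445555
storageType=NAME_NODE
"""
datanode_version = """\
storageID=DS-0987654321
clusterID=CID-ffff9999-eeee-8888-dddd-777766665555
storageType=DATA_NODE
"""

def cluster_id(version_text):
    # VERSION files are simple key=value lines; pull out the clusterID value.
    for line in version_text.splitlines():
        key, _, value = line.partition("=")
        if key == "clusterID":
            return value
    return None

nn, dn = cluster_id(namenode_version), cluster_id(datanode_version)
print("match" if nn == dn else f"mismatch: {nn} vs {dn}")
```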


storage directory does not exist

Error Scenario: storage directory does not exist or is not accessible / Exception in namenode join. When we receive error messages in the namenode or datanode logs similar to those below, the problem belongs to this error scenario.

Root Cause: This kind of error message is received when we configure Hadoop without creating the storage directories for saving namenode or datanode metadata files in the local file […]
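As an illustration, the directories named in hdfs-site.xml must exist on the local filesystem and be writable by the Hadoop user before the namenode is formatted and the daemons are started; the paths below are placeholders, not values from the post:

```xml
<configuration>
  <!-- Create these first, e.g.:
       mkdir -p /usr/local/hadoop/data/namenode /usr/local/hadoop/data/datanode -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/datanode</value>
  </property>
</configuration>
```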


Install Hadoop on Multi Node Cluster

This post is written under the assumption that the reader already has an idea of installing and configuring Hadoop on a single-node cluster. If not, it is better to first go through the post Installing Hadoop on single node cluster. In this post we will briefly discuss installing and configuring hadoop-2.3.0 on a multi-node cluster. For simplicity, we will consider a small cluster of 3 nodes, […]
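To give a flavour of what changes compared to a single-node setup: core-site.xml on every node points fs.defaultFS at the master’s hostname rather than localhost, and the slaves file on the master lists one worker hostname per line. The hostname and port below are placeholders:

```xml
<!-- core-site.xml on every node; "master" is a placeholder hostname -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```

The corresponding $HADOOP_HOME/etc/hadoop/slaves file on the master would then simply contain the worker hostnames, one per line.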