Monthly Archives: April 2014


Hadoop Input Formats

Hadoop Input Formats: As discussed in our Mapreduce Job Flow post, input files are broken into splits at job startup and the data in each split is sent to a mapper implementation. In this post, we will discuss in detail the input formats supported by Hadoop and Mapreduce and how input files are processed in a Mapreduce job. Input Splits, Input Formats and Record Reader: […]
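For context, here is a minimal Java driver sketch; the class name and paths are placeholders, and it simply wires in TextInputFormat (Hadoop's default input format) and optionally caps the split size, which are the knobs the full post walks through.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input format demo");
        job.setJarByClass(InputFormatDriver.class);

        // TextInputFormat splits files into lines; its record reader delivers
        // each record to the mapper as a (LongWritable byte offset, Text line) pair.
        job.setInputFormatClass(TextInputFormat.class);

        // Optionally bound how large a single input split may grow (128 MB here).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // No mapper is set, so the default (identity) mapper passes records through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}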


Hadoop Interview Questions and Answers Part – 3

Below are a few more Hadoop interview questions and answers for both freshers and experienced Hadoop developers and administrators. Hadoop Interview questions and answers 1. What is a Backup Node? It is an extended checkpoint node that performs checkpointing and also supports online streaming of file system edits. It maintains an in-memory, up-to-date copy of the file system namespace and accepts a real-time online stream of file system edits […]


About Me

The main goal of this site is to provide tutorials on Hadoop and Big Data tools so that any IT programmer can easily learn this fast-emerging technology and use it to solve Big Data processing problems. All the concepts covered on the site are presented with simple, tested examples that give hands-on visibility into the technology on the first read itself. All tutorials are free with […]


Creating Custom Hadoop Writable Data Type

Sometimes, when none of the built-in Hadoop Writable data types matches our requirements, we can create a custom Hadoop data type by implementing the Writable or WritableComparable interface. Common Rules for creating a custom Hadoop Writable Data Type: A custom Hadoop Writable data type that needs to be used as a value field in Mapreduce programs must implement the Writable interface org.apache.hadoop.io.Writable. MapReduce key types should have the ability to compare against each […]
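As a quick illustration of these rules, below is a minimal sketch of a custom value type; the class name and its fields are made up for the example. A type that must also serve as a key would implement WritableComparable instead and add a compareTo() method.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WebPageWritable implements Writable {
    private long timestamp;
    private String url;

    // A no-argument constructor is required so Hadoop can instantiate
    // the type via reflection during deserialization.
    public WebPageWritable() { }

    public WebPageWritable(long timestamp, String url) {
        this.timestamp = timestamp;
        this.url = url;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the fields in a fixed order.
        out.writeLong(timestamp);
        out.writeUTF(url);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in exactly the same order they were written.
        timestamp = in.readLong();
        url = in.readUTF();
    }
}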


Hadoop Data Types

Hadoop provides Writable interface based data types for serialization and deserialization of data stored in HDFS and used in Mapreduce computations. Serialization: Serialization is the process of converting object data into a byte stream for transmission over the network across different nodes in a cluster or for persistent data storage. Deserialization: Deserialization is the reverse process; it converts a byte stream back into object data when reading data from HDFS. Hadoop […]
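To make the two definitions concrete, here is a small standalone Java sketch (the class name is illustrative) that serializes a built-in IntWritable to a byte stream and reads it back.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialization: object -> byte stream.
        IntWritable original = new IntWritable(42);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialization: byte stream -> object.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(restored.get()); // prints 42
    }
}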


Combiner in Mapreduce

Combiners In Mapreduce: A combiner is a semi-reducer in Mapreduce. It is an optional class that can be specified in the Mapreduce driver class to process the output of map tasks before it is submitted to reducer tasks. Purpose: In the Mapreduce framework, the output from the map tasks is usually large and the data transfer between map and reduce tasks can be high. Since data transfer across the network is expensive, and to […]
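As an illustration, below is a self-contained word count sketch in which the reducer is reused as the combiner; all class names are placeholders, and reusing the reducer this way is only valid because summing counts is commutative and associative.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCountWithCombiner.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregates map output locally
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}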


Predefined Mapper and Reducer Classes

Hadoop provides some predefined Mapper and Reducer classes in its Java API, and these are helpful when writing simple or default Mapreduce jobs. A few from the entire list of predefined mapper and reducer classes are described below. Identity Mapper: Identity Mapper is the default Mapper class provided by Hadoop, and it is picked automatically when no mapper is specified in the Mapreduce driver class. The Identity Mapper class implements […]
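For illustration, the old-API (org.apache.hadoop.mapred) sketch below wires in the predefined IdentityMapper and IdentityReducer classes explicitly; the driver class name, job name, and paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(IdentityJob.class);
        conf.setJobName("identity pass-through");

        // Both predefined classes simply forward their input (key, value) pairs unchanged.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // With the default TextInputFormat, records arrive as (offset, line) pairs.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}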


MapReduce Job Flow

Mapreduce Job Flow Through YARN Implementation: This post describes the Mapreduce job flow behind the scenes when a job is submitted to Hadoop through the submit() or waitForCompletion() method on the Job object. This Mapreduce job flow is explained with the help of the Word Count mapreduce program described in our previous post. Here the flow is described as per the YARN (Mapreduce2) implementation. The submit() method submits the job to the hadoop […]
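For reference, the hedged sketch below contrasts the two submission calls on the Job object; the class name and paths are placeholders and the mapper and reducer are left at their defaults for brevity.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submission demo");
        job.setJarByClass(SubmitDemo.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Option 1: submit() hands the job to the cluster and returns immediately.
        // job.submit();

        // Option 2: waitForCompletion(true) submits the job, then polls it and
        // prints progress and counters until it finishes.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}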


MapReduce Programming Model

In this post, we are going to review the building blocks and programming model of the example word count mapreduce program run in the previous post in this Mapreduce category. We will not go too deep into the code; our focus will mainly be on the structure of the mapreduce program written in Java, and at the end of the post we will submit the mapreduce job to execute this program. Before starting with word […]
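As a structural aid only, the skeleton below (with placeholder class names) shows how the generic type parameters of Mapper and Reducer tie the building blocks of a Java mapreduce program together.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ProgramSkeleton {

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: map() is called once per
    // input record delivered by the record reader.
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // map logic goes here
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: its input types must match
    // the mapper's output types; reduce() runs once per distinct key.
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce logic goes here
    }
}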


HDFS Offline Edits Viewer Tool – oev

HDFS OEV Tool: Similar to the Offline Image Viewer (oiv), Hadoop also provides a viewer tool for edits log files, since these are also not in a human-readable format. This is called the HDFS Offline Edits Viewer (oev) tool. This tool doesn’t require HDFS to be running and runs entirely offline. oev supports both binary (the native binary format that Hadoop uses internally) and XML input formats for parsing and provides output in […]


HDFS Offline Image Viewer Tool – oiv

Usually fsimage files, which contain the file system namespace on namenodes, are not human-readable. So, Hadoop provided the HDFS Offline Image Viewer in the hadoop-2.0.4 release to view the fsimage contents in a readable format. It is completely offline in its functionality and doesn’t require an HDFS cluster to be running. It can process very large fsimage files quickly and present them in the required output format. HDFS Offline Image Viewer: Syntax for this command:

Here […]


HDFS Distributed File Copy Tool – distcp

HDFS Distributed File Copy: Hadoop provides the HDFS Distributed File Copy (distcp) tool for copying large amounts of HDFS files within or between HDFS clusters. It is implemented on top of the Mapreduce framework and thus submits a map-only mapreduce job to parallelize the copy process. This tool is typically useful for copying files between clusters, for example from production to development environments. It supports some advanced command options while copying files. Below are the […]