Hadoop Input Formats:

In our MapReduce Job Flow post, we discussed how input files are broken into splits as part of job startup and how the data in each split is sent to a mapper implementation. In this post, we will take a detailed look at the input formats supported by Hadoop and MapReduce, and at how input files are processed in a MapReduce job.

Input Splits, Input Formats and Record Reader:

As we discussed earlier, on job startup each input file is broken into splits, and each map task processes a single split. Each split is further divided into records of key/value pairs, which the map task processes one record at a time.

To expose the details of a split, Hadoop provides the abstract InputSplit class in the org.apache.hadoop.mapreduce package, whose relevant methods are as follows.
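A sketch of the class as defined in the Hadoop 2.x API (method bodies are omitted because the class is abstract):

package org.apache.hadoop.mapreduce;

import java.io.IOException;

public abstract class InputSplit {

  // Returns the size of the split in bytes, so the framework can
  // sort splits and schedule the largest ones first.
  public abstract long getLength() throws IOException, InterruptedException;

  // Returns the hostnames where the split's data is stored; the
  // scheduler uses these to place map tasks close to their data.
  public abstract String[] getLocations() throws IOException, InterruptedException;
}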

From these two methods, a programmer can get the length of a split and its storage locations. A good input split size is equal to the HDFS block size. If splits are much smaller than the HDFS block size, then managing the splits and creating map tasks adds overhead that can outweigh the actual job execution time.
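For illustration, here is a minimal sketch of keeping split sizes near the block size through FileInputFormat; the class name SplitSizeDemo and the 128/256 MB values are arbitrary examples, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "split-size-demo");
    // FileInputFormat computes splitSize = max(minSize, min(maxSize, blockSize)),
    // so raising the minimum prevents many tiny splits and map tasks.
    FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // 128 MB
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
  }
}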

However, these file splits need not be handled by the MapReduce programmer, because Hadoop provides the InputFormat class in the org.apache.hadoop.mapreduce package to cover the following two responsibilities.

  • To provide details on how to split an input file into splits.
  • To create a RecordReader class that will generate the series of key/value pairs from a split.

To meet these two requirements, Hadoop defines the abstract InputFormat class with two corresponding methods, shown below.
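A sketch of the class as it appears in the Hadoop 2.x API:

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {

  // Responsibility 1: logically split the job's input files into InputSplits.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Responsibility 2: create the RecordReader that turns a split
  // into key/value pairs for the mapper.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}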

The RecordReader class is also an abstract class in the org.apache.hadoop.mapreduce package:
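Its key methods, again sketched from the Hadoop 2.x API:

package org.apache.hadoop.mapreduce;

import java.io.Closeable;
import java.io.IOException;

public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {

  // Called once at initialization with the split to read and the task context.
  public abstract void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;

  // Advances to the next key/value pair; returns false at the end of the split.
  public abstract boolean nextKeyValue() throws IOException, InterruptedException;

  // Current key and value, valid after a successful nextKeyValue().
  public abstract KEYIN getCurrentKey() throws IOException, InterruptedException;
  public abstract VALUEIN getCurrentValue() throws IOException, InterruptedException;

  // Progress through the split, from 0.0 to 1.0.
  public abstract float getProgress() throws IOException, InterruptedException;

  // Closes the record reader.
  public abstract void close() throws IOException;
}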

Thus, the record reader creates key/value pairs from an input split and makes them available through the Context, which is shared with the Mapper class. The Mapper class's run() method retrieves these key/value pairs from the context by calling the getCurrentKey() and getCurrentValue() methods and passes them on to the map() method for further processing of each record.

Mapper's run() method:
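A sketch of the default implementation in the Mapper class (recent Hadoop 2.x versions wrap the loop in try/finally so cleanup() always runs):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    // Pull records from the context one at a time and hand each to map().
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}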

Thus, the key/value pairs from each input record are finally passed to the map() method.
