Hadoop Input Formats:
In our Mapreduce Job Flow post, we discussed how files are broken into splits at job startup and how the data in each split is sent to the mapper implementation. In this post, we take a detailed look at the input formats supported by Hadoop and MapReduce and at how input files are processed in a MapReduce job.
Input Splits, Input Formats and Record Reader:
As we discussed earlier, on job startup, each input file is broken into splits and each map task processes a single split. Each split is further divided into records of key/value pairs, which the map task processes one record at a time.
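To expose the details of a split, Hadoop provides the abstract InputSplit class in the org.apache.hadoop.mapreduce package. It declares two methods: getLength(), which returns the size of the split in bytes, and getLocations(), which returns the hosts where the split's data is stored.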
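The shape of InputSplit can be sketched in plain Java without the Hadoop libraries. This is a simplified stand-in, not the Hadoop source: SplitSketch and FileSplitSketch are hypothetical names, and the real methods also declare IOException and InterruptedException.

```java
// Simplified stand-in for org.apache.hadoop.mapreduce.InputSplit (not the Hadoop source).
abstract class SplitSketch {
    // Size of the split in bytes.
    abstract long getLength();

    // Hostnames where the split's data is stored, so map tasks can run near the data.
    abstract String[] getLocations();
}

// Hypothetical file-based split: a byte range of one file plus the hosts that hold it.
final class FileSplitSketch extends SplitSketch {
    private final String path;
    private final long start;
    private final long length;
    private final String[] hosts;

    FileSplitSketch(String path, long start, long length, String[] hosts) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.hosts = hosts.clone();
    }

    String getPath() { return path; }

    long getStart() { return start; }

    @Override
    long getLength() { return length; }

    @Override
    String[] getLocations() { return hosts.clone(); }
}
```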
Using the getLength() and getLocations() methods of InputSplit, a programmer can get the length of a split and its storage locations. A good split size is equal to the HDFS block size. If the splits are much smaller than the HDFS block size, the overhead of managing splits and creating map tasks can outweigh the actual job execution time.
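To make the cost of small splits concrete, here is a small arithmetic sketch. The computeSplitSize() rule mirrors the one used by Hadoop's FileInputFormat (the block size clamped between a minimum and a maximum split size). The SplitMath class and countSplits() are hypothetical helpers, and the count ignores the small slack Hadoop allows on the last split.

```java
final class SplitMath {
    private SplitMath() {}

    // Clamp the block size between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of splits (and so map tasks) for one file: ceiling division.
    static long countSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }
}
```

For a 1 GB file with 128 MB blocks this gives 8 splits. Capping the split size at 1 MB gives 1024 splits, and therefore 1024 map tasks to create and schedule for the same data.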
A MapReduce programmer does not need to manage these file splits directly, because Hadoop provides the InputFormat class in the org.apache.hadoop.mapreduce package, which has the two responsibilities below.
- To provide details on how to split an input file into splits.
- To create a RecordReader that will generate the series of key/value pairs from a split.
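To meet these two requirements, the abstract InputFormat class declares two methods: getSplits(), which returns the list of splits for a job, and createRecordReader(), which returns the RecordReader for a given split.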
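The two responsibilities can be sketched with a toy format over an in-memory list of lines. Everything here is a simplified stand-in rather than the Hadoop source: RangeSplit, ReaderSketch, InputFormatSketch and LinesInputFormat are hypothetical names. In the real API, getSplits() takes a JobContext and createRecordReader() takes an InputSplit and a TaskAttemptContext.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a split: the line indexes [start, start + length).
final class RangeSplit {
    final int start;
    final int length;

    RangeSplit(int start, int length) {
        this.start = start;
        this.length = length;
    }
}

// Stand-in for RecordReader, reduced to the three calls used while iterating.
interface ReaderSketch<K, V> {
    boolean nextKeyValue();
    K getCurrentKey();
    V getCurrentValue();
}

// Simplified stand-in for org.apache.hadoop.mapreduce.InputFormat<K, V>.
abstract class InputFormatSketch<K, V> {
    // Responsibility 1: describe how the input is divided into splits.
    abstract List<RangeSplit> getSplits();

    // Responsibility 2: build a reader that turns one split into key/value records.
    abstract ReaderSketch<K, V> createRecordReader(RangeSplit split);
}

// Hypothetical format: a fixed number of lines per split, key = line index, value = line.
final class LinesInputFormat extends InputFormatSketch<Integer, String> {
    private final List<String> lines;
    private final int linesPerSplit;

    LinesInputFormat(List<String> lines, int linesPerSplit) {
        if (linesPerSplit <= 0) {
            throw new IllegalArgumentException("linesPerSplit must be positive");
        }
        this.lines = lines;
        this.linesPerSplit = linesPerSplit;
    }

    @Override
    List<RangeSplit> getSplits() {
        List<RangeSplit> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += linesPerSplit) {
            splits.add(new RangeSplit(start, Math.min(linesPerSplit, lines.size() - start)));
        }
        return splits;
    }

    @Override
    ReaderSketch<Integer, String> createRecordReader(RangeSplit split) {
        return new ReaderSketch<Integer, String>() {
            private int pos = split.start - 1; // positioned before the first record

            public boolean nextKeyValue() {
                if (pos + 1 >= split.start + split.length) {
                    return false; // no more records in this split
                }
                pos++;
                return true;
            }

            public Integer getCurrentKey() { return pos; }

            public String getCurrentValue() { return lines.get(pos); }
        };
    }
}
```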
RecordReader is also an abstract class in the org.apache.hadoop.mapreduce package. It declares initialize(), nextKeyValue(), getCurrentKey(), getCurrentValue(), getProgress() and close().
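A minimal sketch of the RecordReader lifecycle, again in plain Java rather than the Hadoop source. RecordReaderSketch and LineReaderSketch are hypothetical names, and initialize() takes the split's text directly as an assumption made for this example, where the real method takes an InputSplit and a TaskAttemptContext. The reader uses the character offset of each line as the key and the line text as the value, similar in spirit to how TextInputFormat keys each line by its byte offset.

```java
import java.io.Closeable;

// Simplified stand-in for org.apache.hadoop.mapreduce.RecordReader<K, V>.
abstract class RecordReaderSketch<K, V> implements Closeable {
    abstract void initialize(String splitData); // called once before reading
    abstract boolean nextKeyValue();            // advance; false when the split is exhausted
    abstract K getCurrentKey();
    abstract V getCurrentValue();
    abstract float getProgress();               // fraction of the split consumed, 0.0 to 1.0

    @Override
    public abstract void close();
}

// Hypothetical reader: key = character offset of the line in the split, value = line text.
final class LineReaderSketch extends RecordReaderSketch<Long, String> {
    private String[] lines = new String[0];
    private int index = -1;
    private long offset = 0;
    private long nextOffset = 0;

    @Override
    void initialize(String splitData) {
        lines = splitData.isEmpty() ? new String[0] : splitData.split("\n");
        index = -1;
        offset = 0;
        nextOffset = 0;
    }

    @Override
    boolean nextKeyValue() {
        if (index + 1 >= lines.length) {
            return false;
        }
        index++;
        offset = nextOffset;
        nextOffset += lines[index].length() + 1; // +1 for the newline separator
        return true;
    }

    @Override
    Long getCurrentKey() { return offset; }

    @Override
    String getCurrentValue() { return lines[index]; }

    @Override
    float getProgress() {
        return lines.length == 0 ? 1.0f : (index + 1) / (float) lines.length;
    }

    @Override
    public void close() { lines = new String[0]; }
}
```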
Thus, the RecordReader creates key/value pairs from an input split and makes them available through the Context, which is shared with the Mapper class. The Mapper class’s run() method retrieves these key/value pairs from the context by calling the getCurrentKey() and getCurrentValue() methods and passes them to the map() method for further processing of the record.
Mapper’s run() method:
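The control flow of run() can be sketched as follows: setup() once, map() once per record while nextKeyValue() returns true, then cleanup(). This is a simplified stand-in that runs without Hadoop. ContextSketch, MapperSketch and LineLengthMapper are hypothetical names, and the real Context offers far more than the four calls shown here.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Stand-in for Mapper.Context: hands records to the mapper and collects its output.
final class ContextSketch<K, V> {
    private final Iterator<Map.Entry<K, V>> records;
    private Map.Entry<K, V> current;
    final List<String> output = new ArrayList<>();

    ContextSketch(List<Map.Entry<K, V>> input) {
        this.records = input.iterator();
    }

    boolean nextKeyValue() {
        if (!records.hasNext()) {
            return false;
        }
        current = records.next();
        return true;
    }

    K getCurrentKey() { return current.getKey(); }

    V getCurrentValue() { return current.getValue(); }

    void write(String key, int value) { output.add(key + "=" + value); }
}

// Simplified stand-in for org.apache.hadoop.mapreduce.Mapper.
class MapperSketch<K, V> {
    protected void setup(ContextSketch<K, V> context) {}

    protected void map(K key, V value, ContextSketch<K, V> context) {}

    protected void cleanup(ContextSketch<K, V> context) {}

    // Pull each record from the context and hand it to map(), one record at a time.
    public void run(ContextSketch<K, V> context) {
        setup(context);
        try {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        } finally {
            cleanup(context);
        }
    }
}

// Hypothetical mapper that emits each line together with its length.
final class LineLengthMapper extends MapperSketch<Long, String> {
    @Override
    protected void map(Long key, String value, ContextSketch<Long, String> context) {
        context.write(value, value.length());
    }
}
```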
In this way, the key/value pair from each input record is finally passed to the map() method.