Built-in Hadoop Input Formats:
Hadoop provides several built-in InputFormat implementations in the org.apache.hadoop.mapreduce.lib.input package:
FileInputFormat: Base class for all file-based InputFormat implementations.
Some of the important subclasses of FileInputFormat are:
- TextInputFormat :
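The default InputFormat, used when no other class is specified. It treats the input files as plain text and breaks them into lines. The key is the byte offset of the line within the file (LongWritable) and the value is the contents of the line (Text).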
- KeyValueTextInputFormat :
An InputFormat for plain text files. Files are broken into lines, and each line is divided into key and value parts by a separator byte. If no such byte exists, the key is the entire line and the value is empty.
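The separator rule above can be illustrated with a few lines of plain Java. This is only a sketch of the described behavior, not Hadoop's record reader: the class name KeyValueLineSketch and its split() method are made up for illustration, and the tab separator is passed in explicitly (tab is Hadoop's default separator).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
final class KeyValueLineSketch {

    // Split a line at the FIRST occurrence of the separator.
    // If the separator is absent, the whole line becomes the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        System.out.println(split("user42\tlogin ok", '\t'));   // prints user42=login ok
        System.out.println(split("no-separator-here", '\t'));  // prints no-separator-here=
    }
}
```

Only the first separator matters: any later separators stay inside the value.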
- FixedLengthInputFormat :
An input format for reading input files with fixed-length records. The files need not be text files; they can be binary. Users must configure the record length by calling: FixedLengthInputFormat.setRecordLength(conf, recordLength);
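What "fixed-length records" means can be shown with a small, Hadoop-free sketch. The class FixedLengthRecordSketch and its records() method are hypothetical names used only here. Rejecting a partial trailing record is a choice made in this sketch, not a statement about how Hadoop is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
final class FixedLengthRecordSketch {

    // Cut the data into consecutive records of exactly recordLength bytes.
    // The content is treated as opaque bytes, so it works for text and binary alike.
    static List<byte[]> records(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (data.length % recordLength != 0) {
            // Sketch assumption: a trailing partial record is treated as an error.
            throw new IllegalArgumentException("data ends with a partial record");
        }
        List<byte[]> out = new ArrayList<>();
        for (int off = 0; off < data.length; off += recordLength) {
            out.add(Arrays.copyOfRange(data, off, off + recordLength));
        }
        return out;
    }
}
```

For example, six bytes with a record length of 3 yield two records of three bytes each.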
- NLineInputFormat :
It puts N lines of input into each split, and each split is fed to a single map task. By default N is 1, so each map task receives exactly one line of input. As with TextInputFormat, the key is the byte offset of the line and the value is the line itself, i.e. (k,v) is (LongWritable, Text).
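The grouping of lines into splits can be sketched in plain Java as follows. NLineSplitSketch and splits() are hypothetical names for illustration. In a real job the number of lines per split is set on the job (NLineInputFormat.setNumLinesPerSplit(job, n)) rather than computed by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
final class NLineSplitSketch {

    // Group the lines into chunks of at most n lines.
    // Each chunk stands for one input split, i.e. the input of one map task.
    static List<List<String>> splits(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            out.add(new ArrayList<>(lines.subList(i, end)));
        }
        return out;
    }
}
```

With five input lines and n = 2, this yields three splits: two with two lines each and a last one with the remaining line.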
- CombineFileInputFormat :
This input format is suited to processing a huge number of small files. CombineFileInputFormat packs many small files into each split so that each mapper has more to process. This improves the efficiency of a MapReduce job, because fewer map tasks are needed to process the same set of small files.
It is an abstract InputFormat class whose getSplits() method returns CombineFileSplits built from the files under the input paths.
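The idea of packing many small files into fewer splits can be sketched with a simple greedy grouping. This is a deliberately simplified illustration: the class CombineSmallFilesSketch, the pack() method and the maxSplitBytes parameter are hypothetical, and the sketch ignores aspects a real implementation has to handle, such as data locality and breaking up large files.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
final class CombineSmallFilesSketch {

    // Greedily pack file sizes (in bytes) into splits of at most maxSplitBytes.
    // In this sketch a file larger than the limit simply gets a split of its own.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would exceed the limit.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With file sizes of 10, 20, 30, 40 and 50 bytes and a limit of 60 bytes, five files end up in three splits instead of five, which corresponds to three map tasks instead of five.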
- SequenceFileInputFormat :
An InputFormat for SequenceFiles, Hadoop's own binary file format for storing key-value pairs efficiently.
- SequenceFileAsTextInputFormat :
SequenceFileAsTextInputFormat is a subclass of SequenceFileInputFormat. It is similar to SequenceFileInputFormat, except that it creates a SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling toString().
- SequenceFileAsBinaryInputFormat :
SequenceFileAsBinaryInputFormat is another subclass of SequenceFileInputFormat. It reads keys and values from SequenceFiles in raw binary form.
- MultipleInputs :
This is a helper class rather than an InputFormat itself. It supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path.
- DBInputFormat :
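An InputFormat that reads input data from an SQL table. Unlike the classes above, it is not file-based: it extends InputFormat directly and lives in the org.apache.hadoop.mapreduce.lib.db package. DBInputFormat emits LongWritables containing the record number as key and DBWritables as value. The SQL query and input class can be specified using one of the two setInput() methods.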
Prevent Input File Splitting:
If we don’t want files to be split, so that a single mapper processes each input file in its entirety, we can override the isSplitable() method of the FileInputFormat class to return false.
For example, here’s a non-splittable FileInputFormat implementation:
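import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Declared abstract because FileInputFormat leaves createRecordReader() to concrete subclasses.
public abstract class NonSplittableFileInputFormat<K, V> extends FileInputFormat<K, V> {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never split: each file goes to exactly one mapper
    }
}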
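To see the effect of returning false from isSplitable(), the following Hadoop-free sketch computes a simplified split plan for a single file. The names SplitPlanSketch, plan() and splitSize are hypothetical. The real split computation in FileInputFormat takes more factors into account (block size, configured minimum and maximum split sizes), so treat this only as a rough model.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical class, not part of Hadoop.
final class SplitPlanSketch {

    // Returns the splits of one file as {offset, length} pairs.
    // A non-splittable file always yields a single split covering the whole file.
    static List<long[]> plan(long fileLength, long splitSize, boolean splittable) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> out = new ArrayList<>();
        if (!splittable || fileLength <= splitSize) {
            out.add(new long[] {0, fileLength});
            return out;
        }
        for (long off = 0; off < fileLength; off += splitSize) {
            out.add(new long[] {off, Math.min(splitSize, fileLength - off)});
        }
        return out;
    }
}
```

For a 300-byte file and a split size of 128, the splittable case produces three splits (128, 128 and 44 bytes) and therefore three map tasks, while the non-splittable case produces one split of 300 bytes handled by a single mapper.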