Hadoop Input Formats

Built-in Hadoop Input Formats:

Hadoop provides several built-in InputFormat implementations, most of them in the org.apache.hadoop.mapreduce.lib.input package:

FileInputFormat: Base class for all file-based InputFormat implementations.

Some of the important subclasses of the FileInputFormat class, and a few related input classes, are:

  • TextInputFormat :

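The default InputFormat, used when no other class is specified. It treats the input files as plain text files broken into lines. The key is the byte offset of the line within the file (LongWritable) and the value is the contents of the line (Text).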

  • KeyValueTextInputFormat : 

An InputFormat for plain text files. Files are broken into lines, and each line is divided into key and value parts by a separator byte (a tab by default). If no such byte exists, the key will be the entire line and the value will be empty.
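
The splitting rule described above can be illustrated without Hadoop at all. The following is a minimal plain-Java sketch, not Hadoop's own code: the class and method names are hypothetical, and it works on characters rather than raw bytes. It splits at the first separator only, which is how the rule is usually understood.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical illustration of the key/value rule; not part of Hadoop.
final class KeyValueLineSketch {

    // Splits a line at the FIRST occurrence of the separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // Key is "user42"; the value keeps the second tab.
        System.out.println(split("user42\tclicked\thome", '\t'));
        // No separator: the key is the whole line, the value is empty.
        System.out.println(split("no-separator-here", '\t'));
    }
}
```

In a real job the separator is read from the job configuration; the property is mapreduce.input.keyvaluelinerecordreader.key.value.separator.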

  • FixedLengthInputFormat : 

An input format for reading input files with fixed-length records. These need not be text files; they can be binary files. Users must configure the record length by calling: FixedLengthInputFormat.setRecordLength(conf, recordLength);
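
To make the idea of fixed-length records concrete, here is a minimal plain-Java sketch that cuts a byte array into records of a given length. It is an illustration only and does not use Hadoop: the class and method names are hypothetical, and rejecting a trailing partial record is a choice made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of fixed-length record reading; not part of Hadoop.
final class FixedLengthSketch {

    // Cuts the data into consecutive records of exactly recordLength bytes.
    static List<byte[]> records(byte[] data, int recordLength) {
        if (recordLength <= 0) {
            throw new IllegalArgumentException("recordLength must be positive");
        }
        if (data.length % recordLength != 0) {
            throw new IllegalArgumentException("trailing partial record");
        }
        List<byte[]> out = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += recordLength) {
            out.add(Arrays.copyOfRange(data, offset, offset + recordLength));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "AAABBBCCC".getBytes(StandardCharsets.US_ASCII);
        // Prints each 3-byte record on its own line.
        for (byte[] record : records(data, 3)) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```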

  • NLineInputFormat :

Groups N lines of input into one split, which is fed to a single map task. It is useful in applications such as parameter sweeps, where each line of a control file describes one unit of work. By default N is 1, so one line is fed as the value to each map task, with the line's byte offset as the key, i.e. (k, v) is (LongWritable, Text).
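
The grouping itself is simple to picture. The sketch below, in plain Java with hypothetical names, groups a list of lines into chunks of N, each chunk standing in for one split. It is an illustration of the idea and not Hadoop's implementation, which works on byte ranges of a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of N-line grouping; not part of Hadoop.
final class NLineSketch {

    // Groups the lines into consecutive chunks of at most n lines each.
    static List<List<String>> group(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < lines.size(); start += n) {
            int end = Math.min(start + n, lines.size());
            splits.add(new ArrayList<>(lines.subList(start, end)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c", "d", "e");
        // With n = 2 the last chunk holds the single leftover line.
        System.out.println(group(lines, 2));
    }
}
```

In a real job the number of lines per split is set with NLineInputFormat.setNumLinesPerSplit(job, n).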

  • CombineFileInputFormat :

This input format is suitable for processing a huge number of small files. CombineFileInputFormat packs many small files into each split so that each mapper has more to process. It can thus improve the efficiency of a MapReduce job by using fewer map tasks to process the same files. It is an abstract class; CombineTextInputFormat is a ready-made concrete subclass for text files.
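
The packing idea can be sketched in plain Java as follows. This is a deliberately simplified illustration with hypothetical names: it packs file sizes greedily, in order, up to a maximum split size, and ignores the node and rack locality that the real class also takes into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of packing small files into splits; not part of Hadoop.
final class CombineSketch {

    // Greedily packs file sizes, in order, into splits of at most maxSplitSize bytes.
    // A single file larger than maxSplitSize gets a split of its own.
    static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = {40, 40, 40, 100, 10};
        // Five files end up in fewer splits than one-split-per-file would give.
        System.out.println(pack(sizes, 100));
    }
}
```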

  • MultiFileInputFormat :

An abstract InputFormat class whose getSplits() method returns MultiFileSplits built from the files under the input paths. It belongs to the older org.apache.hadoop.mapred API.

  • SequenceFileInputFormat :

An InputFormat for sequence files, Hadoop's own binary file format of key-value pairs, designed for efficient file processing.

  • SequenceFileAsTextInputFormat :

SequenceFileAsTextInputFormat is a subclass of SequenceFileInputFormat. It is similar to SequenceFileInputFormat, except that it uses SequenceFileAsTextRecordReader, which converts the input keys and values to their String forms by calling their toString() method.

  • SequenceFileAsBinaryInputFormat :

SequenceFileAsBinaryInputFormat is another subclass of SequenceFileInputFormat. It is an input format for reading keys and values from sequence files in binary (raw) format.

  • MultipleInputs :

This class supports MapReduce jobs that have multiple input paths, with a different InputFormat and Mapper for each path. It is a helper class rather than an InputFormat itself.

  • DBInputFormat :

An InputFormat that reads input data from an SQL table. It is found in the org.apache.hadoop.mapreduce.lib.db package. DBInputFormat emits LongWritables containing the record number as the key and DBWritables as the value. The SQL query and the input class can be set using one of the two setInput() methods.

Prevent Input File Splitting:

If we don't want files to be split, so that a single mapper processes each input file in its entirety, we can override the isSplitable() method of the FileInputFormat class to return false.

For example, here’s a non-splittable FileInputFormat implementation:
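
A minimal sketch of such a class might look like the following, assuming the newer org.apache.hadoop.mapreduce API and plain-text input. The class name NonSplittableTextInputFormat is hypothetical; it extends TextInputFormat so that records are still read line by line, and it needs the Hadoop client libraries on the classpath to compile.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical class name; any FileInputFormat subclass can be used as the parent.
public class NonSplittableTextInputFormat extends TextInputFormat {

    // Returning false tells the framework to create exactly one split per file,
    // so each file is processed from start to end by a single mapper.
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}
```

The class can then be selected for a job with job.setInputFormatClass(NonSplittableTextInputFormat.class).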

About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
