Hadoop Output Formats

We discussed the input formats supported by Hadoop in the previous post. In this post, we give an overview of the Hadoop output formats and their usage.

Hadoop provides output formats corresponding to most of its input formats. All Hadoop output formats must extend the abstract class org.apache.hadoop.mapreduce.OutputFormat (in the older org.apache.hadoop.mapred API, OutputFormat is an interface).

OutputFormat describes the output specification for a MapReduce job. Based on the output specification,

  • The MapReduce framework validates the job's output settings, for example by checking that the output directory doesn’t already exist.

  • OutputFormat provides the RecordWriter implementation used to write the output files of the job.

These two requirements are met by the following two methods of OutputFormat.

checkOutputSpecs(JobContext context)

For file-based output formats, this method checks that the output directory doesn’t already exist and throws an exception if it does, so that existing output is not overwritten.

getRecordWriter(TaskAttemptContext context)

This method gets the RecordWriter for the given task.

Implementations of the abstract class org.apache.hadoop.mapreduce.RecordWriter<K,V> write the output <key, value> pairs to an output file.
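The two responsibilities described above can be illustrated with a small model that does not depend on Hadoop at all. This is only a sketch: the class OutputContractSketch, the nested LineRecordWriter, and the part-file naming are hypothetical stand-ins written with java.nio, not the real Hadoop classes.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Simplified, Hadoop-free model of the OutputFormat contract (all names hypothetical).
class OutputContractSketch {

    // Models checkOutputSpecs(): refuse to run if the output directory already exists.
    static void checkOutputSpecs(Path outputDir) throws IOException {
        if (Files.exists(outputDir)) {
            throw new IOException("Output directory " + outputDir + " already exists");
        }
    }

    // Models a RecordWriter<K, V>: write(key, value) per record, close() at the end.
    static final class LineRecordWriter<K, V> implements AutoCloseable {
        private final BufferedWriter out;

        LineRecordWriter(Path file) throws IOException {
            this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        }

        void write(K key, V value) throws IOException {
            out.write(key + "\t" + value);
            out.write('\n');
        }

        @Override
        public void close() throws IOException {
            out.close();
        }
    }

    // Models getRecordWriter(): one writer per task, each writing its own part file.
    static <K, V> LineRecordWriter<K, V> getRecordWriter(Path outputDir, int taskId)
            throws IOException {
        Files.createDirectories(outputDir);
        Path part = outputDir.resolve(String.format("part-r-%05d", taskId));
        return new LineRecordWriter<>(part);
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("output-contract").resolve("out");

        checkOutputSpecs(outputDir); // passes: the directory does not exist yet
        try (LineRecordWriter<String, Integer> writer = getRecordWriter(outputDir, 0)) {
            writer.write("hadoop", 3);
            writer.write("hive", 1);
        }
        System.out.println(Files.readAllLines(outputDir.resolve("part-r-00000")));

        try {
            checkOutputSpecs(outputDir); // fails now: a rerun would overwrite the output
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The second call to checkOutputSpecs is rejected because the first run created the directory, which mirrors why a MapReduce job with an existing output directory fails before any task starts.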

Built-In Hadoop Output Formats

Hadoop provides some built-in OutputFormat implementations in the org.apache.hadoop.mapreduce.lib.output package:

FileOutputFormat

Base class for all file-based OutputFormat implementations.

Some of the important subclasses of FileOutputFormat are:

TextOutputFormat

The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If no output format is specified explicitly, the output files are created as text files.

Output keys and values can be of any type because TextOutputFormat converts them to strings by calling toString(). Each key-value pair is written on one line, with the key and value separated by a tab by default. The separator can be changed with the mapreduce.output.textoutputformat.separator property.

KeyValueTextInputFormat is the most suitable input format for reading these text files back, since it splits each input line into a key and a value at a separator character (a tab by default).
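The line format and the way it is read back can be modeled in a few lines of plain Java. This is a rough sketch of the behavior described above, not Hadoop's code: TextLineRoundTrip, toLine, and parseLine are hypothetical names, and the model ignores details such as null keys or values.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hadoop-free model of the text line format (all names hypothetical).
class TextLineRoundTrip {
    static final String SEPARATOR = "\t"; // default separator

    // Roughly what TextOutputFormat writes: key.toString(), separator, value.toString().
    static String toLine(Object key, Object value) {
        return key.toString() + SEPARATOR + value.toString();
    }

    // Roughly what KeyValueTextInputFormat does: split at the FIRST separator.
    // A line without a separator becomes the key, with an empty value.
    static Map.Entry<String, String> parseLine(String line) {
        int pos = line.indexOf(SEPARATOR);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos),
                                 line.substring(pos + SEPARATOR.length()));
    }

    public static void main(String[] args) {
        String line = toLine("hadoop", 42);
        Map.Entry<String, String> kv = parseLine(line);
        System.out.println(kv.getKey() + " -> " + kv.getValue()); // prints "hadoop -> 42"
    }
}
```

Note that the value comes back as the string "42" rather than as a number: once records have been written as text, the original key and value types are lost and the next job has to parse them again.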

SequenceFileOutputFormat

This output format writes sequence files, which are a good choice when the output will be fed into another MapReduce job as input, since they are compact and can be compressed.
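As a sketch of that chaining, the driver below writes the first job's output as block-compressed sequence files and reads them in a second job with SequenceFileInputFormat. It needs the Hadoop client libraries on the classpath. The class name ChainedJobsSketch, the job names, and the three path arguments are assumptions made for illustration; both jobs rely on the default identity mapper and reducer, so the keys and values are the LongWritable offsets and Text lines produced by the default TextInputFormat.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical driver: args are <input> <intermediate> <output>.
public class ChainedJobsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Stage 1: text in, compressed sequence files out.
        Job first = Job.getInstance(conf, "first-stage");
        first.setJarByClass(ChainedJobsSketch.class);
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(first, true);
        SequenceFileOutputFormat.setOutputCompressionType(first, CompressionType.BLOCK);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2: sequence files in, default TextOutputFormat out.
        Job second = Job.getInstance(conf, "second-stage");
        second.setJarByClass(ChainedJobsSketch.class);
        second.setInputFormatClass(SequenceFileInputFormat.class);
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because the sequence files keep the Writable key and value types, the second job receives LongWritable and Text objects directly and does not have to parse text lines.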

SequenceFileAsBinaryOutputFormat

SequenceFileAsBinaryOutputFormat is a direct subclass of SequenceFileOutputFormat and is the counterpart of SequenceFileAsBinaryInputFormat. It writes keys and values to sequence files in raw binary format.

MapFileOutputFormat

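MapFileOutputFormat is also a direct subclass of FileOutputFormat and writes output as map files. Keys in a map file must be added in order, so the reducer has to emit its keys in sorted order.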

MultipleOutputs

The MultipleOutputs class is used to write output data to multiple outputs. Its two main use cases are:

  1. Writing job output to additional outputs besides the default output. Each additional output, or named output, may be configured with its own OutputFormat, key class, and value class.
  2. Writing data to different files whose names are provided by the user.

MultipleOutputs supports counters that count the number of records written to each output name. These counters are disabled by default.

Usage pattern for job submission:
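A minimal sketch of this pattern, based on the MultipleOutputs API in org.apache.hadoop.mapreduce.lib.output, is shown here. It needs the Hadoop client libraries on the classpath. The class names MultipleOutputsSketch and MOReducer, the named outputs "text" and "seq", and the "bykey/" base path are assumptions chosen for illustration. The job uses the default identity mapper with KeyValueTextInputFormat, so keys and values are both Text.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver: args are <input> <output>.
public class MultipleOutputsSketch {

    public static class MOReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                context.write(key, value);     // default output
                mos.write("text", key, value); // named output using TextOutputFormat
                mos.write("seq", key, value);  // named output using SequenceFileOutputFormat
                // User-chosen file name; assumes keys are few and safe to use in paths.
                mos.write(key, value, "bykey/" + key.toString());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close(); // flushes and closes every writer opened by MultipleOutputs
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multiple-outputs-sketch");
        job.setJarByClass(MultipleOutputsSketch.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setReducerClass(MOReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Each named output gets its own OutputFormat, key class, and value class.
        MultipleOutputs.addNamedOutput(job, "text", TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "seq", SequenceFileOutputFormat.class, Text.class, Text.class);
        // Counters are off by default; enable them to count records per output name.
        MultipleOutputs.setCountersEnabled(job, true);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Forgetting the close() call in cleanup is a common mistake and can leave the additional output files empty or incomplete.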


About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

