Hadoop Output Formats
We have discussed input formats supported by hadoop in previous post. In this post, we will have an overview of the hadoop output formats and their usage.
Hadoop provides output formats that corresponding to each input format. All hadoop output formats must implement the interface org.apache.hadoop.mapreduce.OutputFormat.
OutputFormat describes the output-specification for a Map-Reduce job. Based on Output specification,
Mapreduce job checks that the output directory doesn’t already exist.
OutputFormat provides the RecordWriter implementation to be used to write out the output files of the job.
These two requirements of the OutputFormat are accomplished with below two methods in the interface.
This method checks that output directory doesn’t exist already and throws an exception when it already exists, so that output is not overwritten.
This method Gets the RecordWriter for the given task.
org.apache.hadoop.mapreduce.RecordWriter<K,V> class implementations are used to write the output <key, value> pairs to an output file.
Built-In Hadoop Output Formats
Hadoop provided some built in InputFormat implementations in the org.apache.hadoop.mapreduce.lib.output package:
Base class for all file-based OutputFormat implementations.
Some of the important sub classes of the FileOutputFormat class are:
The default output format provided by hadoop is TextOuputFormat and it writes records as lines of text. If file output format is not specified explicitly, then text files are created as output files.
Output Key-value pairs can be of any format because TextOutputFormat converts these into strings with toString() method. Output key-value pairs are tab delimited by default.
For reading these output text files as input, KeyValueTextInputFormat is best suitable, since it breaks input lines into key value pairs based on a separator character.
This output format class is useful to write out sequence files which is a best option when the output files need to be fed into another mapreduce jobs as input files, since these are compressed and compact.
SequenceFileAsBinaryOutputFormat is a direct subclass of SequenceFileOutputFormat and it is counter part for SequenceFileAsBinaryInputFormat. It writes keys and values to Sequence Files in binary format.
It is also a direct subclass of FileOutputFormat and it is used to write output as Map files.
The MultipleOutputs class is used to write output data to multiple outputs. Below are the two main use cases of MultipleOutputs.
Job output can be written to additional outputs other than the default output. Each additional output, or named output, may be configured with its own
OutputFormat, with its own key class and value class.
- Write data to different files provided by user
MultipleOutputs supports counters to count the number records written to each output name. But these are disabled by default.
Usage pattern for job submission:
[Read Next Page]