Hadoop Sequence Files example


In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post we will discuss the basic details and format of Hadoop Sequence Files, along with examples.

Hadoop Sequence Files:

Advantages:

  • As binary files, these are more compact than text files.
  • They provide optional support for compression at two levels – record and block.
  • Files can be split and processed in parallel.
  • As HDFS and MapReduce are optimized for large files, Sequence Files can be used as containers for a large number of small files, thus working around Hadoop's drawback of handling a huge number of small files.
  • They are extensively used in MapReduce jobs as input and output formats. Internally, the temporary outputs of maps are also stored using the Sequence File format.

Limitations:

  • Similar to other Hadoop files, SequenceFiles are append only.
  • As these are specific to Hadoop, as of now, only a Java API is available to interact with sequence files. Multi-language support is not yet provided.

Hadoop Sequence File Format:

Hadoop SequenceFile is a flat file consisting of binary key/value pairs. Based on the compression type, there are three different SequenceFile formats:

  • Uncompressed format
  • Record Compressed format
  • Block-Compressed format

A sequence file consists of a header followed by one or more records. All three formats above use the same header structure, as shown below.

SequenceFile Header

The first 3 bytes of a sequence file are “SEQ”, which denote that the file is a sequence file, followed by 1 byte representing the actual version number (e.g. SEQ4 or SEQ6).

A sync marker denotes the end of the header. Sync markers also permit seeking to a random point in a file, which is required to efficiently split large files for parallel processing by MapReduce.

Metadata is a secondary key/value list that can be written during the initialization of the sequence file writer.

Below are the formats of the three sequence file types listed above.

Uncompressed SequenceFile Format

  • Header as shown above
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync-marker after every few 100 bytes or so.

Below is the record structure.

SequenceFile Record

Record-Compressed SequenceFile Format

  • Header as shown above
  • Record
    • Record length
    • Key length
    • Key
    • Compressed Value (only values are compressed here but not keys)
  • A sync-marker after every few 100 bytes or so.

Block-Compressed SequenceFile Format

  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker after every block.

Below is a horizontal view of the record-structure and block-structure sequence file formats.

Sequence File Format

SequenceFiles Java API:

Apache Hadoop provides various classes to create, read, and sort SequenceFiles; below are some of the important classes useful in dealing with Hadoop sequence files.

  • SequenceFile – org.apache.hadoop.io.SequenceFile

This is the main class used to create/write and read sequence files. It provides the SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

For compressed sequence file creation there are special classes, SequenceFile.RecordCompressWriter and SequenceFile.BlockCompressWriter.

But to create an instance of any of the above writer flavors, we use one of the static createWriter() methods. There are several overloaded versions, but they all require, at a minimum, a Configuration object and varargs Writer.Option… arguments to specify the options to create the file with.

In the varargs Writer.Option list, we need to specify at least the file name, file system, and the key and value classes to create the sequence file. The compression type, codec, write progress, and a Metadata instance to be stored in the SequenceFile header can be provided optionally.

Example call to the createWriter() static method:
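
The code listing from the original post is not reproduced here, so below is a minimal sketch of a createWriter() call using the Writer.Option style. The path /user/hadoop/data.seq and the IntWritable/Text key and value classes are just assumptions for illustration.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CreateWriterExample {

  public static SequenceFile.Writer createWriter(Configuration conf) throws IOException {
    // Hypothetical output path; replace with your own HDFS location.
    Path path = new Path("/user/hadoop/data.seq");

    // createWriter() takes a Configuration plus varargs Writer.Option values:
    // the file to create, the key and value classes, and optional compression.
    return SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(path),
        SequenceFile.Writer.keyClass(IntWritable.class),
        SequenceFile.Writer.valueClass(Text.class),
        SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
  }
}
```

Passing CompressionType.RECORD or CompressionType.BLOCK through the compression option is what makes createWriter() return the RecordCompressWriter or BlockCompressWriter flavor mentioned above.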

Once we have a SequenceFile.Writer instance, we can write key/value pairs using the append() method. After we finish writing, we need to call the close() method.
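
As a rough sketch of that write loop (the record count and values below are arbitrary placeholders):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class AppendExample {

  // Assumes a writer created with IntWritable keys and Text values,
  // for example by the createWriter() sketch shown above.
  public static void writeRecords(SequenceFile.Writer writer) throws IOException {
    IntWritable key = new IntWritable();
    Text value = new Text();
    try {
      for (int i = 0; i < 100; i++) {
        key.set(i);
        value.set("record-" + i);
        writer.append(key, value);   // write one key/value pair
      }
    } finally {
      writer.close();                // flush buffers and release the underlying stream
    }
  }
}
```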

Similar to the writer, a SequenceFile.Reader instance is used to read sequence files, and it can read any of the SequenceFile formats created with the above Writer instances.

An instance of the SequenceFile.Reader class can be created with one of its constructors.

Example call to the SequenceFile.Reader constructor:
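
The original method definition and call are not shown here; the following sketch constructs a Reader with the Configuration-plus-Reader.Option constructor, assuming the same hypothetical path as above.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class ReaderConstructorExample {

  public static SequenceFile.Reader openReader(Configuration conf) throws IOException {
    // Hypothetical input path; replace with the file written earlier.
    Path path = new Path("/user/hadoop/data.seq");

    // The Reader is constructed with a Configuration and varargs Reader.Option values.
    return new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
  }
}
```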

Once we have a Reader instance, we can iterate over all records by repeatedly invoking one of its next() methods. Below are some of the interesting methods of the Reader class.

The next() method returns true if it finds the next key/value pair, and false otherwise.
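
A typical read loop is sketched below. It uses ReflectionUtils to instantiate whatever key and value classes are recorded in the file header, and getPosition() and syncSeen() are included only to illustrate two more of the Reader methods.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ReadLoopExample {

  public static void readAll(Configuration conf, Path path) throws IOException {
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));

      // Instantiate key and value objects of the classes stored in the file header.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      long position = reader.getPosition();
      while (reader.next(key, value)) {                  // returns false at end of file
        String syncSeen = reader.syncSeen() ? "*" : "";  // '*' marks a sync point
        System.out.printf("[%s%s]\t%s\t%s%n", position, syncSeen, key, value);
        position = reader.getPosition();                 // start of the next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
```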

The SequenceFile.Sorter class sorts key/value pairs in a sequence-format file. An instance can be created by calling one of its constructors.
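
As a sketch, assuming IntWritable keys, Text values and hypothetical input/output paths:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SorterExample {

  public static void sortFile(Configuration conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);

    // The Sorter needs the file system, the key and value classes, and the configuration.
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(fs, IntWritable.class, Text.class, conf);

    // Hypothetical paths; sort() writes a new sequence file with sorted keys.
    Path[] inputs = { new Path("/user/hadoop/data.seq") };
    Path output = new Path("/user/hadoop/data-sorted.seq");
    sorter.sort(inputs, output, false);   // false: keep the input files
  }
}
```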

Complete Hadoop Sequence Files example:
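
The full listing from the original post is not reproduced here; the self-contained sketch below simply ties the earlier pieces together by writing a small sequence file and reading it back. The path, record count and values are placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDemo {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path("/user/hadoop/demo.seq");   // hypothetical HDFS path

    // Write a handful of IntWritable/Text records with block compression.
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(IntWritable.class),
          SequenceFile.Writer.valueClass(Text.class),
          SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
      IntWritable key = new IntWritable();
      Text value = new Text();
      for (int i = 0; i < 10; i++) {
        key.set(i);
        value.set("value-" + i);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }

    // Read the records back and print them.
    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
      Writable k = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable v = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(k, v)) {
        System.out.println(k + "\t" + v);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
```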

