Reading and Writing SequenceFile Example

This post is a continuation of the previous post on Hadoop sequence files. In this post we will discuss reading and writing SequenceFile examples using the Apache Hadoop 2 API.

Writing Sequence File Example:

As discussed in the previous post, we will use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance, and the append(key, value) method to insert each record into the sequence file.

In the example program below, we read the contents of a text file (syslog) on the local file system and write them to a sequence file on Hadoop. Here, we use an integer counter as the key and each line from the input file as the value in the sequence file's <key, value> format.

To verify the (key, value) pairs in the sequence file, we print the first 50 records to the console. Copy the code snippet below into a SequenceFileWrite.java program file.
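A minimal sketch of such a writer, using the Hadoop 2 createWriter(conf, opts) API described above (the argument handling and counter logic here are illustrative, not the post's exact code):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWrite {

  public static void main(String[] args) throws IOException {
    // args[0] = local input file (e.g. syslog), args[1] = output path (e.g. /out/syslog.seq)
    String inFile = args[0];
    Path outPath = new Path(args[1]);

    Configuration conf = new Configuration();
    IntWritable key = new IntWritable();
    Text value = new Text();

    SequenceFile.Writer writer = null;
    BufferedReader reader = null;
    try {
      // Hadoop 2 API: options are passed as varargs Writer.Option
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(outPath),
          SequenceFile.Writer.keyClass(key.getClass()),
          SequenceFile.Writer.valueClass(value.getClass()));

      reader = new BufferedReader(new FileReader(inFile));
      String line;
      int counter = 0;
      while ((line = reader.readLine()) != null) {
        counter++;
        key.set(counter);        // integer counter as key
        value.set(line);         // each input line as value
        writer.append(key, value);
        if (counter <= 50) {     // print first 50 records for verification
          System.out.println(key + "\t" + value);
        }
      }
    } finally {
      IOUtils.closeStream(writer);
      if (reader != null) reader.close();
    }
  }
}
```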

Compile this program, build a jar file (say Seq.jar), and use this jar file to run the SequenceFileWrite program on Hadoop.

Run it with the command below.
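Assuming the jar and class names above, the run command would look something like:

```shell
# 'syslog' is the local input file, /out/syslog.seq the output sequence file on HDFS
hadoop jar Seq.jar SequenceFileWrite syslog /out/syslog.seq
```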

SequenceFile Write Example

Verify Output:

Verify the output sequence file /out/syslog.seq with the hadoop fs -cat command. With this command we can see whether it is a sequence file or not from the first three bytes (SEQ), and we can find the writable classes of the key and value, the compression type, and the codec classes used in this sequence file.
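For example (piping through head just limits the binary dump; the exact bytes depend on the file):

```shell
# the first three bytes of a valid sequence file read 'SEQ'
hadoop fs -cat /out/syslog.seq | head -c 300
```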

From the screenshot below, we can see that /out/syslog.seq is a sequence file, as its first three bytes are SEQ, and:

  • Key class – org.apache.hadoop.io.IntWritable
  • Value class – org.apache.hadoop.io.Text
  • Compression codec – org.apache.hadoop.io.compress.DefaultCodec

SequenceFile Format

Reading SequenceFile Example:

Now we will see how to read the sequence file created above through the Hadoop 2 API. We will create a SequenceFile.Reader instance and use the next(key, value) method to iterate over each record in the sequence file.

Note that in the program below, we didn't specify the compression type or codec that we used while creating the sequence file. By default, the reader instance picks up these details from the file header and decompresses the file with the codec it finds there. Also note that we used the getKeyClass() and getValueClass() methods on the reader instance to retrieve the class names of the (key, value) pairs in the sequence file.

In the program below we read the contents of the sequence file and print them to the console. Copy the code snippet below into a SequenceFileRead.java program file.
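A minimal sketch of such a reader, using the Hadoop 2 Reader options (illustrative, not the post's exact code):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileRead {

  public static void main(String[] args) throws IOException {
    Path inPath = new Path(args[0]);   // e.g. /out/syslog.seq
    Configuration conf = new Configuration();

    SequenceFile.Reader reader = null;
    try {
      // no compression type or codec given; the reader picks them up from the file header
      reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(inPath));

      // the key/value classes are also read back from the file header
      Writable key =
          (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
          (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

      while (reader.next(key, value)) {   // returns false at end of file
        System.out.println(key + "\t" + value);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
```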

Compile this program, build a jar file (say Seq.jar), and use this jar file to run the SequenceFileRead program on Hadoop.

Run it with the command below.
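With the same jar name as before, the read command would look something like:

```shell
hadoop jar Seq.jar SequenceFileRead /out/syslog.seq
```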

Below is a screenshot of the first 10 lines of output from the above command.

SequenceFile Read Example

Reading SequenceFile with Command-line Interface:

There is an alternative way to view the contents of a sequence file from the command-line interface. Hadoop provides the hadoop fs -text command to display the contents of a sequence file in text format.

This command looks at a file's magic number to detect the type of the file and convert it to text appropriately. It can recognize gzipped files and sequence files; otherwise, it assumes the input is plain text.
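For example:

```shell
# decodes the sequence file and prints each record as text
hadoop fs -text /out/syslog.seq | head -10
```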

hadoop fs -text command

Hadoop Sequence Files example

In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post we will discuss the basic details and format of Hadoop sequence files, with examples.

Hadoop Sequence Files:

Advantages:

  • As binary files, these are more compact than text files.
  • They provide optional support for compression at two different levels: record and block.
  • Files can be split and processed in parallel.
  • As HDFS and MapReduce are optimized for large files, sequence files can be used as containers for large numbers of small files, thus addressing Hadoop's drawback in processing huge numbers of small files.
  • They are extensively used in MapReduce jobs as input and output formats. Internally, the temporary outputs of maps are also stored in the SequenceFile format.

Limitations:

  • Similar to other Hadoop files, SequenceFiles are append-only.
  • As these are Hadoop-specific, only a Java API is currently available to interact with sequence files; multi-language support is not yet provided.

Hadoop Sequence File Format:

Hadoop SequenceFile is a flat file consisting of binary key/value pairs. Based on compression type, there are 3 different SequenceFile formats:

  • Uncompressed format
  • Record Compressed format
  • Block-Compressed format

A sequence file consists of a header followed by one or more records. All three formats above use the same header structure, shown below.

SequenceFile Header

The first 3 bytes of a sequence file are "SEQ", which denote that the file is a sequence file, followed by 1 byte representing the version number (e.g. SEQ4 or SEQ6).

A sync marker denotes the end of the header. The sync marker permits seeking to a random point in the file, which is required to efficiently split large files for parallel processing by MapReduce.

Metadata is a secondary key-value list that can be written during the initialization of the sequence file writer.

Below are the formats of three sequence file types listed above.

Uncompressed SequenceFile Format

  • Header as shown above
  • Record
    • Record length
    • Key length
    • Key
    • Value
  • A sync marker every few hundred bytes or so.

Below is the record structure.

SequenceFile Record

Record-Compressed SequenceFile Format

  • Header as shown above
  • Record
    • Record length
    • Key length
    • Key
    • Compressed value (only values are compressed here, not keys)
  • A sync marker every few hundred bytes or so.

Block-Compressed SequenceFile Format

  • Header
  • Record Block
    • Uncompressed number of records in the block
    • Compressed key-lengths block-size
    • Compressed key-lengths block
    • Compressed keys block-size
    • Compressed keys block
    • Compressed value-lengths block-size
    • Compressed value-lengths block
    • Compressed values block-size
    • Compressed values block
  • A sync-marker after every block.

Below is the horizontal view of the record structure and block structure sequence file formats.

Sequence File Format

SequenceFiles Java API:

Apache Hadoop provides various classes to create, read, and sort SequenceFiles; below are some of the important classes for dealing with Hadoop sequence files.

  • SequenceFile – org.apache.hadoop.io.SequenceFile

This is the main class for writing/creating and reading sequence files. It provides the SequenceFile.Writer, SequenceFile.Reader and SequenceFile.Sorter classes for writing, reading and sorting respectively.

For compressed sequence file creation there are the special classes SequenceFile.RecordCompressWriter and SequenceFile.BlockCompressWriter.

But to create an instance of any of the above writer flavors, we use one of the static createWriter() methods. There are several overloaded versions, but they all require at minimum a Configuration object and varargs Writer.Option... objects to specify the options to create the file with.

In the varargs Writer.Option, we need to specify at least the file and the key and value class parameters to create the sequence file. The compression type and codec, a write-progress callback, and a Metadata instance to be stored in the SequenceFile header can be provided optionally.

Hadoop Sequence Files example call to createWriter() static Method:
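A sketch of such a call, assuming the Hadoop 2 Writer.Option helpers (the path /out/data.seq and the key/value classes here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

// create a block-compressed sequence file at /out/data.seq
Configuration conf = new Configuration();
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
    SequenceFile.Writer.file(new Path("/out/data.seq")),   // output path
    SequenceFile.Writer.keyClass(IntWritable.class),       // key class
    SequenceFile.Writer.valueClass(Text.class),            // value class
    SequenceFile.Writer.compression(                       // optional compression
        SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
```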

Once we have a SequenceFile.Writer instance, we can write key-value pairs using the append() method. After we finish writing, we need to call the close() method.

Similar to the writer instance, a SequenceFile.Reader instance is used to read sequence files, and it can read any of the SequenceFile formats created with the above writer instances.

An instance of the SequenceFile.Reader class can be created with one of its constructors.

Method definition and example call to the SequenceFile.Reader() constructor:
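A sketch of the Hadoop 2 constructor form, Reader(Configuration, Reader.Option...), with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

// open a sequence file for reading; compression details come from the file header
Configuration conf = new Configuration();
SequenceFile.Reader reader = new SequenceFile.Reader(conf,
    SequenceFile.Reader.file(new Path("/out/data.seq")));

System.out.println(reader.getKeyClass());    // key class stored in the header
System.out.println(reader.getValueClass());  // value class stored in the header
```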

Once we have a Reader instance, we can iterate over all records by repeatedly invoking one of its next() methods. Below are some of the interesting methods of the Reader class.

The next(key, value) method returns true if it reads the next key-value pair, and false when it reaches the end of the file.

The SequenceFile.Sorter class sorts key/value pairs in a sequence-format file. An instance can be created by calling one of its constructors.
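A sketch using one such constructor, Sorter(FileSystem, keyClass, valClass, conf), with illustrative input/output paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// build a sorter for IntWritable keys and Text values
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Sorter sorter = new SequenceFile.Sorter(
    fs, IntWritable.class, Text.class, conf);

// sort /out/data.seq into /out/sorted.seq
sorter.sort(new Path("/out/data.seq"), new Path("/out/sorted.seq"));
```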
