Hadoop Data Types


Hadoop provides data types based on the Writable interface for serializing and deserializing data stored in HDFS and processed in MapReduce computations.

Serialization

Serialization is the process of converting object data into a byte stream, either for transmission over the network to other nodes in the cluster or for persistent storage.

Deserialization

Deserialization is the reverse process: it converts a byte stream back into object data, for example when reading data from HDFS. Hadoop provides Writables for both serialization and deserialization.

Writable and WritableComparable Interfaces

To provide mechanisms for serialization and deserialization of data, Hadoop defines two important interfaces: Writable and WritableComparable. The Writable interface specification is as follows:
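The two methods below are the entire contract (Javadoc comments omitted):

```java
package org.apache.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    // Serialize the fields of this object to out.
    void write(DataOutput out) throws IOException;

    // Deserialize the fields of this object from in.
    void readFields(DataInput in) throws IOException;
}
```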

The WritableComparable interface is a sub-interface of Hadoop’s Writable and Java’s Comparable interfaces, and its specification is shown below:
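In recent Hadoop releases the interface is generic; it declares no methods of its own beyond those it inherits:

```java
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
```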

The standard java.lang.Comparable interface contains a single method, compareTo(), for comparing the object passed to it with the current object.

The compareTo() method returns a negative integer, zero, or a positive integer depending on whether the current object is less than, equal to, or greater than the object it is compared with.
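For instance, a minimal sketch comparing IntWritable keys (the class name CompareDemo is our own):

```java
import org.apache.hadoop.io.IntWritable;

public class CompareDemo {
    public static void main(String[] args) {
        IntWritable a = new IntWritable(1);
        IntWritable b = new IntWritable(2);
        System.out.println(a.compareTo(b));                   // negative: 1 < 2
        System.out.println(b.compareTo(a));                   // positive: 2 > 1
        System.out.println(a.compareTo(new IntWritable(1)));  // 0: equal
    }
}
```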

Both of the above interfaces are provided in the org.apache.hadoop.io package.

Constraints on Keys and Values in MapReduce

Hadoop data types used for key or value fields in MapReduce must satisfy two constraints:

  • Any data type used for a value field in mapper or reducer input/output must implement the Writable interface.
  • Any data type used for a key field in mapper or reducer input/output must implement the WritableComparable interface (which extends Writable), so that keys of this type can be compared with each other for sorting; see the mapper sketch after this list.
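A minimal word-count-style mapper illustrating both constraints (the class name WordMapper is our own): the input key LongWritable and output key Text are WritableComparable, while the values Text and IntWritable need only be Writable.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (word, 1) pairs; the Text keys are sorted before reaching reducers.
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}
```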

Writable Classes – Hadoop Data Types

Hadoop provides classes that wrap the Java primitive types and implement the Writable and WritableComparable interfaces. They are provided in the org.apache.hadoop.io package.

All the Writable wrapper classes have a get() and a set() method for retrieving and storing the wrapped value.

Primitive Writable Classes

These are Writable wrappers for the Java primitive data types; each holds a single primitive value that can be set either at construction or via a setter method.

All these primitive writable wrappers have get() and set() methods to read or write the wrapped value. Below is the list of primitive writable data types available in Hadoop.

  • BooleanWritable
  • ByteWritable
  • IntWritable
  • VIntWritable
  • FloatWritable
  • LongWritable
  • VLongWritable
  • DoubleWritable

In the above list, VIntWritable and VLongWritable hold variable-length int and long values respectively.

The serialized size of each fixed-length writable above is the same as that of the corresponding Java type: an IntWritable occupies 4 bytes and a LongWritable 8 bytes. The variable-length types occupy between 1 and 5 bytes (VIntWritable) or 1 and 9 bytes (VLongWritable), depending on the value.
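A minimal sketch verifying this with IntWritable (the class name IntWritableDemo is our own):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) throws IOException {
        IntWritable writable = new IntWritable();
        writable.set(163);                   // store a value via set()
        System.out.println(writable.get());  // retrieve it via get(): 163

        // Serialize and count the bytes: an IntWritable occupies 4.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writable.write(new DataOutputStream(buffer));
        System.out.println(buffer.size());   // 4
    }
}
```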

Array Writable Classes

Hadoop provides two array writable classes, one for one-dimensional and another for two-dimensional arrays. The elements of these arrays must themselves be Writable objects, such as IntWritable or LongWritable, not Java native types like int or float.

  • ArrayWritable
  • TwoDArrayWritable
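A minimal sketch of ArrayWritable holding IntWritable elements (the class name ArrayWritableDemo is our own):

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;

public class ArrayWritableDemo {
    public static void main(String[] args) {
        // The element type must itself be a Writable class.
        ArrayWritable array = new ArrayWritable(IntWritable.class);
        array.set(new Writable[] { new IntWritable(10), new IntWritable(20) });

        for (Writable element : array.get()) {
            System.out.println(((IntWritable) element).get()); // 10, 20
        }
    }
}
```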

Map Writable Classes

Hadoop provides the map writable classes below; the concrete ones implement the java.util.Map interface:

  • AbstractMapWritable – the abstract base class for the other map writable classes.
  • MapWritable – a general-purpose map from Writable keys to Writable values.
  • SortedMapWritable – a specialization of MapWritable that also implements the SortedMap interface.
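A minimal sketch of MapWritable using the familiar java.util.Map methods (the class name MapWritableDemo is our own):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MapWritableDemo {
    public static void main(String[] args) {
        MapWritable map = new MapWritable();
        map.put(new Text("one"), new IntWritable(1));  // Writable key and value
        map.put(new Text("two"), new IntWritable(2));

        System.out.println(map.containsKey(new Text("one"))); // true
        System.out.println(map.get(new Text("two")));         // 2
    }
}
```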

Other Writable Classes

  • NullWritable

NullWritable is a special type of Writable representing a null value. No bytes are read or written when a field’s type is NullWritable, so in MapReduce a key or a value can be declared as NullWritable when that field is not needed.
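NullWritable is a singleton obtained via its static get() method rather than a constructor; a minimal sketch (the class name NullWritableDemo is our own):

```java
import org.apache.hadoop.io.NullWritable;

public class NullWritableDemo {
    public static void main(String[] args) {
        // There is no public constructor; the single instance is shared.
        NullWritable nothing = NullWritable.get();
        System.out.println(nothing); // prints NullWritable's string form
    }
}
```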

  • ObjectWritable

This is a general-purpose generic object wrapper which can store any objects like Java primitives, String, Enum, Writable, null, or arrays.

  • Text

Text is the Writable equivalent of java.lang.String, and its maximum size is 2 GB. Unlike Java’s String, a Text object is mutable.
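A minimal sketch of Text’s mutability (the class name TextDemo is our own):

```java
import org.apache.hadoop.io.Text;

public class TextDemo {
    public static void main(String[] args) {
        Text text = new Text("hadoop");
        System.out.println(text.getLength()); // 6: byte length of the UTF-8 encoding

        text.set("mapreduce");                // reuse the same object with new contents
        System.out.println(text);             // mapreduce
    }
}
```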

  • BytesWritable

BytesWritable is a wrapper for an array of binary data.
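A minimal sketch of BytesWritable (the class name BytesWritableDemo is our own); note that getBytes() returns the backing buffer, whose length can exceed the valid data length reported by getLength():

```java
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableDemo {
    public static void main(String[] args) {
        BytesWritable bytes = new BytesWritable(new byte[] { 1, 2, 3 });
        System.out.println(bytes.getLength());       // 3 valid bytes
        System.out.println(bytes.getBytes().length); // backing buffer capacity
    }
}
```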

  • GenericWritable

It is similar to ObjectWritable but supports only a fixed set of types: users need to subclass GenericWritable and specify the types to support, as in the sketch below.
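A minimal subclass sketch (the class name MyGenericWritable is our own); getTypes() enumerates the Writable classes this wrapper may carry:

```java
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class MyGenericWritable extends GenericWritable {
    @SuppressWarnings("unchecked")
    private static final Class<? extends Writable>[] TYPES =
            new Class[] { IntWritable.class, Text.class };

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return TYPES;
    }
}
```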

Example Program to Test Writables

Let’s write a WritablesTest.java program to test most of the data types mentioned above, exercising the get(), set(), getBytes(), getLength(), put(), containsKey(), and keySet() methods.
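A minimal sketch of such a program, exercising each method in turn:

```java
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class WritablesTest {
    public static void main(String[] args) {
        // get()/set() on a primitive writable
        IntWritable number = new IntWritable();
        number.set(100);
        System.out.println("IntWritable value: " + number.get());

        // getBytes()/getLength() on Text
        Text text = new Text("hadoop");
        System.out.println("Text length: " + text.getLength());
        System.out.println("Text first byte: " + text.getBytes()[0]);

        // getBytes()/getLength() on BytesWritable
        BytesWritable bytes = new BytesWritable(new byte[] { 1, 2, 3 });
        System.out.println("BytesWritable length: " + bytes.getLength());

        // put()/containsKey()/keySet() on MapWritable
        MapWritable map = new MapWritable();
        map.put(new Text("key"), number);
        System.out.println("containsKey: " + map.containsKey(new Text("key")));
        for (Writable k : map.keySet()) {
            System.out.println(k + " -> " + map.get(k));
        }
    }
}
```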

Compile and run the above program.


We have now tested some of the built-in Hadoop data types with a few examples.

