Creating Custom Hadoop Writable Data Type


Sometimes none of the built-in Hadoop Writable data types matches our requirements. In that case, we can create a custom Hadoop data type by implementing the Writable interface or the WritableComparable interface.

Common Rules for creating custom Hadoop Writable Data Type

  • A custom Hadoop writable data type that needs to be used as a value field in MapReduce programs must implement the Writable interface (org.apache.hadoop.io.Writable).
  • MapReduce key types must be comparable to each other for sorting purposes. A custom Hadoop writable data type that is to be used as a key field in MapReduce programs must implement the WritableComparable interface, which in turn extends the Writable (org.apache.hadoop.io.Writable) and Comparable (java.lang.Comparable) interfaces.
  • In other words, a data type created by implementing the WritableComparable interface can be used for either key or value fields.

Since a data type implementing WritableComparable can be used for key or value fields in MapReduce programs, let's define a custom data type that can be used for both. In this post, let's create a custom data type to process web logs from a server and count the occurrences of each IP address. In this sample, we consider a web log record with five fields – Request No, Site URL, Request Date, Request Time and IP Address. A sample record from the web log file is shown below.
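
For illustration, a record of this format might look like the following (made-up values with whitespace-separated fields, not an actual line from Web_Log.txt):

1   www.example.com/index.html   2013-10-15   09:12:34   192.168.56.1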

We can treat the fields of the above record as built-in Writable data types that together form a new custom data type: the Request No as an IntWritable and the other four fields as Text data types. The complete input file Web_Log.txt used in this post is attached here

Creating Custom Hadoop Writable Data Type

Let's create a WebLogWritable data type to serialize and deserialize the above-mentioned web log record.
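
A minimal sketch of such a class is given below (shown standalone for readability; the field names and the set()/getIP() signatures are illustrative assumptions based on the description that follows):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class WebLogWritable implements WritableComparable<WebLogWritable> {

    private IntWritable reqNo = new IntWritable();
    private Text siteURL = new Text();
    private Text reqDate = new Text();
    private Text reqTime = new Text();
    private Text ipAddress = new Text();

    // Default constructor -- the framework instantiates the object with it
    // and then fills in the fields by calling readFields().
    public WebLogWritable() {
    }

    // Custom constructor to set the object fields directly.
    public WebLogWritable(IntWritable reqNo, Text siteURL, Text reqDate,
                          Text reqTime, Text ipAddress) {
        set(reqNo, siteURL, reqDate, reqTime, ipAddress);
    }

    // Repopulate a reused (mutable) instance.
    public void set(IntWritable reqNo, Text siteURL, Text reqDate,
                    Text reqTime, Text ipAddress) {
        this.reqNo = reqNo;
        this.siteURL = siteURL;
        this.reqDate = reqDate;
        this.reqTime = reqTime;
        this.ipAddress = ipAddress;
    }

    public Text getIP() {
        return ipAddress;
    }

    // Serialize the fields to the output stream.
    public void write(DataOutput out) throws IOException {
        reqNo.write(out);
        siteURL.write(out);
        reqDate.write(out);
        reqTime.write(out);
        ipAddress.write(out);
    }

    // Deserialize the fields from the input stream, in the same order.
    public void readFields(DataInput in) throws IOException {
        reqNo.readFields(in);
        siteURL.readFields(in);
        reqDate.readFields(in);
        reqTime.readFields(in);
        ipAddress.readFields(in);
    }

    // Sort order: first by IP address, then by request time.
    public int compareTo(WebLogWritable o) {
        int cmp = ipAddress.compareTo(o.ipAddress);
        return (cmp != 0) ? cmp : reqTime.compareTo(o.reqTime);
    }

    // Two records are equal when both IP address and request time match.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof WebLogWritable)) {
            return false;
        }
        WebLogWritable w = (WebLogWritable) o;
        return ipAddress.equals(w.ipAddress) && reqTime.equals(w.reqTime);
    }

    // HashPartitioner uses hashCode() to pick the reduce partition,
    // so partitioning is driven by the IP address.
    @Override
    public int hashCode() {
        return ipAddress.hashCode();
    }
}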

Let's briefly discuss the methods written in this WebLogWritable class and their purpose.

All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them and then populate their fields by calling readFields(). The write() method serializes the fields in the same order. Because Writable instances are mutable and often reused, we have also provided a custom constructor and a set() method to (re)populate the object's fields.

The set() and getIP() methods are setter and getter methods to store and retrieve data. The compareTo() method returns a negative integer, zero, or a positive integer if our object is less than, equal to, or greater than the object being compared to, respectively.

In the equals() method, we consider two objects equal if both their IP addresses and time-stamps are the same. In compareTo(), if the objects are not equal, we decide the sort order first by IP address and then by time-stamp.

The hashCode() method is used by the HashPartitioner, the default partitioner in MapReduce, to choose a reduce partition. Using the IP address in our hashCode() method ensures that the intermediate WebLogWritable data is partitioned by the requesting host's IP address.

Creating Mapper Class With Custom Data Types

Below is the implementation of the WebLogMapper class, which reads Web_Log.txt records, tokenizes them, and writes each web log record with a count of 1.
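
A minimal sketch of this mapper (shown standalone for readability; the whitespace-separated field order Request No, Site URL, Request Date, Request Time, IP Address is an assumption):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WebLogMapper
        extends Mapper<LongWritable, Text, WebLogWritable, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    // Reusable objects -- Writables are mutable, so we repopulate them per record.
    private WebLogWritable wLog = new WebLogWritable();
    private IntWritable reqNo = new IntWritable();
    private Text url = new Text();
    private Text date = new Text();
    private Text time = new Text();
    private Text ip = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize one web log record.
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length < 5) {
            return; // skip malformed records
        }
        reqNo.set(Integer.parseInt(fields[0]));
        url.set(fields[1]);
        date.set(fields[2]);
        time.set(fields[3]);
        ip.set(fields[4]);
        wLog.set(reqNo, url, date, time, ip);

        // Emit each web log record with a count of 1.
        context.write(wLog, ONE);
    }
}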

Here the mapper output key is WebLogWritable and the output value is IntWritable, which carries the count.

Creating Reducer Class With Custom Data Types

Below is the implementation of the WebLogReducer class, which accumulates the count values for each web log record and emits the record's IP address along with its total number of occurrences in Web_Log.txt.
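
A minimal sketch of this reducer (shown standalone for readability):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WebLogReducer
        extends Reducer<WebLogWritable, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(WebLogWritable key, Iterable<IntWritable> values,
                       Context context) throws IOException, InterruptedException {
        // Sum the counts received for this web log key.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);

        // Output key is the IP address (Text), output value is the total count.
        context.write(key.getIP(), result);
    }
}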

Here the reducer input key is WebLogWritable and the input value is IntWritable, while the output key is Text and the output value is IntWritable.

Creating MapReduce Driver Class With Custom Data Types

In this example, we have to set the map output key class to WebLogWritable (via setMapOutputKeyClass()) in the driver class; the rest of the implementation is as usual.

So, let's create a WebLogReader.java file with the custom writable class, mapper class and reducer class, plus the driver shown below, to test the functionality of the custom writable data type WebLogWritable.

For ease of compilation and maintenance, we have written the WebLogWritable, WebLogMapper and WebLogReducer classes as static nested classes and kept a single public WebLogReader main class in our Java source file.
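
A minimal sketch of the driver part of that file might look like the following (in the actual file, the WebLogWritable, WebLogMapper and WebLogReducer classes above sit inside WebLogReader as static nested classes; the job name and the use of command-line arguments for the input and output paths are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WebLogReader {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "WebLog Reader");
        job.setJarByClass(WebLogReader.class);

        job.setMapperClass(WebLogMapper.class);
        job.setReducerClass(WebLogReducer.class);

        // Map output key is the custom writable; map output value is the count.
        job.setMapOutputKeyClass(WebLogWritable.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final output is the IP address (Text) and its total count (IntWritable).
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}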

Run WebLogReader.java program
  • Compile the WebLogReader.java file and create a jar file from the compiled classes, as shown below.

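The commands would look roughly like the following (the class directory and jar name are illustrative, and the Hadoop classpath invocation may differ by version):

javac -classpath $(hadoop classpath) -d weblog_classes WebLogReader.java
jar -cvf weblog.jar -C weblog_classes/ .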

  • Copy the input file Web_Log.txt into HDFS and execute the jar file with the WebLogReader driver class, as shown below.

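Roughly like this (the HDFS paths are illustrative):

hadoop fs -put Web_Log.txt /in/Web_Log.txt
hadoop jar weblog.jar WebLogReader /in/Web_Log.txt /out/weblog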

Validate the results

Let's verify the results of the above MapReduce job with the help of the ‘head’ Unix command, which displays only the first 10 records from the output file.
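
For example (assuming the illustrative output directory used above):

hadoop fs -cat /out/weblog/part-r-00000 | head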

Thus, we have successfully created a custom WebLogWritable data type and used it to read web log records and generate counts of each IP address in the web log file.



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



9 thoughts on “Creating Custom Hadoop Writable Data Type”

  • somesh

    Thank you for the post. It is very detailed. I have a question: in the mapper we can get the website URL (the second attribute) by parsing the line and pass it as the key to achieve the same result. So why do we need to use a custom key in this case?

     

    Thanks

    • Siva (Post author)

      Hi Somesh, you are correct. We just used this example to show passing more than one field at a time to the reducer. If we need any other field apart from the IP or URL from the web log record (multiple fields in the map output key), then we definitely need a custom writable. In the example above, just for ease, we have used only one field.

  • somesh

    Hi,

    Can you please post (or send an email) about few of the real time errors in map reduce and the resolutions for them. Also if possible real time scenarios will help us.

    Thanks
    Somesh

  • akshay

    Hi Siva,

    Great post, it’s very helpful for beginners.

    I have one doubt :

    When will the compareTo(), hashCode() and equals() methods of WebLogWritable be called?

    Will these methods be executed in the shuffle and sort phase after all the maps have completed?

    Please explain the complete flow.

    Thanks in advance…

  • sravani

    Hi, I implemented the same program and am getting an EOF exception. Here are the exception details:

    java.lang.RuntimeException: java.io.EOFException at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:135)

  • Steve brad

    Nice post. I am a beginner in MapReduce. Can you explain in detail how the compareTo, equals and hashCode methods work?

    thank you.

  • himanshu

    Hi Shiva,
    I am really happy to see all your posts and the clarifications you provide in a very precise way, but I have a question I would like to share with you.
    I went for an interview last weekend and they asked me to write a program; some data was given in the below format:
    Name age salary
    Him 15 20k
    tim 35 20k
    kim 25 20k
    bim 11 20k
    sim 40 20k
    lim 21 20k
    rim 45 20k

    So they asked me to output top 10 salaried record for each age group if(age>10 && age=20 && age=30) return 3; but the output file for each age group only should contain top 10 salaried record Please help me with the logic .

