MapReduce Programming Model 1


In this post, we will review the building blocks and programming model of the example MapReduce word count program that we ran in the previous post in this MapReduce category. We will not go too deep into the code; our focus will mainly be on the structure of a MapReduce program written in Java, and at the end of the post we will submit a MapReduce job to execute this program.

Before starting with the word count program structure, let us have an overview of the parts of a basic MapReduce program.

Overview:

MapReduce is a framework for processing big data in two phases: Map and Reduce. Both phases take key-value pairs as input and produce key-value pairs as output.

The Map phase runs the mapper function, in which user-provided code is executed on each key-value pair (k1, v1) read from the input files. The output of the mapper function is zero or more key-value pairs (k2, v2), called intermediate pairs. Here the key is what the data will be grouped on, and the value is the information the reducer needs for its analysis.

The Reduce phase takes the mapper output, grouped by key (k2, list of v2), and runs the reduce function on each key-value group. The reduce function iterates over the list of values associated with a key and produces outputs such as aggregations and statistics. Once the reduce function is done, it sends zero or more key-value pairs (k3, v3) to the final output file.
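For instance, with word count (the example this post builds toward) and a made-up input line, the data flows like this:

input record (k1, v1):    (0, "to be or not to be")
map output (k2, v2):      ("to",1) ("be",1) ("or",1) ("not",1) ("to",1) ("be",1)
grouped for reduce:       ("be",[1,1]) ("not",[1]) ("or",[1]) ("to",[1,1])
reduce output (k3, v3):   ("be",2) ("not",1) ("or",1) ("to",2)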

By default, Mapreduce input and output file formats are text file formats.

MapReduce Programming Model in Java:

In order to express the above functionality in code, we need three things: a map() function, a reduce() function, and some driver code to run the job. A map() method is defined in the Mapper class and a reduce() method in the Reducer class; these Mapper and Reducer classes are provided by the Hadoop Java API.

Any specific mapper/reducer implementation should subclass these classes and override the map() and reduce() methods.

Below is the code snippet for sample mapper & reducer classes:
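The original snippet appears as an image in this post; the sketch below shows the general structure, with the SampleMapper/SampleReducer names and the LongWritable/Text/IntWritable types chosen purely for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Generic shape of a mapper: the type parameters fix the input and
// intermediate key-value types; map() is called once per input record.
class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // process one (k1, v1) record and emit zero or more (k2, v2) pairs;
        // here we simply emit the whole line with a count of 1 for illustration
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}

// Generic shape of a reducer: reduce() is called once per distinct key with
// an iterable of all values grouped under that key.
class SampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                          // aggregate the grouped values
        }
        context.write(key, new IntWritable(sum));    // emit (k3, v3)
    }
}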

The driver is the main part of a MapReduce job: it communicates with the Hadoop framework and specifies the configuration elements needed to run the job. It is where the programmer specifies which mapper and reducer classes the job should run, along with the input/output file paths and their formats. There is no default parent Driver class provided by Hadoop; the driver logic usually lives in the main() method of the job class.

Below is the code snippet of example driver:
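The original snippet is also an image; a minimal driver sketch (SampleDriver is a placeholder name, and the input/output paths come from the command line) might look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SampleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "sample job");   // job name is arbitrary
        job.setJarByClass(SampleDriver.class);           // locates the jar to ship to the cluster
        job.setMapperClass(SampleMapper.class);
        job.setReducerClass(SampleReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // job input location(s)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // job output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
    }
}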

At a high level, the user of the MapReduce framework needs to specify the following:

  • The job’s input location(s) in the distributed file system.
  • The job’s output location in the distributed file system.
  • The input format.
  • The output format.
  • The class containing the map function.
  • The class containing the reduce function (optional).
  • The JAR file containing the mapper, reducer and driver classes.

Now, let’s dive into the example Word Count program.

Example Mapreduce Program Word Count:

Mapper class definition is as follows:
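The class definition in the original post is a screenshot; the sketch below follows the standard Hadoop word count example, using the WordcountMapper name from the compile step later in this post:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // split the input line into tokens and emit (word, 1) for each token
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}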

Reducer class definition is as follows:
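Again shown as a screenshot in the original; a matching sketch of the reducer:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // sum all the 1s emitted for this word by the mappers
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit (word, total count)
    }
}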

Main Driver Class definition is as follows:
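A matching sketch of the driver, again following the standard Hadoop example (the combiner line is an optional optimization that reuses the reducer for local aggregation):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordcountMapper.class);
        job.setCombinerClass(WordcountReducer.class);   // optional local aggregation
        job.setReducerClass(WordcountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from command line
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from command line
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}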

Perform the steps below, in order, to execute the MapReduce job and view its results.

Compile these programs and save the class files into a working directory named “wordcount”. In this example, the programs are written in the “project” directory and compiled into the “project/wordcount” directory.

In order to compile these programs successfully, we need to include the Hadoop Java API libraries in the classpath argument of the javac compiler.

Setup Classpath:

  • To get the list of Hadoop libraries to be included in the classpath, issue the hadoop classpath command (see the sketch below).

  • Add these Hadoop classpath libraries to the $CLASSPATH environment variable in the .bashrc file. Open .bashrc with gedit and append the export line, as sketched below.

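A sketch of these two steps (the actual list printed by hadoop classpath depends on your installation):

$ hadoop classpath                     # prints Hadoop's jar directories, colon-separated
$ gedit ~/.bashrc                      # open .bashrc and append the line below
export CLASSPATH=$CLASSPATH:$(hadoop classpath)
$ source ~/.bashrc                     # reload the environment in the current shell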

Compile Java Programs:

  • Create a new directory “wordcount” under our current project directory to hold the class files of WordcountMapper.java, WordcountReducer.java and WordCount.java.

  • Compile the above three programs with the -classpath and -d compiler options to specify the classpath and the output directory for the class files, as sketched below.
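Assuming the three .java files sit in the project directory, the commands look roughly like this:

$ cd project
$ mkdir -p wordcount
$ javac -classpath $CLASSPATH -d wordcount \
    WordcountMapper.java WordcountReducer.java WordCount.java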

Build JAR File:

  • Since Hadoop executes MapReduce jobs in the form of JAR files, we need to create a wordcount.jar file from the class files under the wordcount directory, as sketched below.
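For example, packaging everything under the wordcount directory into wordcount.jar:

$ jar -cvf wordcount.jar -C wordcount/ .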

Run Mapreduce Job:

  • Create a sample input file in the local file system and copy it to HDFS for testing the WordCount program, for example as below.
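A sketch of this step (the file name, contents and HDFS path below are just placeholders):

$ echo "to be or not to be" > wordcount.txt        # sample input, contents are arbitrary
$ hdfs dfs -mkdir -p /input
$ hdfs dfs -put wordcount.txt /input/wordcount.txt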

  • Submit the MapReduce job with the hadoop jar command. The syntax of the hadoop jar command is as follows.
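hadoop jar <jar file> <driver class> [arguments ...]

For our job this would be something like the following (the input path is the placeholder used above, and the /output/wordcount directory must not already exist):

$ hadoop jar wordcount.jar WordCount /input/wordcount.txt /output/wordcount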


Note the counters reported by the job in the console output once it completes.

Validate Results:

  • Let’s check the results of the wordcount job in the /output/wordcount directory, as sketched below.
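For example (the part file name varies with the number of reducers):

$ hdfs dfs -ls /output/wordcount                   # _SUCCESS marker plus part-r-* files
$ hdfs dfs -cat /output/wordcount/part-r-00000     # word <TAB> count, sorted by word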

So, we have coded the sample word count MapReduce program, executed it successfully, and verified the results.

Note:

  • Notice in the above example that the code is written to process single records; the framework is responsible for all the work required to turn an enormous data set into a stream of key-value pairs. We never have to write map or reduce classes that deal with the full data set.
  • Throughout this post only the new MapReduce API (in the org.apache.hadoop.mapreduce package) is used; the old API lives in the org.apache.hadoop.mapred package and can be referenced if you are interested.

Sources for this post:

The code in this post is sourced from the “Word Count” example program shipped with Hadoop release 2.3.0.


