MapReduce Programming Model


In this post, we are going to review the building blocks and programming model of the example MapReduce word count program we ran in the previous post in this MapReduce category. We will not go too deep into the code; our focus will mainly be on the structure of a MapReduce program written in Java, and at the end of the post we will submit a MapReduce job to execute this program.

Before starting with the word count MapReduce program structure, let's take an overview of the parts of a basic MapReduce program.

Overview:

MapReduce is a framework for processing big data in two phases: Map and Reduce. Both phases take key-value pairs as input and produce key-value pairs as output.

The Map phase runs the mapper function, in which user-provided code is executed on each key-value pair (k1, v1) read from the input files. The output of the mapper function is zero or more key-value pairs (k2, v2), called intermediate pairs. Here the key is what the data will be grouped on, and the value is the information the reducer needs for its analysis.

The Reduce phase takes the mapper output, grouped by key into (k2, list(v2)) pairs, and runs the reduce function on each group. The reduce function iterates over the list of values associated with a key and produces outputs such as aggregations and statistics. Once the reduce function is done, it writes zero or more key-value pairs (k3, v3) to the final output file.
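As a quick illustration with word count: given the input lines "hello world" and "hello hadoop", the map phase emits the intermediate pairs (hello, 1), (world, 1), (hello, 1), (hadoop, 1); the framework groups them by key into (hello, [1, 1]), (world, [1]), (hadoop, [1]); and the reduce phase sums each list of values to produce the final pairs (hello, 2), (world, 1), (hadoop, 1).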

By default, MapReduce input and output file formats are text file formats (TextInputFormat and TextOutputFormat).
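If other formats are needed, they can be set on the Job object in the driver. As a sketch, the defaults above could also be set explicitly like this (both classes come with the Hadoop MapReduce library; the job variable is created in the driver, shown later):

    // Make the default text formats explicit on the job.
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.class);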

MapReduce Programming Model in Java:

In order to express the above functionality in code, we need three things: a map() function, a reduce() function, and some driver code to run the job. A base map() function is present in the Mapper class and a base reduce() function in the Reducer class; these Mapper and Reducer classes are provided by the Hadoop Java API.

Any specific mapper/reducer implementation should subclass these classes and override the map() and reduce() functions.

Below is a code snippet for sample mapper and reducer classes:
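A minimal sketch, assuming the newer org.apache.hadoop.mapreduce API; the class names SampleMapper and SampleReducer and the concrete Writable key-value types are illustrative stand-ins for (k1, v1), (k2, v2), and (k3, v3):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Type parameters are <input key, input value, output key, output value>,
    // i.e. <k1, v1, k2, v2> for the mapper and <k2, v2, k3, v3> for the reducer.
    class SampleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // User code goes here: turn each input (k1, v1) pair into zero or
            // more intermediate (k2, v2) pairs via context.write(k2, v2).
        }
    }

    class SampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // User code goes here: iterate over the values grouped under this key
            // and emit zero or more final (k3, v3) pairs via context.write(k3, v3).
        }
    }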

The driver is the main part of a MapReduce job: it communicates with the Hadoop framework and specifies the configuration elements needed to run the job. It is where the programmer specifies which mapper and reducer classes the job should run, as well as the input/output file paths and their formats. There is no default parent Driver class provided by Hadoop; the driver logic usually lives in the main() method of the job class.

Below is a code snippet of an example driver:
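A minimal sketch, assuming the SampleMapper and SampleReducer classes above; the class name SampleDriver and the use of command-line arguments for the input/output paths are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SampleDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "sample job");

            // The JAR containing these classes, and the mapper/reducer to run.
            job.setJarByClass(SampleDriver.class);
            job.setMapperClass(SampleMapper.class);
            job.setReducerClass(SampleReducer.class);

            // Types of the final (k3, v3) output pairs.
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output locations in the distributed file system.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job and wait for it to finish.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }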

At a high level, the user of the MapReduce framework needs to specify the following:

  • The job’s input location(s) in the distributed file system.
  • The job’s output location in the distributed file system.
  • The input format.
  • The output format.
  • The class containing the map function.
  • The class containing the reduce function (optional).
  • The JAR file containing the mapper, reducer, and driver classes.

Now, let’s dive into the example Word Count program.

Example MapReduce Program Word Count:

The Mapper class definition is as follows:
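A sketch along the lines of the standard Hadoop word count example (the class name WordCountMapper is illustrative): the input key is the byte offset of the line in the file and the value is the line of text; for every token on the line the mapper emits an intermediate (word, 1) pair.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the input line into tokens and emit (word, 1) for each one.
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }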