Combiner in Mapreduce


Combiners In Mapreduce

A combiner is a semi-reducer in MapReduce. It is an optional class that can be specified in the MapReduce driver class to process the output of the map tasks before it is submitted to the reduce tasks.

Purpose

In the MapReduce framework, the output of the map tasks is usually large, so the volume of data transferred between the map and reduce tasks is high. Because transferring data across the network is expensive, this volume needs to be limited, and that is where the combiner comes in.

A combiner function summarizes the map output records that share the same key, and the combiner's output, rather than the raw map output, is sent over the network as input to the actual reduce task.
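To make this concrete, here is a minimal plain-Java sketch (no Hadoop dependencies; the class and method names are illustrative, not from the actual WordCount code) of how a combiner collapses the map output records that share a key before they cross the network:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Collapse local (word, 1) map outputs by key, the way a combiner would,
    // before the records are shipped to the reducer.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "this file this word this file word".split(" ")) {
            mapOutput.add(Map.entry(word, 1));
        }
        Map<String, Integer> combined = combine(mapOutput);
        // 7 raw map records shrink to 3 combined records crossing the network.
        System.out.println(mapOutput.size() + " -> " + combined.size() + " " + combined);
    }
}
```

Here seven (word, 1) records become three summarized records, which is exactly the network saving the combiner exists for.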

Further details

The combiner does not have an interface of its own; it must implement the Reducer interface, and its reduce() method is called for each map output key. The combiner class's reduce() method must have the same input and output key-value types as the reducer class.

Combiner functions are well suited to producing summary information from a large data set, because the combiner replaces the original set of map outputs, ideally with fewer or smaller records.

Hadoop does not guarantee how many times a combiner function will be called for each map output key. It may not be executed at all, or it may be called once or several times, depending on the size and number of output files generated by the mapper for each reducer.
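This non-determinism is exactly why a sum-style combiner is safe: applying it zero, one, or several times does not change the final totals. A plain-Java sketch of that property (illustrative names, no Hadoop dependencies):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerIdempotence {
    // Sum the values per key -- the same logic serves as combiner and reducer.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> records) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> r : records) {
            out.merge(r.getKey(), r.getValue(), Integer::sum);
        }
        return out;
    }

    // Turn an intermediate result back into (key, value) records.
    public static List<Map.Entry<String, Integer>> toRecords(Map<String, Integer> m) {
        return new ArrayList<>(m.entrySet());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = new ArrayList<>();
        for (String w : "file this this word file this word".split(" ")) {
            raw.add(Map.entry(w, 1));
        }
        // Combiner applied zero, one, or two times before the final reduce:
        Map<String, Integer> zero  = sumByKey(raw);
        Map<String, Integer> once  = sumByKey(toRecords(sumByKey(raw)));
        Map<String, Integer> twice = sumByKey(toRecords(sumByKey(toRecords(sumByKey(raw)))));
        // The totals are identical no matter how often the combine step ran.
        System.out.println(zero.equals(once) && once.equals(twice));
    }
}
```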

It is common practice to reuse the reducer class as the combiner class, but in some cases this leads to undesired results. A combiner function must only aggregate values; it is very important that the combiner class have no side effects, and that the actual reducer be able to process the combiner's results correctly.

Note: 

  • When using the same reducer class as the combiner class, if the job's output looks wrong, try running the job without the combiner and compare the output.

Use of Combiner in Mapreduce Word Count program

A classic example of a combiner in MapReduce is the Word Count program, where the map task tokenizes each line of the input file and emits a (word, 1) pair for each word in the line. The reduce() method simply sums the integer count values associated with each map output key (word).
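The actual WordCount mapper and reducer classes are in the previous post; as a rough plain-Java sketch of the same logic (illustrative names, no Hadoop types):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountLogic {
    // map(): emit a (word, 1) pair for each token in the line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce(): sum the count values collected for one key.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // Group the map output by key, then reduce each group.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("this file has this word")) {
            groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        groups.forEach((word, vals) -> System.out.println(word + "\t" + reduce(vals)));
    }
}
```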

Since the combiner must have the same interface as the reducer, let's use our WordcountReducer class as the combiner class and verify the output results.

Compile the above Java program, build a jar file, and run the MapReduce job with commands similar to the ones below.

[Screenshot: compile wordcount & jar]

Run the mapreduce job and verify the results:

Whereas the actual output received without the combiner class is as follows (referenced from our previous post):

So here the value for the keys (words) 'file', 'this' and 'word' is now incorrectly reported as 1 instead of 2, 3 and 2 respectively. This is because of the way the reduce() method used by the combiner is implemented.

The map output from our mapper for these three words looks like this:

[Screenshot: Combiner Output]

[Screenshot: Reducer Output]

As implemented, the final output of the reduce() method is just a count of the integer values associated with each key, not the sum of those values.

Let’s review the code snippet of reduce() method in our WordcountReducer Class from our previous post.
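The original snippet is not reproduced here, but the behaviour described, incrementing a counter once per value instead of adding the values, can be sketched in plain Java (illustrative names, no Hadoop types) to show exactly how the wrong counts arise:

```java
import java.util.List;

public class BuggyCombinerDemo {
    // Buggy reduce: counts how many values there are instead of summing them.
    public static int buggyReduce(List<Integer> values) {
        int count = 0;
        for (int v : values) count += 1; // bug: ignores v, just counts occurrences
        return count;
    }

    public static void main(String[] args) {
        // 'file' appears twice, so the mapper emits (file, 1) and (file, 1).
        int combined = buggyReduce(List.of(1, 1));      // combiner emits (file, 2) -- looks right
        int finalOut = buggyReduce(List.of(combined));  // reducer sees one value, emits (file, 1)
        System.out.println(combined + " then " + finalOut);
    }
}
```

Without the combiner the reducer happens to produce the right answer, because every value is 1 and counting them equals summing them; with the combiner in the middle, the reducer counts a single pre-aggregated value and emits 1.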

Since the counter is incremented by 1 for every value associated with a key, instead of summing the actual values, the results are wrong. This issue is fixed in the next section.

Fix Combiner in Mapreduce Word Count program

To fix the above combiner issue, let's modify the reduce() method in our combiner/reducer class WordcountReducer.java. Copy WordcountReducer.java to WordcountReducer2.java and modify the reduce() method as shown below.
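The modified method itself appears in the screenshot; the essence of the fix, summing the actual values rather than counting them, can be sketched in plain Java (illustrative names, no Hadoop types) as:

```java
import java.util.List;

public class FixedReduceDemo {
    // Fixed reduce: add each value itself, so the method is safe to use
    // as both the combiner and the reducer.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v; // was effectively: sum += 1
        return sum;
    }

    public static void main(String[] args) {
        int combined = reduce(List.of(1, 1));     // combiner emits (file, 2)
        int finalOut = reduce(List.of(combined)); // reducer now also emits (file, 2)
        System.out.println(finalOut);
    }
}
```

Because summation is associative and commutative, the result is the same whether the combiner runs zero times, once, or several times.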

Then change the driver class to use WordcountReducer2.class as both the combiner and the reducer class. Copy WordCountWithCombiner.java to WordCountWithFixedCombiner.java and make the highlighted changes below.

Compile the above two programs, build a jar file, and run the MapReduce job as shown in the screenshot below:

[Screenshot: WordCountWithFixedCombiner]

Validate Results

In the screenshot below, we can verify the results of the word count MapReduce job with the combiner issue fixed.

[Screenshot: FixCombinerOutput]

Thus a combiner in MapReduce can be used safely for aggregation functions like summation, but care is needed in other cases.


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

