Combiners in MapReduce

A combiner is a semi-reducer in MapReduce. It is an optional class that can be specified in the MapReduce driver class to process the output of the map tasks before it is submitted to the reduce tasks.

Purpose

In the MapReduce framework, the output of the map tasks is usually large, so the volume of data transferred between the map and reduce tasks can be high. Because transferring data across the network is expensive, the combiner is used to limit the amount of data moved between the map and reduce tasks.

Combiner functions summarize the map output records that share the same key, and the combiner's output is sent over the network to the actual reduce task as input. For example, if a map task emits ("hadoop", 1) three times, the combiner can replace those three records with a single ("hadoop", 3) record.

Further details

The combiner does not have its own interface; it must implement the Reducer interface, and its reduce() method is called on each map output key. The combiner class's reduce() method must have the same input and output key-value types as the reducer class.
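Below is a minimal sketch of such a combiner, assuming word-count style (Text, IntWritable) key-value types and the newer org.apache.hadoop.mapreduce API, in which Reducer is a base class to extend rather than an interface; the class name SumCombiner is illustrative, not from the article.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// The input key-value types (Text, IntWritable) must match the map output
// types, and the output key-value types must match the reducer's input types.
public class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();          // aggregate the partial counts for this key
        }
        result.set(sum);
        context.write(key, result);      // emit one summarized record per key
    }
}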

Combiner functions are suitable for producing summary information from a large data set because the combiner replaces the original set of map outputs, ideally with fewer or smaller records.

Hadoop does not guarantee how many times a combiner function will be called for each map output key. It may not be executed at all, or it may be called once, twice, or more times, depending on how many intermediate (spill) files the map task generates for each reducer.

It is common practice to use the same reducer class as the combiner class, but in some cases this leads to undesired results. The combiner function must only aggregate values in a way that can be applied any number of times without changing the final result: summing or taking a maximum is safe, for example, but computing an average in the combiner is not, because an average of partial averages is not the overall average. It is very important that the combiner class has no side effects and that the actual reducer can correctly process the output of the combiner.

Note:

  • When using the same reducer class as the combiner class, if the job's output has problems, try running the job without the combiner and compare the output.

Use of a Combiner in the MapReduce Word Count Program

A classic example of a combiner in MapReduce is the Word Count program, where the map task tokenizes each line of the input file and emits a (word, 1) pair for each word in the line. The reduce() method simply sums the integer count values associated with each map output key (word).
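The article names the WordcountReducer class; the mapper class name (WordcountMapper) and the file layout shown here are assumptions. A minimal sketch of the two classes, each in its own source file, might look like this:

// WordcountMapper.java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tokenizes each input line and emits a (word, 1) pair per word occurrence.
public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// WordcountReducer.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each word; this has the same shape as the combiner
// sketch shown earlier, which is why the same class can serve both roles.
public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}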

Since the combiner must have the same interface as the reducer, let's use our WordcountReducer class as the combiner class and verify the output results.
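In the driver, that amounts to registering WordcountReducer as both the combiner and the reducer. A minimal driver sketch is shown below; the driver class name, job name, and use of command-line arguments for the input and output paths are assumptions.

// WordcountDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures and submits the word count job, reusing the reducer as the combiner.
public class WordcountDriver {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordcountDriver.class);

        job.setMapperClass(WordcountMapper.class);
        job.setCombinerClass(WordcountReducer.class);   // optional: reuse the reducer as the combiner
        job.setReducerClass(WordcountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}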

Compile the above Java program, build a jar file, and execute the MapReduce job with commands similar to the ones below.

Compile the word count classes and build the jar:
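The exact commands depend on the Hadoop installation; a sketch using the hadoop classpath helper, with the source file and jar names assumed, might look like this:

mkdir -p wordcount_classes
javac -classpath "$(hadoop classpath)" -d wordcount_classes WordcountDriver.java WordcountMapper.java WordcountReducer.java
jar -cvf wordcount.jar -C wordcount_classes .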

Run the MapReduce job and verify the results:
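Assuming the input text is already in HDFS, the job could be submitted and its output inspected with commands along these lines; the input and output paths are placeholders:

hadoop jar wordcount.jar WordcountDriver /user/hadoop/wordcount/input /user/hadoop/wordcount/output
hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000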