Mapreduce Program to calculate Missing Count

Use Case Description:

This post describes an approach to a use case in which an input file contains several columns, with their values stored as records. Some of these columns may hold blanks or nulls instead of actual values, i.e. data is missing for some columns.

The developer needs to write a Mapreduce program that calculates the missing count and the missing percentage for each input column.

For example, consider the text file below as the input; the reducer output should list each column together with its missing count and percentage.

Input file –>    missinginput


We need to write a Mapreduce program to calculate the missing count and percentage for each column of the input file provided above.

How to Do It:
  • In the Mapper class, each line's offset is the map input key and the line text is the map input value.
  • In the map() method of our custom mapper class, we can parse the input value into separate columns with the split() method of the String class.
  • To extract the header columns, we can override the default run() method of the Mapper class so that it converts the first input line to a string and parses it into column names. These names can be stored in a string array and used in the map() method when writing the missing counts.
  • In the map() method, each column name is the map output key. For each missing cell we emit an IntWritable value of 1, and for each non-missing cell a value of 0.
  • In our custom reducer class, the column names are the input keys and the 0/1 counts are the input values.
  • The reducer output needs three header fields: ColumnName, MissingCount and Percentage. This header line can be written by overriding the run() method of the reducer class.
  • Summing the reducer input values for a key gives the total missing count for that column, and counting the input values for that key gives the total number of records for that column.
  • From the missing count and the total record count of each column, we can calculate the percentage of missing values for that column.
  • Each column name is written as the Text reduce output key. The MissingCount and Percentage values need to be combined into a single field to be written as the reducer output value.
  • As the reducer supports only one Writable type for its output value, we can combine the two values into a String with StringBuilder and assign that string to a Text writable. The Text writable is then written as the output value, which gives us multiple values in the reducer output.
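The grouping and summing described in the steps above can be sketched without any Hadoop classes. This is only an illustration under stated assumptions, not the post's reducer: the class name ReduceStepSketch, the helper pair() and the sample pairs are hypothetical, and a TreeMap stands in for the shuffle that delivers all values of one key together.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Framework-free sketch of the reduce side; no Hadoop classes are used.
final class ReduceStepSketch {

    // Hypothetical helper that builds one (columnName, flag) pair.
    static Map.Entry<String, Integer> pair(String column, int flag) {
        return new AbstractMap.SimpleEntry<String, Integer>(column, flag);
    }

    // Returns columnName -> {missingCount, totalCount}.
    static Map<String, long[]> reduceAll(List<Map.Entry<String, Integer>> pairs) {
        // Shuffle stand-in: collect every flag under its column name.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        // Reduce step: sum of flags = missing count, number of flags = total count.
        Map<String, long[]> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            long missing = 0;
            long total = 0;
            for (int flag : group.getValue()) {
                missing += flag;
                total++;
            }
            result.put(group.getKey(), new long[] {missing, total});
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
                pair("address", 1), pair("name", 0),
                pair("address", 0), pair("name", 0),
                pair("address", 1), pair("name", 0));
        reduceAll(pairs).forEach((column, acc) ->
                System.out.println(column + "\t" + acc[0] + "\t" + acc[1]));
        // prints: address 2 3, then name 0 3 (tab separated)
    }
}
```

Emitting 0 for non-missing cells is what makes the total count available in the reducer: every record contributes one value per column, so counting the values gives the number of records.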

Below is the implementation of our custom Mapper class – MissingCountMapper

Observe the highlighted code in the run() method, which extracts the column names from the first line of the input file.
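As a rough illustration of the header-first pattern, the sketch below mimics the shape of a run() loop in plain Java: the first record is consumed as the header and every later record is handed to map(). It does not use the Hadoop Mapper API; the class name HeaderFirstRunSketch, the accessor methods and the comma delimiter are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java stand-in for a mapper whose run() reads the header before the map loop.
final class HeaderFirstRunSketch {
    private String[] columns = new String[0];
    private final List<String> mappedLines = new ArrayList<>();

    void run(Iterator<String> records) {
        if (records.hasNext()) {
            // Assumption: comma-delimited input; limit -1 keeps trailing empty fields.
            columns = records.next().split(",", -1);
            for (int i = 0; i < columns.length; i++) {
                columns[i] = columns[i].trim();
            }
        }
        while (records.hasNext()) {
            map(records.next());
        }
    }

    // Placeholder for the real per-record logic; here it only records the line.
    void map(String line) {
        mappedLines.add(line);
    }

    String[] columns() {
        return columns;
    }

    List<String> mappedLines() {
        return mappedLines;
    }

    public static void main(String[] args) {
        HeaderFirstRunSketch sketch = new HeaderFirstRunSketch();
        sketch.run(Arrays.asList("id, name ,address", "1,Ann,", "2,,Pune").iterator());
        System.out.println(Arrays.toString(sketch.columns())); // [id, name, address]
        System.out.println(sketch.mappedLines().size());       // 2
    }
}
```

One caveat worth checking in a real job: when the input is divided into several splits, each mapper task executes its own run(), so only the task that reads the start of the file sees the actual header line.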

Also observe the map() method logic that assigns the counter value for missing and non-missing cells.
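The per-cell decision can be sketched as follows. This is a hedged, Hadoop-free illustration: the class name MapStepSketch, the comma delimiter and the rule that a cell is missing when it is blank or the literal text "null" are all assumptions, since the post's own code and input file are not shown here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of the map step: one (columnName, 0 or 1) pair per column.
final class MapStepSketch {

    // Assumption: blank cells and the literal "null" both count as missing.
    static boolean isMissing(String cell) {
        String trimmed = cell.trim();
        return trimmed.isEmpty() || trimmed.equalsIgnoreCase("null");
    }

    static List<Map.Entry<String, Integer>> map(String[] columns, String line) {
        // Limit -1 keeps trailing empty fields such as the last two cells of "7,,".
        String[] cells = line.split(",", -1);
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i < columns.length; i++) {
            // A row shorter than the header is treated as missing its trailing cells.
            boolean missing = i >= cells.length || isMissing(cells[i]);
            out.add(new SimpleEntry<String, Integer>(columns[i], missing ? 1 : 0));
        }
        return out;
    }

    public static void main(String[] args) {
        String[] columns = {"id", "name", "address"};
        System.out.println(map(columns, "7,,"));         // [id=0, name=1, address=1]
        System.out.println(map(columns, "8,Ann,null"));  // [id=0, name=0, address=1]
    }
}
```

The limit argument matters because String.split(",") without it discards trailing empty strings, so a line such as "7,," would yield a single field and the trailing blanks would have to be detected by length alone.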

Below is the implementation of custom reducer class – MissingCountReducer

Observe the highlighted logic in the run() method, which writes the header to the output file. The StringBuilder logic in the reduce() method combines multiple values into a single output column from the reducer.
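A minimal sketch of the header line and the StringBuilder packing is given below, in plain Java with no Hadoop types. The class name ReducerOutputSketch and the tab separator between the two packed values are assumptions; in the actual reducer the resulting string would be wrapped in a Text writable before being written.

```java
// Plain-Java sketch of how two values can be packed into one output field.
final class ReducerOutputSketch {

    // The three header fields named in the post, tab separated (separator assumed).
    static String headerLine() {
        return "ColumnName\tMissingCount\tPercentage";
    }

    // Combines the missing count and percentage into a single string value.
    static String combine(long missingCount, double percentage) {
        StringBuilder sb = new StringBuilder();
        sb.append(missingCount).append('\t').append(percentage);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(headerLine());
        // Key and packed value joined by a tab, as one output row.
        System.out.println("address" + "\t" + combine(130, 13.0));
    }
}
```

Packing both numbers into one string keeps the job on the built-in Text type; a custom Writable holding the two fields would be an alternative design, at the cost of more code.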

Below is the implementation of main driver class – MissingCount

All three classes can be placed in a single Java source file and compiled from the terminal or Eclipse, and the missingcount.jar file can then be built from the class files.

Run and Validate Output results:

Copy the input file missinginput.txt from the above path into an HDFS path, copy the missingcount.jar file to the working directory of the terminal, and run the hadoop jar command.



Output results:


Observe the header columns and the missing counts in the output. The total number of records for each column is 1000, so the percentage for the address column is 130 * 100 / 1000 = 13.0 %.

So, the results are accurate.
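The arithmetic can be checked with a small sketch. The class name PercentageSketch and the choice to return 0.0 for an empty column are assumptions for illustration; the point shown is that the division should be done in floating point, because integer division would drop the fractional part.

```java
// Plain-Java check of the percentage formula: missing * 100 / total.
final class PercentageSketch {

    static double percentage(long missing, long total) {
        if (total == 0) {
            return 0.0; // assumed convention for a column with no records
        }
        return missing * 100.0 / total; // 100.0 forces floating-point division
    }

    public static void main(String[] args) {
        System.out.println(percentage(130, 1000)); // 13.0
        System.out.println(135 * 100 / 1000);      // 13, integer division truncates
        System.out.println(percentage(135, 1000)); // 13.5
    }
}
```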

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
