Use Case Description:
This post describes an approach to a common scenario: an input file contains several columns with their values stored as records, but some columns hold blanks or nulls instead of actual values, i.e. data is missing for those columns.
The developer needs to write a MapReduce program that calculates the missing count and the missing percentage for each input column.
For example, take the text file below as the input; the reducer should output each column together with its missing count and percentage.
Input file –> missinginput
The MapReduce program must calculate the missing count and percentage for each column of the input file above.
How to Do It:
- In the Mapper class, the offset of each line is the map input key and the line text is the map input value.
- In the map() method of our custom mapper class, we can parse the input value into separate columns with the split() method of the String class.
- To extract the header columns, we can override the default run() method of the Mapper class, convert the first input line to a string and parse it into column names. These names can be stored in a String array and used as output keys when writing missing counts in map().
- In map(), each column name is the map output key. For each missing cell we emit an IntWritable value of 1, and for each non-missing cell a value of 0.
- In our custom reducer class, the column names are the input keys and the 0/1 counts are the input values.
- The reducer output needs a header row with three fields: ColumnName, MissingCount and Percentage. This header can be written by overriding the run() method of the reducer class.
- For each key, summing the reducer input values gives the total missing count, and counting the input values gives the total number of records for that column.
- From the missing count and the total count for each column, we can calculate the percentage of missing values for that column (missing count / total count * 100).
- Each column name is written as the Text reduce output key. The missing count and percentage need to be combined into a single field so that the reducer can write them as one output value.
- Because a reducer supports only one Writable type for its output value, we can combine the two values into a String with StringBuilder and assign that string to a Text Writable, which is then written as the reducer's output value. This way the reducer can output multiple values per key.
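One detail of the parsing step is worth a small sketch. In Java, String.split() with a single argument drops trailing empty strings, so a record whose last cells are blank would lose exactly the values we want to count. The snippet below is a plain-Java illustration, not part of the original program; the comma delimiter, the rule for what counts as "missing", and the names SplitDemo, parse and isMissing are all assumptions made for the example.

```java
class SplitDemo {
    // Assumption: the input is comma-delimited (the post does not state the delimiter).
    static String[] parse(String line) {
        // A negative limit keeps trailing empty strings,
        // so blank cells at the end of a record are preserved.
        return line.split(",", -1);
    }

    // Assumption: a cell is "missing" if it is empty, whitespace-only,
    // or the literal text "null" in any letter case.
    static boolean isMissing(String cell) {
        String trimmed = cell.trim();
        return trimmed.isEmpty() || trimmed.equalsIgnoreCase("null");
    }

    public static void main(String[] args) {
        String line = "101,,NY,";                     // hypothetical record
        System.out.println(line.split(",").length);   // prints 3: trailing blank cell lost
        System.out.println(parse(line).length);       // prints 4: trailing blank cell kept
    }
}
```

With the one-argument form, the last column of this record would never reach the missing-value check, and its missing count would come out too low.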
Below is the implementation of our custom Mapper class, MissingCountMapper.
Observe the highlighted code in the run() method, which extracts the column names from the first line of the input file.
Also note the map() method logic, which assigns a counter value of 1 for missing cells and 0 for non-missing cells.
Below is the implementation of our custom reducer class, MissingCountReducer.
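The run() and map() logic described here can also be sketched in plain Java, without the Hadoop API. This is an illustrative sketch, not the MissingCountMapper listing itself: the class name MissingCountMapperSketch, the comma delimiter, the missing-value rule and the sample records are assumptions, and plain (String, Integer) pairs stand in for the Text and IntWritable output. It also assumes the header is the first line the mapper sees, which in a real job should hold only for the input split that starts at the beginning of the file.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

class MissingCountMapperSketch {
    private String[] columns; // header names, filled from the first line

    // Plays the role of the overridden run(): read the header, then map each record.
    List<Map.Entry<String, Integer>> run(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (lines.isEmpty()) {
            return out;
        }
        columns = lines.get(0).split(",", -1);
        for (String line : lines.subList(1, lines.size())) {
            map(line, out);
        }
        return out;
    }

    // Plays the role of map(): emit (column, 1) for a missing cell, (column, 0) otherwise.
    void map(String line, List<Map.Entry<String, Integer>> out) {
        String[] cells = line.split(",", -1);
        for (int i = 0; i < columns.length; i++) {
            // A record shorter than the header counts as missing for the absent columns.
            boolean missing = i >= cells.length || isMissing(cells[i]);
            out.add(new SimpleEntry<>(columns[i], missing ? 1 : 0));
        }
    }

    // Assumption: empty, whitespace-only, or the literal "null" means missing.
    static boolean isMissing(String cell) {
        String trimmed = cell.trim();
        return trimmed.isEmpty() || trimmed.equalsIgnoreCase("null");
    }

    public static void main(String[] args) {
        // Hypothetical input: one header line and two records.
        List<String> lines = Arrays.asList("id,name,city", "1,,NY", "2,Bob,");
        System.out.println(new MissingCountMapperSketch().run(lines));
        // prints [id=0, name=1, city=0, id=0, name=0, city=1]
    }
}
```

Emitting 0 for non-missing cells, rather than emitting nothing, is what lets the reducer recover the total record count per column simply by counting its input values.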
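The reducer-side arithmetic can be sketched the same way in plain Java. Again this is a hedged illustration and not the MissingCountReducer listing: the class name MissingCountReducerSketch, the tab separator, the two-decimal percentage format and the sample values are assumptions, and a returned String stands in for the Text key and Text value that the real reducer would write.

```java
import java.util.Arrays;
import java.util.Locale;

class MissingCountReducerSketch {
    // Header row that the real reducer would write once from its overridden run().
    static final String HEADER = "ColumnName\tMissingCount\tPercentage";

    // Plays the role of reduce(): one call per column name.
    static String reduce(String column, Iterable<Integer> values) {
        int missing = 0; // sum of the 1s emitted for missing cells
        int total = 0;   // number of values seen, i.e. records for this column
        for (int v : values) { // single pass: sum and count together
            missing += v;
            total++;
        }
        double percentage = total == 0 ? 0.0 : 100.0 * missing / total;

        // Combine both numbers into one string, as the single output value.
        StringBuilder value = new StringBuilder();
        value.append(missing)
             .append('\t')
             .append(String.format(Locale.ROOT, "%.2f", percentage));
        return column + "\t" + value;
    }

    public static void main(String[] args) {
        System.out.println(HEADER);
        // Hypothetical mapper output for one column: 2 missing out of 4 records.
        System.out.println(reduce("name", Arrays.asList(1, 0, 0, 1)));
        // prints name	2	50.00 (tab-separated)
    }
}
```

Summing and counting in the same loop matters in a real job, because the values Iterable that Hadoop passes to reduce() is meant to be traversed once.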