Use Case Description:
In this post we will discuss about the usage of Mapreduce Multiple Outputs Output format in Mapreduce jobs by taking one real world use case. In this, we are considering an use case to generate multiple output file names from reducer and these file names should be based on the certain input data parameters. I.e. We need control over the naming of the
In this scenario, we have below sample input data in JSON format which contains data from various types (country, state, city, street, zip). We need to create sub directories up to 5 levels in the format of country/state/city/street/zip and partition the input records by country, state, city, street and zip.
Input file –> json_input
In Mapreduce, by default, one output file per reducer will be created, and files are named by the partition number: part-r-00000, part-r-00001, etc. But in this scenario, we need file names in the format of us/nz/al/lst/1000-r-0000* format and input data records segregated appropriately into each sub directory based on the country, state, city, etc… values.
MapReduce comes with the MultipleOutputs output format class to help us do this. For details on concept of MultipleOutputs please refer the post Mapreduce Output formats.
We will write mapreduce program using MultipleOutputs to partition the data by country, state, city, street and zip. MultipleOutputs allows us to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string.
File names are of the form name -m- nnnnn for map outputs and name -r- nnnnn for reduce outputs, where name is an arbitrary name that can be set by us in the program, and nnnnn is an integer designating the part number, starting from zero.
How to zip It:
- We need JSONObject to parse our input data and we will build the key with required directory structure in mapper itself and pass our (key,value) pairs to reducer.
- In reducer we will create an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a file name.
- Finally close the MultipleOutputs instance in cleanup() method in reducer.
Below are the zip changes that can be used to perform above activities.