This post presents a selection of Hadoop MapReduce interview questions and answers. They are suitable for both beginners and experienced MapReduce developers.
MapReduce Interview Questions and Answers for Freshers:
1. What is MapReduce?
MapReduce is a framework for processing big data, that is, huge data sets spread across a large number of commodity computers. It processes the data in two phases, the Map phase and the Reduce phase. The programming model is inherently parallel, so it can process large-scale data on commodity hardware alone.
It is tightly integrated with the Hadoop Distributed File System (HDFS), so processing is distributed across the data nodes of a cluster.
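The two phases can be illustrated with a word count in plain Java. This is only a single-process sketch that does not use the Hadoop API; the class name WordCountSketch and its methods are hypothetical, and the in-memory grouping step stands in for the shuffle that Hadoop performs between the phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; not Hadoop code.
class WordCountSketch {
    // Map phase: emit a (word, 1) pair for every word in one input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: sum all values that were collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Stand-in for the shuffle: group the mapped values by key, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints {big=2, cluster=1, data=2, node=1}
        System.out.println(run(Arrays.asList("big data big cluster", "data node")));
    }
}
```

In a real cluster the map calls run in parallel on different nodes, and each reduce call receives the values for its key from all of them.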
2. What is YARN?
YARN stands for Yet Another Resource Negotiator. It is also called Next Generation MapReduce, MapReduce 2, or MRv2.
It was introduced in the Hadoop 0.23 release to overcome the scalability shortcomings of the classic MapReduce framework. It splits the responsibilities of the classic Job Tracker between a global Resource Manager, which handles cluster resource management and scheduling, and a per-application Application Master, which handles job scheduling and monitoring.
3. What is data serialization?
Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage.
4. What is deserialization of data?
Deserialization is the reverse of serialization: it converts byte stream data back into object data, for example when data is read from HDFS or received from another node. Hadoop provides Writables for serialization and deserialization.
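A round trip through a byte array makes both directions concrete. The sketch below uses only the JDK; the PageView class is hypothetical, and its write and readFields methods merely mirror the method names of Hadoop's Writable interface without implementing it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration; does not depend on Hadoop.
class SerializationSketch {
    static class PageView {
        String url = "";
        int hits;

        // Serialization: write the fields to a byte stream in a fixed order.
        void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeInt(hits);
        }

        // Deserialization: read the fields back in exactly the same order.
        void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            hits = in.readInt();
        }
    }

    static byte[] serialize(PageView view) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            view.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static PageView deserialize(byte[] bytes) {
        PageView view = new PageView();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            view.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return view;
    }

    public static void main(String[] args) {
        PageView original = new PageView();
        original.url = "/home";
        original.hits = 42;
        byte[] bytes = serialize(original);
        PageView copy = deserialize(bytes);
        // Prints 11 bytes -> /home 42
        System.out.println(bytes.length + " bytes -> " + copy.url + " " + copy.hits);
    }
}
```

The byte array is what would travel over the network or be written to disk; the receiving side only needs to know the field order to rebuild the object.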
5. What are the key/value pairs in the MapReduce framework?
The MapReduce framework implements a data model in which data is represented as key/value pairs. Both the input to and the output from a MapReduce job must be key/value pairs.
6. What are the constraints on key and value classes in MapReduce?
Any data type used for a value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface so that the field can be serialized and deserialized.
Key fields must additionally be comparable with each other, because the framework sorts records by key. By default they must therefore implement Hadoop’s org.apache.hadoop.io.WritableComparable interface, which extends both Hadoop’s Writable interface and java.lang.Comparable.
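The comparability requirement on keys can be sketched with a composite key in plain Java. The YearMonthKey class below is hypothetical and implements only java.lang.Comparable; a real Hadoop key would implement WritableComparable and therefore also provide the write and readFields methods of Writable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration; does not depend on Hadoop.
class KeyOrderingSketch {
    static final class YearMonthKey implements Comparable<YearMonthKey> {
        final int year;
        final int month;

        YearMonthKey(int year, int month) {
            this.year = year;
            this.month = month;
        }

        // Defines the sort order of keys: by year first, then by month.
        @Override
        public int compareTo(YearMonthKey other) {
            int byYear = Integer.compare(year, other.year);
            return byYear != 0 ? byYear : Integer.compare(month, other.month);
        }

        // equals and hashCode are kept consistent with compareTo so that
        // equal keys are grouped together and hash to the same value.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof YearMonthKey)) {
                return false;
            }
            YearMonthKey other = (YearMonthKey) o;
            return year == other.year && month == other.month;
        }

        @Override
        public int hashCode() {
            return Objects.hash(year, month);
        }

        @Override
        public String toString() {
            return year + "-" + (month < 10 ? "0" : "") + month;
        }
    }

    // Returns a sorted copy, the order in which a reducer would see the keys.
    static List<YearMonthKey> sorted(List<YearMonthKey> keys) {
        List<YearMonthKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<YearMonthKey> keys = Arrays.asList(
                new YearMonthKey(2015, 3),
                new YearMonthKey(2014, 12),
                new YearMonthKey(2015, 1));
        // Prints [2014-12, 2015-01, 2015-03]
        System.out.println(sorted(keys));
    }
}
```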
7. What are the main components of a MapReduce job?
- A main driver class, which provides the job configuration parameters.
- A mapper class, which must extend the org.apache.hadoop.mapreduce.Mapper class and provide an implementation of the map() method.
- A reducer class, which should extend the org.apache.hadoop.mapreduce.Reducer class and provide an implementation of the reduce() method.
8. What are the main configuration parameters that the user needs to specify to run a MapReduce job?
At a high level, the user of the MapReduce framework needs to specify the following:
- The job’s input location(s) in the distributed file system.
- The job’s output location in the distributed file system.
- The input format.
- The output format.
- The class containing the map function.
- The class containing the reduce function (optional).
- The JAR file containing the mapper, reducer, and driver classes.
9. What are the main components of the job flow in the YARN architecture?
A MapReduce job flow on YARN involves the following components:
- A client node, which submits the MapReduce job.
- The YARN Resource Manager, which allocates the cluster resources to jobs.
- The YARN Node Managers, which launch and monitor the tasks of jobs.
- The MapReduce Application Master, which coordinates the tasks running in the MapReduce job.
- HDFS, which is used for sharing job files between the above entities.
10. What is the role of the Application Master in the YARN architecture?
The Application Master negotiates resources from the Resource Manager and works with the Node Manager(s) to execute and monitor the tasks.
The Application Master requests containers for all map tasks and reduce tasks. Once containers are assigned to tasks, the Application Master starts them by notifying the corresponding Node Manager. It collects progress information from all tasks and propagates the aggregated values to the client node or user.
An Application Master is specific to a single application, which is either a single job in the classic MapReduce sense or a DAG of jobs. Once the application has finished executing, its Application Master no longer exists.
11. What is Identity Mapper?
Identity Mapper is the default mapper class provided by Hadoop. When no mapper class is specified in a MapReduce job, this mapper is executed.
It does not process, manipulate, or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper. In the newer org.apache.hadoop.mapreduce API, the base Mapper class itself behaves as an identity mapper.
12. What is Identity Reducer?
It is the reduce-phase counterpart of the Identity Mapper. It simply passes the input key/value pairs through to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer.
When no reducer class is specified in a MapReduce job, this class is picked up by the job automatically.
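The effect of running a job with only the identity classes can be sketched in plain Java. The class and method names below are hypothetical and no Hadoop API is used. Because neither function changes a record, the only visible effect is the grouping and sorting by key that happens between the two phases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; does not depend on Hadoop.
class IdentitySketch {
    // Identity map: emit the input pair unchanged.
    static String[] identityMap(String key, String value) {
        return new String[] {key, value};
    }

    // Identity reduce: emit every value under its key unchanged.
    static List<String[]> identityReduce(String key, List<String> values) {
        List<String[]> out = new ArrayList<>();
        for (String value : values) {
            out.add(new String[] {key, value});
        }
        return out;
    }

    static List<String> run(String[][] input) {
        // Stand-in for the shuffle: group by key, sorted by key.
        // Values keep their arrival order here; a real cluster gives no such guarantee.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : input) {
            String[] mapped = identityMap(pair[0], pair[1]);
            grouped.computeIfAbsent(mapped[0], k -> new ArrayList<>()).add(mapped[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            for (String[] pair : identityReduce(group.getKey(), group.getValue())) {
                output.add(pair[0] + "=" + pair[1]);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        String[][] input = {{"b", "2"}, {"a", "1"}, {"b", "3"}};
        // Prints [a=1, b=2, b=3]
        System.out.println(run(input));
    }
}
```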
13. What is Chain Mapper?
The Chain Mapper class is a special implementation of the Mapper class through which a set of mapper classes can be run in a chain, within a single map task.
In this chained execution, the output of the first mapper becomes the input of the second mapper, the output of the second becomes the input of the third, and so on until the last mapper.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper.
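The chaining idea can be sketched in plain Java as a pipeline in which every record emitted by one mapper is fed straight into the next. The SimpleMapper interface and the three example mappers are hypothetical stand-ins, not the Hadoop Mapper class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.BiConsumer;

// Hypothetical illustration; does not depend on Hadoop.
class ChainMapperSketch {
    // Stand-in for a mapper: may emit zero or more pairs per input pair.
    interface SimpleMapper {
        void map(String key, String value, BiConsumer<String, String> out);
    }

    // Passes one record through mappers[index], and each emitted record on
    // to mappers[index + 1], until the end of the chain reaches the sink.
    static void runChain(List<SimpleMapper> mappers, int index,
                         String key, String value, BiConsumer<String, String> sink) {
        if (index == mappers.size()) {
            sink.accept(key, value);
            return;
        }
        mappers.get(index).map(key, value,
                (k, v) -> runChain(mappers, index + 1, k, v, sink));
    }

    static List<String> run(List<String> lines) {
        // Mapper 1: split a line into (word, "1") pairs.
        SimpleMapper tokenize = (k, v, out) -> {
            for (String word : v.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word, "1");
                }
            }
        };
        // Mapper 2: normalize the key to lower case.
        SimpleMapper lowerCase = (k, v, out) -> out.accept(k.toLowerCase(Locale.ROOT), v);
        // Mapper 3: drop words of one or two characters.
        SimpleMapper dropShort = (k, v, out) -> {
            if (k.length() > 2) {
                out.accept(k, v);
            }
        };
        List<SimpleMapper> chain = Arrays.asList(tokenize, lowerCase, dropShort);

        List<String> result = new ArrayList<>();
        for (String line : lines) {
            runChain(chain, 0, "", line, (k, v) -> result.add(k + "=" + v));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [hadoop=1, fast=1]
        System.out.println(run(Arrays.asList("Hadoop is FAST")));
    }
}
```

Because the records are handed from mapper to mapper in memory, no intermediate output has to be written between the steps, which is the main attraction of chaining inside one task.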
14. What is Chain Reducer?
Chain Reducer is the reduce-side counterpart of Chain Mapper. It allows a single reducer followed by a chain of mappers to be run within a single reduce task. Unlike Chain Mapper, it does not chain several reducers: only one reducer is executed, and each record it outputs is then passed through the chained mappers.
Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer.
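As described in the Hadoop API documentation for ChainReducer, the reducer runs first and each record it outputs is then piped through the added mappers inside the same reduce task. The plain-Java sketch below imitates that order; the PostMapper interface, the two example mappers, and the threshold of 2 are hypothetical choices for illustration only.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; does not depend on Hadoop.
class ChainReducerSketch {
    // Stand-in for a mapper that runs after the reducer.
    interface PostMapper {
        List<Map.Entry<String, Integer>> map(Map.Entry<String, Integer> record);
    }

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // The single reducer: sum the values of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    static List<String> run(Map<String, List<Integer>> grouped) {
        // Post-reduce mapper 1: keep only keys whose total is at least 2.
        PostMapper dropRare = r -> {
            if (r.getValue() >= 2) {
                return Collections.singletonList(r);
            }
            return Collections.emptyList();
        };
        // Post-reduce mapper 2: upper-case the key.
        PostMapper upperCase = r -> Collections.singletonList(
                pair(r.getKey().toUpperCase(Locale.ROOT), r.getValue()));
        List<PostMapper> mappers = Arrays.asList(dropRare, upperCase);

        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> group : new TreeMap<>(grouped).entrySet()) {
            // Step 1: the reducer produces one record for this key.
            List<Map.Entry<String, Integer>> records = new ArrayList<>();
            records.add(pair(group.getKey(), reduce(group.getValue())));
            // Step 2: each mapper consumes the previous step's output.
            for (PostMapper mapper : mappers) {
                List<Map.Entry<String, Integer>> next = new ArrayList<>();
                for (Map.Entry<String, Integer> record : records) {
                    next.addAll(mapper.map(record));
                }
                records = next;
            }
            for (Map.Entry<String, Integer> record : records) {
                output.add(record.getKey() + "=" + record.getValue());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        grouped.put("data", Arrays.asList(1, 1));
        grouped.put("node", Arrays.asList(1));
        grouped.put("big", Arrays.asList(1, 1, 1));
        // Prints [BIG=3, DATA=2]
        System.out.println(run(grouped));
    }
}
```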
15. How can we specify multiple mapper classes and the reducer class with Chain Mapper or Chain Reducer?
- In Chain Mapper, the ChainMapper.addMapper() method is used to add mapper classes to the map-side chain.
- In Chain Reducer, the ChainReducer.setReducer() method is used to specify the single reducer class.
- The ChainReducer.addMapper() method can then be used to add mapper classes that run after the reducer.