Hadoop Online Tutorials

50 Mapreduce Interview Questions and Answers Part – 1

A few Hadoop Mapreduce interview questions and answers are presented in this post. These are suitable for both beginners and experienced Mapreduce developers.

Mapreduce Interview Questions and Answers for Freshers:

1.  What is Mapreduce ?

Mapreduce is a framework for processing big data (huge data sets) using a large number of commodity computers. It processes the data in two phases, namely the Map phase and the Reduce phase. This programming model is inherently parallel and can easily process large-scale data on commodity hardware itself.

It is tightly integrated with the Hadoop Distributed File System, so that processing is distributed across the data nodes of the cluster.

2. What is YARN ?

YARN stands for Yet Another Resource Negotiator, which is also called Next Generation Mapreduce, Mapreduce 2 or MRv2.

It was introduced in the Hadoop 0.23 release to overcome the scalability shortcomings of the classic Mapreduce framework, by splitting the responsibilities of the JobTracker into a global Resource Manager (with its Scheduler) and a per-application Application Master.

3.  What is data serialization ?

Serialization is the process of converting object data into a byte stream, either for transmission over the network across different nodes in a cluster or for persistent data storage.

4. What is deserialization of data ?

Deserialization is the reverse process of serialization: it converts byte stream data back into object data, for example when reading data from HDFS. Hadoop provides Writables for serialization and deserialization purposes.
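As a quick illustration (a minimal sketch, not tied to any particular job), an IntWritable can be serialized to a byte array with its write() method and restored with readFields():

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;

    public class WritableSerializationDemo {
        public static void main(String[] args) throws IOException {
            // Serialization: write the Writable's fields into a byte stream
            IntWritable original = new IntWritable(42);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));

            // Deserialization: read the byte stream back into a Writable object
            IntWritable restored = new IntWritable();
            restored.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));

            System.out.println(restored.get());   // prints 42
        }
    }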

5.  What are the Key/Value Pairs in Mapreduce framework ?

The Mapreduce framework implements a data model in which data is represented as key/value pairs. Both the input to and the output from the Mapreduce framework must be in key/value pairs only.

6.  What are the constraints to Key and Value classes in Mapreduce ?

Any data type used for a Value field in a mapper or reducer must implement the org.apache.hadoop.io.Writable interface, so that the field can be serialized and deserialized.

By default, Key fields must be comparable with each other, so they must implement Hadoop's org.apache.hadoop.io.WritableComparable interface, which in turn extends Hadoop's Writable interface and Java's java.lang.Comparable interface.
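For illustration, here is a minimal sketch of a custom key type; the class name YearTempKey and its fields are hypothetical:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Hypothetical composite key holding a year and a temperature reading.
    public class YearTempKey implements WritableComparable<YearTempKey> {
        private int year;
        private int temperature;

        public YearTempKey() { }                    // no-arg constructor required by Hadoop

        public YearTempKey(int year, int temperature) {
            this.year = year;
            this.temperature = temperature;
        }

        @Override
        public void write(DataOutput out) throws IOException {    // serialization
            out.writeInt(year);
            out.writeInt(temperature);
        }

        @Override
        public void readFields(DataInput in) throws IOException { // deserialization
            year = in.readInt();
            temperature = in.readInt();
        }

        @Override
        public int compareTo(YearTempKey other) {                 // ordering used in the sort phase
            int cmp = Integer.compare(year, other.year);
            return (cmp != 0) ? cmp : Integer.compare(temperature, other.temperature);
        }
    }

In practice, equals() and hashCode() are usually overridden as well, so that the default HashPartitioner sends equal keys to the same reducer.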

7.  What are the main components of Mapreduce Job ?

The main components of a Mapreduce job are:

- The driver (main) class, which configures the job and submits it to the cluster.
- The Mapper class, which extends org.apache.hadoop.mapreduce.Mapper and provides the map() method.
- The Reducer class, which extends org.apache.hadoop.mapreduce.Reducer and provides the reduce() method.

8.  What are the Main configuration parameters that a user needs to specify to run a Mapreduce Job ?

On a high level, the user of the Mapreduce framework needs to specify the following:

- The job's input location(s) in HDFS.
- The job's output location in HDFS.
- The input format and the output format.
- The classes containing the map and reduce functions.
- The JAR file containing the mapper, reducer and driver classes.

9.  What are the main components of Job flow in YARN architecture ?

A Mapreduce job flow on YARN involves the following components:

- The client node, which submits the Mapreduce job.
- The YARN Resource Manager, which coordinates the allocation of compute resources on the cluster.
- The YARN Node Managers, which launch and monitor the compute containers on the cluster machines.
- The Mapreduce Application Master, which coordinates the tasks running in the Mapreduce job.
- HDFS, which is used for sharing job files between the other entities.

10.  What is the role of Application Master in YARN architecture ?

Application Master performs the role of negotiating resources from the Resource Manager and working with the Node Manager(s) to execute and monitor the tasks.

The Application Master requests containers for all map and reduce tasks. Once containers are assigned to tasks, the Application Master starts the containers by notifying their Node Managers. It collects progress information from all tasks, and the aggregate values are propagated to the client node or user.

The Application Master is specific to a single application, which is a single job in classic Mapreduce or a cycle of jobs. Once the job execution is completed, the Application Master ceases to exist.

11.  What is identity Mapper ?

Identity Mapper is the default Mapper class provided by Hadoop. When no mapper class is specified in a Mapreduce job, this mapper is executed.

It doesn't process, manipulate or perform any computation on the input data; it simply writes the input data to the output. Its class name is org.apache.hadoop.mapred.lib.IdentityMapper.

12.  What is identity Reducer ?

It is the reduce-phase counterpart of the Identity Mapper. It simply passes the input key/value pairs through to the output directory. Its class name is org.apache.hadoop.mapred.lib.IdentityReducer.

When no reducer class is specified in Mapreduce job, then this class will be picked up by the job automatically.

13.  What is chain Mapper ?

Chain Mapper class is a special implementation of Mapper class through which a set of mapper classes can be run in a chain fashion, within a single map task.

In this chained execution, the output of the first mapper becomes the input of the second mapper, the second mapper's output goes to the third mapper, and so on until the last mapper in the chain.

Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainMapper (org.apache.hadoop.mapred.lib.ChainMapper in the old API).

14.  What is chain reducer ?

Chain Reducer is similar to Chain Mapper: it allows a single reducer, followed by a chain of mappers, to run within a single reduce task. Unlike Chain Mapper, no chain of reducers is executed; the single reducer runs first and its output is then passed through the chained mappers.

Its class name is org.apache.hadoop.mapreduce.lib.chain.ChainReducer (org.apache.hadoop.mapred.lib.ChainReducer in the old API).

15.  How can we mention multiple mappers and reducer classes in Chain Mapper or Chain Reducer classes ?

In ChainMapper, each mapper in the chain is added with the static ChainMapper.addMapper() method, passing the Job, the mapper class, its input and output key/value classes and a local Configuration.

In ChainReducer, the single reducer is set with ChainReducer.setReducer(), and any mappers that should run after the reducer within the same reduce task are added with ChainReducer.addMapper(), as shown in the sketch below.
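A minimal driver sketch, assuming AMapper, BMapper, WordReducer and PostMapper are user-defined mapper/reducer classes defined elsewhere (their names and key/value types are only illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "chain example");
            job.setJarByClass(ChainJobDriver.class);

            // Map task runs AMapper followed by BMapper
            ChainMapper.addMapper(job, AMapper.class, LongWritable.class, Text.class,
                    Text.class, Text.class, new Configuration(false));
            ChainMapper.addMapper(job, BMapper.class, Text.class, Text.class,
                    Text.class, Text.class, new Configuration(false));

            // Reduce task runs WordReducer followed by PostMapper
            ChainReducer.setReducer(job, WordReducer.class, Text.class, Text.class,
                    Text.class, Text.class, new Configuration(false));
            ChainReducer.addMapper(job, PostMapper.class, Text.class, Text.class,
                    Text.class, Text.class, new Configuration(false));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }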

16.  What is a combiner ?

A Combiner is a semi-reducer in the Mapreduce framework. It is an optional component and can be specified with the Job.setCombinerClass() method.

Combiner functions are suitable for producing summary information from a large data set. Hadoop does not guarantee how many times the combiner function will be called for each map output key; it may be called zero, one or many times.
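For example, a word-count style driver might reuse its reducer as the combiner (WordCountMapper and WordCountReducer are hypothetical classes defined elsewhere):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);     // hypothetical mapper
            job.setCombinerClass(WordCountReducer.class);  // optional combiner, run on map-side output
            job.setReducerClass(WordCountReducer.class);   // same class reused as the reducer

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }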

17.  What are the constraints on combiner implementation ?

The combiner class must implement the Reducer interface (or extend the Reducer class in the new API) and provide an implementation for the reduce() method. Because the combiner may run any number of times, its input and output key/value types must be identical and must match the map output types, which are also the reducer's input types.

18.  What are the advantages of a combiner over a reducer, or why do we need a combiner when we are using the same reducer class as the combiner class ?

The main purpose of a combiner in the Mapreduce framework is to limit the volume of data transferred between the map and reduce tasks.

It is common practice to use the same reducer class as the combiner class. In this case, the benefit of the combiner is to minimize the amount of intermediate data sent from the map nodes to the reduce phase.

19.  What are the primitive data types in Hadoop ?

Below is the list of primitive Writable data types available in Hadoop:

- BooleanWritable
- ByteWritable
- IntWritable
- VIntWritable (variable-length encoding)
- FloatWritable
- LongWritable
- VLongWritable (variable-length encoding)
- DoubleWritable

20.  What is NullWritable and how is it special from other Writable data types ?

NullWritable is a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable. So, in Mapreduce, a key or a value can be declared as a NullWritable when we don’t need to use that field.
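For instance, a mapper that only needs to emit the record text can declare its output value type as NullWritable (a rough sketch; the class name is illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits each input line as the key and discards the value field entirely.
    public class LinesOnlyMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // NullWritable.get() returns the singleton instance; no bytes are written for it.
            context.write(value, NullWritable.get());
        }
    }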

21.  What is Text data type in Hadoop and what are the differences from String data type in Java ?

Text is a Writable data type for serialization and deserialization of string data in Hadoop. It can be treated as a wrapper class for java.lang.String. Java Strings are immutable, whereas Hadoop's Text Writable is mutable, so a Text instance can be reused.
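A small sketch of the difference in behaviour:

    import org.apache.hadoop.io.Text;

    public class TextMutabilityDemo {
        public static void main(String[] args) {
            Text text = new Text("hadoop");
            System.out.println(text);      // hadoop

            // The same Text object is modified in place and can be reused,
            // whereas a java.lang.String can never be changed once created.
            text.set("mapreduce");
            System.out.println(text);      // mapreduce
        }
    }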

22.  What are the uses of GenericWritable class ?

GenericWritable is extended to provide an implementation that wraps value instances belonging to different data types under one Writable type.

Hadoop reducers do not allow multiple input value types. So whenever a mapper needs to emit values of several different types and the reducer has to process them as a single data type, a subclass of GenericWritable can be used.

23.  How to create multiple value type output from Mapper with IntWritable and Text Writable ?

Write a class extending org.apache.hadoop.io.GenericWritable and implement the getTypes() method to return an array of the Writable classes that may be wrapped, as in the sketch below.
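A minimal sketch, assuming the mapper needs to emit either IntWritable or Text values (the class name IntOrTextWritable is our own):

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Wraps either an IntWritable or a Text so both can travel as one value type.
    public class IntOrTextWritable extends GenericWritable {

        @SuppressWarnings("unchecked")
        private static final Class<? extends Writable>[] TYPES =
                new Class[] { IntWritable.class, Text.class };

        @Override
        protected Class<? extends Writable>[] getTypes() {
            return TYPES;
        }
    }

In the mapper, the actual value is wrapped with set() before being written to the context; in the reducer, get() returns the wrapped Writable, which can then be inspected with instanceof.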

24.  What is ObjectWritable data type in Hadoop ?

ObjectWritable is a general-purpose object wrapper which can be used to achieve the same objective as GenericWritable. The org.apache.hadoop.io.ObjectWritable class can handle Java primitive types, strings and arrays without the need for a custom Writable wrapper.

25.  How do we create Writable arrays in Hadoop ?

Hadoop provides two Writable data types for arrays: ArrayWritable for one-dimensional arrays and TwoDArrayWritable for two-dimensional arrays.

The elements of these arrays must themselves be Writable objects, such as IntWritable or LongWritable, not Java native types like int or float. For example, a minimal implementation of an array of IntWritables is shown below.
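A minimal sketch (the subclass name IntArrayWritable is our own choice):

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.IntWritable;

    // One-dimensional array of IntWritables; the no-arg constructor tells
    // ArrayWritable which element class to instantiate during deserialization.
    public class IntArrayWritable extends ArrayWritable {

        public IntArrayWritable() {
            super(IntWritable.class);
        }

        public IntArrayWritable(IntWritable[] values) {
            super(IntWritable.class, values);
        }
    }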

26.  What are the MapWritable data types available in Hadoop ?

Hadoop provides the following MapWritable data types, which implement the java.util.Map interface:

- MapWritable – a general-purpose map from Writable keys to Writable values.
- SortedMapWritable – a Writable implementation of java.util.SortedMap with WritableComparable keys.

Both extend the abstract base class AbstractMapWritable.

27.  What is speculative execution in Mapreduce ?

Speculative execution is a mechanism of running multiple copies of the same map or reduce task on different slave nodes to cope with slow-performing machines.

In large clusters of hundreds of machines, there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in hadoop can run multiple copies of same map or reduce task on different slave nodes. The results from first node to finish are used.

If other copies were executing speculatively, Hadoop tells the Task Trackers to abandon the tasks and discard their outputs.

28.  What will happen if we run a Mapreduce job with an output directory that already exists ?

The job will fail with org.apache.hadoop.mapred.FileAlreadyExistsException. In this case, delete the output directory and re-execute the job, or remove the directory programmatically before submitting the job, as in the sketch below.
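A rough sketch of deleting an existing output directory from the driver before the job is submitted (the path handling is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OutputDirCleaner {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path outputPath = new Path(args[0]);   // hypothetical output location
            FileSystem fs = FileSystem.get(conf);

            // Remove the directory (and its contents) if a previous run left it behind.
            if (fs.exists(outputPath)) {
                fs.delete(outputPath, true);       // 'true' deletes recursively
            }
        }
    }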

29.  What are the naming conventions for output files from Map phase and Reduce Phase ?

Output files from the map phase are named part-m-xxxxx and output files from the reduce phase are named part-r-xxxxx, with one file created per map or reduce task. Here xxxxx is the partition number, starting at 00000 and increasing sequentially (00001, 00002 and so on).

30.  Where is the output from Map tasks stored ?

The mapper's output (intermediate data) is stored on the local file system (not on HDFS) of each node running a map task. This is a temporary directory location which can be set with the mapreduce.cluster.local.dir configuration property. The intermediate data is deleted after the Hadoop job completes.

Mapreduce Interview Questions and Answers for Experienced:

31.  When will the reduce() method be called from reducers in a Mapreduce job flow ?

In a Mapreduce job, reducers do not start executing the reduce() method until all map tasks are completed. Reducers start copying intermediate output data from the mappers as soon as it is available, but the reduce() method is called only after all the mappers have finished.

32.  If reducers do not start before all the mappers are completed, then why does the progress of a Mapreduce job show something like Map(80%) Reduce(20%) ?

As said above, reducers start copying intermediate output data from map tasks as soon as it is available, and the task progress calculation counts this data copying (the shuffle) as well. So even though the actual reduce() method has not yet been run on the map output data, the job progress may display the reduce phase as 10% or 20% complete. The actual reduce() processing starts only after the map phase is 100% complete.

33.  Where is the output from Reduce tasks stored ?

The output from reducers is stored on HDFS, not on the local file system. Each reducer stores its part-r-xxxxx output file in the output directory specified for the Mapreduce job, rather than on the local FS. Map task output files, in contrast, are not stored on HDFS; they are kept on the local file system of the individual nodes.

34.  Can we set arbitrary number of Map tasks in a mapreduce job ?

No. We cannot set an arbitrary number of map tasks in a Mapreduce job; the number of map tasks is driven by the number of input splits. A default can be suggested at site level with the mapreduce.job.maps configuration property in mapred-site.xml.

35.  Can we set arbitrary number of Reduce tasks in a mapreduce job and if yes, how ?

Yes, we can set the number of reduce tasks at job level in Mapreduce. An arbitrary number of reduce tasks for a job can be set with job.setNumReduceTasks(N);

Here N is the number of reduce tasks of our choice. The number of reduce tasks can also be set at site level with the mapreduce.job.reduces configuration property in mapred-site.xml.
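For example, in the driver (assuming job is an org.apache.hadoop.mapreduce.Job instance that has already been created):

    // Request 10 reduce tasks for this job; the output will consist of
    // part-r-00000 through part-r-00009.
    job.setNumReduceTasks(10);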

36.  What happens if we don’t override the mapper methods and keep them as they are ?

If we do not override any mapper methods, the default Mapper acts as an identity mapper and directly emits each input record, unchanged, as an output record.

37.  What is the use of Context object ?

The job Context object gives access to the configuration data of the job, and the Map/Reduce Context object allows a mapper or reducer to interact with the rest of the Hadoop system: it is used to emit output key/value pairs, update counters and report progress.
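A small illustrative mapper (the class name, property name and counter names are hypothetical) showing typical uses of the Context object:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LineLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Read a (hypothetical) property from the job configuration via the context
            int minLength = context.getConfiguration().getInt("demo.min.length", 0);

            if (value.getLength() >= minLength) {
                // Emit an output key/value pair through the context
                context.write(value, new IntWritable(value.getLength()));
            }

            // Update a counter through the context
            context.getCounter("demo", "lines.seen").increment(1);
        }
    }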

38.  Can Reducers talk with each other ?

No, Reducers run in isolation. MapReduce programming model does not allow reducers to communicate with each other.

39.  What are the primary phases of the Mapper ? 

Primary phases of Mapper are: Record Reader, Mapper, Combiner and Partitioner.

40.  What are the primary phases of the Reducer ? 

Primary phases of the Reducer are: Shuffle, Sort, Reduce and Output Format.

41. What are the side effects of not running a secondary name node?

The cluster performance will degrade over time, since the edit log will grow bigger and bigger. If the secondary namenode is not running at all, the edit log grows significantly and slows the system down. Also, the system will go into safe mode for an extended time on restart, since the namenode needs to merge the edit log with the current file system checkpoint image.

42. How many racks do you need to create an Hadoop cluster in order to make sure that the cluster operates reliably?

In order to ensure reliable operation, it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that allows data distribution between different racks based on the configuration.

43. What is the procedure for namenode recovery?

A namenode can be recovered in two ways:

- By starting a new namenode using the file system metadata (fsimage and edits) from one of the extra storage locations configured through dfs.namenode.name.dir, for example a copy kept on a remote NFS mount.
- By using the latest checkpoint image from the secondary namenode to start a new namenode; in this case, the edits made after the last checkpoint are lost.

44. Hadoop WebUI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

This means that the namenode is moving the replicas stored on those datanodes to the remaining datanodes. There is a possibility that data can be lost if the administrator removes those datanodes from the network before decommissioning has finished.

45. What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.

46. If the Hadoop administrator needs to make a change, which configuration file does he need to change?

Each node in the Hadoop cluster has its own configuration files, and the changes need to be made in every file. One of the reasons for this is that the configuration can be different for every node.

47. Map Reduce jobs take too long. What can be done to improve the performance of the cluster?

One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of the tasks. The number of tasks has to match the number of available slots on the cluster. Hadoop is not a hardware-aware system; it is the responsibility of the developers and the administrators to make sure that resource supply and demand match.

48. After increasing the replication level, I still see that data is under replicated. What could be wrong?

Data replication takes time due to the large quantities of data involved. The Hadoop administrator should allow sufficient time for data replication, depending on the data size. If the data size is big enough, it is not uncommon for replication to take from a few minutes to a few hours.
