50 Mapreduce Interview Questions and Answers Part – 1

Mapreduce Interview Questions and Answers for experienced
31. When is the reduce() method called by reducers in the MapReduce job flow?

In a MapReduce job, reducers do not start executing the reduce() method until all map tasks have completed. Reducers start copying intermediate output from the mappers as soon as it becomes available, but reduce() is called only after every mapper has finished.

32. If reducers do not start before all the mappers are completed, why does the progress of a MapReduce job show something like Map(80%) Reduce(20%)?

As noted above, reducers start copying intermediate output from map tasks as soon as it is available, and the task progress calculation counts this copying as part of the reduce phase. So even though the reduce() method has not yet run on any map output, the job can report reduce progress of 10% or 20%. The actual reduce() processing starts only after the map phase reaches 100%.
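The arithmetic behind such a display can be sketched in a few lines of Java. This is only an illustrative model: it assumes the commonly described scheme in which a reduce task's progress is split evenly across three phases (copy, sort, reduce), and the class and method names are hypothetical, not part of the Hadoop API.

```java
// Hypothetical model of why reduce progress can be non-zero before reduce() runs.
// Assumption: progress is weighted equally across the copy, sort and reduce phases.
final class ReduceProgressSketch {

    /** Each argument is the completed fraction (0.0 to 1.0) of that phase. */
    static double reduceProgress(double copyDone, double sortDone, double reduceDone) {
        return (copyDone + sortDone + reduceDone) / 3.0;
    }

    public static void main(String[] args) {
        // 60% of the map output has been copied; sort and reduce() have not started.
        double progress = reduceProgress(0.60, 0.0, 0.0);
        System.out.println("Reduce(" + Math.round(progress * 100) + "%)");
    }
}
```

Under this assumption, reduce progress cannot pass roughly 33% while any map task is still running, because the copy phase cannot finish before the last mapper does.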

33. Where is the output from Reduce tasks stored?

The output from reducers is stored on HDFS, not on the local file system. Each reducer writes its own part-r-xxxxx file to the output directory specified in the MapReduce job. Map task output, by contrast, is not stored on HDFS; it is written to the local file system of the node that ran the map task.
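As a small illustration of the part-r-xxxxx naming, the Java sketch below builds the file name for a given reducer number, zero-padded to five digits. The helper name and the output directory are hypothetical; this is not Hadoop's own code, only the naming pattern described above.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the part-r-xxxxx naming pattern.
final class PartFileNameSketch {

    /** Builds the output path for the reducer with the given partition number. */
    static String reducerOutputFile(String outputDir, int partition) {
        // %05d pads the partition number with leading zeros to five digits.
        return String.format(Locale.ROOT, "%s/part-r-%05d", outputDir, partition);
    }

    public static void main(String[] args) {
        // "/user/demo/output" is a made-up HDFS output directory.
        for (int partition = 0; partition < 3; partition++) {
            System.out.println(reducerOutputFile("/user/demo/output", partition));
        }
    }
}
```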

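34. Can we set an arbitrary number of Map tasks in a MapReduce job?

No. We cannot directly set the number of map tasks in a MapReduce job; it is determined by the number of input splits that the InputFormat produces. The mapreduce.job.maps configuration property (which can be set at site level in the mapred-site.xml file) is only a hint and does not fix the number of map tasks.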
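In practice the number of map tasks follows from the number of input splits. The Java sketch below is a simplified model of how a file-based input format might size and count splits for one non-empty, splittable file. The formula and the 1.1 slack factor are modeled on FileInputFormat as an assumption, and the class and method names are illustrative.

```java
// Simplified, illustrative model of split counting for a single splittable file.
// Assumption: split size = max(minSize, min(maxSize, blockSize)), with a 10% slack
// so that a small tail is folded into the last split instead of forming its own.
final class SplitCountSketch {

    static final double SPLIT_SLOP = 1.1;

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts the splits (and therefore map tasks) for a non-empty file. */
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the final (possibly slightly larger or smaller) split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mib, 1, Long.MAX_VALUE);
        // A 1 GiB file with 128 MiB blocks yields 8 splits, hence 8 map tasks.
        System.out.println(countSplits(1024 * mib, splitSize));
    }
}
```

Raising the minimum split size or lowering the maximum split size changes the split count, which is the usual indirect way to influence how many map tasks run.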

35. Can we set an arbitrary number of Reduce tasks in a MapReduce job and, if yes, how?

Yes, we can set the number of reduce tasks at job level in MapReduce. An arbitrary number of reduce tasks can be set for a job with job.setNumReduceTasks(N);

Here N is the number of reduce tasks of our choice. The number of reduce tasks can also be set at site level with the mapreduce.job.reduces configuration property in the mapred-site.xml file.
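The choice of N also decides how map output keys are spread across reducers. The Java sketch below uses the same arithmetic as Hadoop's default hash partitioning, (hashCode & Integer.MAX_VALUE) % N, on plain Java strings. The class name is hypothetical, and real jobs hash Writable keys such as Text, so the actual partition numbers will differ.

```java
// Illustrative sketch of hash partitioning across N reduce tasks.
final class HashPartitionSketch {

    /** Returns the reducer index (0 to numReduceTasks - 1) that receives the key. */
    static int partitionFor(Object key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // the N passed to job.setNumReduceTasks(N)
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
    }
}
```

All records with the same key land on the same reducer, which is what lets each reduce() call see every value for its key.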

36. What happens if we don’t override the mapper methods and keep them as they are?

If we do not override any mapper methods, the mapper acts as the IdentityMapper: it emits each input record directly as an output record, unchanged.

37. What is the use of the Context object?

The job Context object contains configuration data for the job. The map Context object allows the mapper to interact with the rest of the Hadoop system. It allows the mapper to emit output.

38. Can Reducers talk with each other?

No, reducers run in isolation. The MapReduce programming model does not allow reducers to communicate with each other.

39. What are the primary phases of the Mapper?

Primary phases of Mapper are: Record Reader, Mapper, Combiner and Partitioner.

40. What are the primary phases of the Reducer?

Primary phases of Reducer are: Shuffle, Sort, Reducer and Output Formatter.
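To see how the mapper-side and reducer-side phases fit together, here is a small in-memory word count in plain Java. It is only a toy model of the flow (record reader, mapper, combiner, partitioner, then shuffle, sort and reducer); it does not use the Hadoop API, and all class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy, single-process model of the MapReduce phases for word count.
final class PhasesSketch {

    /** Record reader + mapper + combiner for one input split. */
    static Map<String, Integer> mapAndCombine(List<String> lines) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String line : lines) {                          // record reader: one record per line
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    combined.merge(word, 1, Integer::sum);   // mapper emits (word, 1); combiner pre-sums
                }
            }
        }
        return combined;
    }

    /** Partitioner: picks the reducer for a key. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Runs every phase and returns one key-sorted output map per reducer. */
    static List<Map<String, Integer>> run(List<List<String>> splits, int numReducers) {
        // Shuffle + sort: group values by key per reducer; TreeMap keeps keys sorted.
        List<TreeMap<String, List<Integer>>> shuffled = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            shuffled.add(new TreeMap<>());
        }
        for (List<String> split : splits) {
            for (Map.Entry<String, Integer> e : mapAndCombine(split).entrySet()) {
                int r = partition(e.getKey(), numReducers);
                shuffled.get(r).computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        // Reducer: sum the grouped values for each key.
        List<Map<String, Integer>> outputs = new ArrayList<>();
        for (TreeMap<String, List<Integer>> partitionData : shuffled) {
            Map<String, Integer> out = new TreeMap<>();
            for (Map.Entry<String, List<Integer>> e : partitionData.entrySet()) {
                int sum = 0;
                for (int v : e.getValue()) {
                    sum += v;
                }
                out.put(e.getKey(), sum);
            }
            outputs.add(out); // output formatter: one result set per reducer
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("to be or not"),
                Arrays.asList("to be"));
        System.out.println(run(splits, 1));
    }
}
```

Note that each reducer only ever reads its own partition, which matches the point that reducers do not communicate with each other.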

41. What are the side effects of not running a secondary name node?

If the secondary namenode is not running at all, the edit log will keep growing and cluster performance will degrade over time. Also, when the namenode restarts, the system will stay in safemode for an extended time, since the namenode needs to merge the edit log into the current fs checkpoint image.

42. How many racks do you need to create an Hadoop cluster in order to make sure that the cluster operates reliably?

To ensure reliable operation, it is recommended to have at least 2 racks with rack placement configured. Hadoop has a built-in rack awareness mechanism that distributes data between different racks based on the configuration.

43. What is the procedure for namenode recovery?

A namenode can be recovered in two ways:

  • Starting new namenode from backup metadata
  • Promoting secondary namenode to primary namenode.

44. Hadoop WebUI shows that half of the datanodes are in decommissioning mode. What does that mean? Is it safe to remove those nodes from the network?

This means that the namenode is trying to retrieve data from those datanodes by moving replicas to the remaining datanodes. It is not safe to remove them yet: data can be lost if the administrator removes those datanodes before decommissioning has finished.

45. What does the Hadoop administrator have to do after adding new datanodes to the Hadoop cluster?

Since the new nodes will not have any data on them, the administrator needs to start the balancer to redistribute data evenly between all nodes.

46. If the Hadoop administrator needs to make a change, which configuration file does he need to change?

Each node in the Hadoop cluster has its own configuration files, and the change needs to be made on every node. One of the reasons for this is that the configuration can be different for every node.

47. Map Reduce jobs take too long. What can be done to improve the performance of the cluster?

One of the most common reasons for performance problems on a Hadoop cluster is uneven distribution of tasks. The number of tasks has to match the number of available slots on the cluster. Hadoop is not a hardware-aware system. It is the responsibility of the developers and the administrators to make sure that resource supply and demand match.
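A rough way to reason about the match between tasks and slots is to count how many "waves" of tasks the cluster needs. The Java sketch below is a back-of-the-envelope model only: it assumes all tasks take about the same time and that every slot is free, and the class and method names are hypothetical.

```java
// Back-of-the-envelope model: how many rounds ("waves") of tasks a cluster runs.
final class TaskWaveSketch {

    /** Number of waves needed to run the given tasks on the given slots. */
    static int waves(int tasks, int slots) {
        if (tasks < 0 || slots <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and slots must be > 0");
        }
        return (tasks + slots - 1) / slots; // ceiling division
    }

    /** Fraction of slots kept busy during the final wave (tasks must be > 0). */
    static double lastWaveUtilization(int tasks, int slots) {
        int remainder = tasks % slots;
        return remainder == 0 ? 1.0 : (double) remainder / slots;
    }

    public static void main(String[] args) {
        // 101 tasks on 100 slots need 2 waves, and the second wave uses only 1 slot.
        System.out.println(waves(101, 100));
        System.out.println(lastWaveUtilization(101, 100));
    }
}
```

In this model, one extra task beyond the slot count roughly doubles the job's running time, which is why task counts that line up with the available slots tend to perform better.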

48. After increasing the replication level, I still see that data is under replicated. What could be wrong?

Data replication takes time because of the large quantities of data involved. The Hadoop administrator should allow sufficient time for replication, depending on the data size. If the data set is large enough, it is not uncommon for replication to take anywhere from a few minutes to a few hours.

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
