Mapreduce Interview Questions and Answers for Experienced Part – 2


Below are a few more Hadoop MapReduce interview questions and answers for experienced and fresher Hadoop developers.

Hadoop Mapreduce Interview Questions and Answers for Experienced:
1.  What is side data distribution in the MapReduce framework?

The extra read-only data needed by a MapReduce job to process its main dataset is called side data.

There are two ways to make side data available to all the map or reduce tasks:

    • Job Configuration
    • Distributed cache
2.  How do you distribute side data using the job configuration?

Side data can be distributed by setting arbitrary key-value pairs in the job configuration using the various setter methods on the Configuration object.

In the task, we can retrieve the data from the configuration returned by Context's getConfiguration() method.
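As a minimal sketch of this pattern (the key name myjob.delimiter and the class names are made up for illustration, not built-in Hadoop properties), the driver sets a small value on the Configuration and the mapper reads it back in setup():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideDataByConfiguration {

        public static class DemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private String delimiter;

            @Override
            protected void setup(Context context) {
                // Retrieve the side data from the configuration returned by Context
                delimiter = context.getConfiguration().get("myjob.delimiter", ",");
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Driver side: stash a small piece of metadata as an arbitrary key-value pair
            conf.set("myjob.delimiter", "|");
            Job job = Job.getInstance(conf, "side-data-demo");
            job.setMapperClass(DemoMapper.class);
            // ... input/output paths and formats would be set here as usual
        }
    }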

3.  When should we distribute side data through the job configuration, and when should we avoid it?

Side data distribution by job configuration is useful only when we need to pass a small piece of metadata to the map/reduce tasks.

We shouldn't use this mechanism for transferring more than a few kilobytes of data, because it puts pressure on memory usage, particularly in a system running hundreds of jobs.

4.  What is the distributed cache in MapReduce?

The distributed cache mechanism is an alternative way of distributing side data: it copies files and archives to the task nodes in time for the tasks to use them when they run.

To save network bandwidth, files are normally copied to any particular node only once per job.

5.  How do you supply files or archives to a MapReduce job through the distributed cache mechanism?

The files that need to be distributed can be specified as a comma-separated list of URIs as the argument to the -files option of the hadoop job command. The files can be on the local file system or on HDFS.

Archive files (ZIP files, tar files, and gzipped tar files) can also be copied to the task nodes by the distributed cache using the -archives option; these are unarchived on the task node.

The -libjars option adds JAR files to the classpath of the mapper and reducer tasks.
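If the job is configured in the driver rather than on the command line, the Job API offers programmatic equivalents of these options; a hedged sketch, in which the HDFS paths are only placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class CacheSetupSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");

            // Equivalent of -files: ship a lookup file to each task node
            job.addCacheFile(new URI("hdfs:///user/demo/lookup.txt"));

            // Equivalent of -archives: the archive is unpacked on the task node
            job.addCacheArchive(new URI("hdfs:///user/demo/dictionaries.tar.gz"));

            // Equivalent of -libjars: the JAR is added to the task classpath
            job.addFileToClassPath(new Path("/user/demo/extra-lib.jar"));
        }
    }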

6.  How does the distributed cache work in the MapReduce framework?

When a MapReduce job is submitted with distributed cache options, the node managers copy the files specified by the -files, -archives and -libjars options from the distributed cache to a local disk. The files are said to be localized at this point.

The local.cache.size property can be configured to set the cache size on the node managers' local disks. Files are localized under the ${hadoop.tmp.dir}/mapred/local directory on the node manager nodes.
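To illustrate what localization means for the task code, below is a hedged sketch of a mapper that reads a file shipped with -files; the file name lookup.txt and the tab-separated key/value layout are assumptions made only for this example:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LocalizedFileMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        private final Map<String, String> lookup = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException {
            // A file shipped with "-files lookup.txt" is localized on the node manager
            // and symlinked into the task's working directory under its base name,
            // so it can be read with plain local-file I/O.
            try (BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2);
                    if (parts.length == 2) {
                        lookup.put(parts[0], parts[1]);
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the in-memory lookup table while processing the main dataset
            String replacement = lookup.getOrDefault(value.toString(), value.toString());
            context.write(new Text(replacement), NullWritable.get());
        }
    }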

7.  What will Hadoop do when a task fails in a job that has, say, 50 spawned tasks?

It will restart the failed map or reduce task on another node manager, and only if the task fails more than four times will it kill the whole job. The maximum number of attempts for map and reduce tasks can be configured with the following properties in the mapred-site.xml file:

mapreduce.map.maxattempts

mapreduce.reduce.maxattempts

The default value for both properties is 4.
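For example, if a particular job should tolerate more failures than the default, these limits can also be raised per job through its configuration; a small sketch in which the values 8 and 6 are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RetryConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow map tasks up to 8 attempts and reduce tasks up to 6 before the job fails
            conf.setInt("mapreduce.map.maxattempts", 8);
            conf.setInt("mapreduce.reduce.maxattempts", 6);
            Job job = Job.getInstance(conf, "retry-demo");
            // ... remaining job setup as usual
        }
    }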

8.  Consider this scenario: in a MapReduce system the HDFS block size is 256 MB and we have 3 files of size 256 KB, 266 MB and 500 MB. How many input splits will the Hadoop framework make?

Hadoop will make 5 splits, as follows:

– 1 split for the 256 KB file

– 2 splits for the 266 MB file (one split of size 256 MB and another of size 10 MB)

– 2 splits for the 500 MB file (one split of size 256 MB and another of size 244 MB)
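The same arithmetic, expressed as a quick sanity check that mirrors the reasoning above (one split per full block plus one split for any remainder):

    public class SplitCountSketch {
        public static void main(String[] args) {
            long blockSize = 256L * 1024 * 1024;            // 256 MB split size
            long[] fileSizes = {256L * 1024,                // 256 KB
                                266L * 1024 * 1024,         // 266 MB
                                500L * 1024 * 1024};        // 500 MB

            int totalSplits = 0;
            for (long size : fileSizes) {
                // One split per full block, plus one for the remaining bytes
                int splits = (int) ((size + blockSize - 1) / blockSize);
                totalSplits += splits;
                System.out.println(size + " bytes -> " + splits + " split(s)");
            }
            System.out.println("Total splits: " + totalSplits);  // prints 5
        }
    }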

9.  Why can’t we just keep the file in HDFS and have the application read it from there, instead of using the distributed cache?

The distributed cache copies the file to every node manager at the start of the job. If a node manager then runs 10 or 50 map or reduce tasks, they all reuse the same local copy from the distributed cache.

On the other hand, if the file has to be read from HDFS within the job, then every map or reduce task accesses it from HDFS, so a node manager running 100 map tasks will read the file 100 times from HDFS. Accessing the file from the node manager's local file system is much faster than reading it from the HDFS data nodes.

10.  What mechanism does the Hadoop framework provide to synchronize changes made to the distributed cache while the application is running?

The distributed cache mechanism only distributes read-only data needed by a MapReduce job; it is not meant for files that can be updated. So there is no mechanism to synchronize changes made in the distributed cache, because changes to distributed cache files are not allowed.

A few more Hadoop MapReduce interview questions and answers for experienced developers will be published in upcoming posts in this category.


