In this post, we will discuss another 50 MapReduce interview questions and answers for experienced MapReduce developers.
MapReduce Interview Questions and Answers for Experienced Developers
1. What are the methods in the Mapper class and order of their invocation?
The Mapper class contains a run() method, which calls its setup() method once, then calls the map() method for each input record, and finally calls its cleanup() method. We can override any of these methods in our code.
Each of these methods can access the job’s configuration data by using Context.getConfiguration().
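The run() method drives this call order. The sketch below mimics that template-method structure using only the standard library, so it runs without Hadoop on the classpath; it is an illustration of the call order, not the real org.apache.hadoop.mapreduce.Mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the call order Mapper.run() follows.
// Real mappers extend org.apache.hadoop.mapreduce.Mapper instead.
public class MapperOrderSketch {
    static final List<String> calls = new ArrayList<>();

    static void setup()            { calls.add("setup"); }
    static void map(String record) { calls.add("map(" + record + ")"); }
    static void cleanup()          { calls.add("cleanup"); }

    // Mirrors Mapper.run(): setup once, map per record, cleanup once.
    static void run(List<String> split) {
        setup();
        for (String record : split) {
            map(record);
        }
        cleanup();
    }

    public static void main(String[] args) {
        run(List.of("line1", "line2"));
        System.out.println(calls); // [setup, map(line1), map(line2), cleanup]
    }
}
```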
2. What are the methods in the Reducer class and the order of their invocation?
The Reducer class also contains a run() method, which calls its setup() method once, then calls the reduce() method for each key, and finally calls its cleanup() method. We can override any of these methods in our code.
3. How can we add arbitrary key-value pairs in our mapper?
We can set arbitrary (key, value) pairs of configuration data in our Job with Job.getConfiguration().set("key", "value"), and retrieve that data in the mapper with Context.getConfiguration().get("key").
This is typically done in the Mapper's setup() method.
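A minimal sketch of the two sides of this pattern, using the real Hadoop API (the key name "my.custom.param" and its value are placeholders; this fragment assumes the usual driver and Mapper boilerplate around it):

```java
// Driver side: stash a value in the job configuration.
Job job = Job.getInstance(new Configuration(), "example-job");
job.getConfiguration().set("my.custom.param", "42");

// Mapper side: read it back, typically in setup().
@Override
protected void setup(Context context) {
    String val = context.getConfiguration().get("my.custom.param"); // "42"
}
```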
4. Which object can be used to get the progress of a particular job?
The Job object: Job.mapProgress() and Job.reduceProgress() report the fraction of the job's map and reduce work completed so far.
5. How can we control which keys go to a specific reducer?
We can control which keys (and hence which records) are processed by a particular reducer by implementing a custom Partitioner class.
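In real Hadoop code, a custom partitioner subclasses org.apache.hadoop.mapreduce.Partitioner and is registered with job.setPartitionerClass(...). The heart of it is a deterministic key-to-partition function; below is a stdlib-only sketch of that logic, mirroring what Hadoop's default HashPartitioner does:

```java
public class PartitionSketch {
    // Mirrors the default HashPartitioner logic: mask the sign bit,
    // then take the remainder modulo the number of reducers.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        System.out.println(getPartition("apple", 4) == getPartition("apple", 4)); // true
        // Every partition index is within [0, numReduceTasks).
        int p = getPartition("banana", 4);
        System.out.println(p >= 0 && p < 4); // true
    }
}
```

Because the mapping is deterministic, all records sharing a key are routed to the same reducer.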
6. What is NLineInputFormat?
NLineInputFormat splits the input so that each split contains exactly 'n' lines, and therefore each mapper processes 'n' lines (except possibly the last).
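In the driver, this might be configured as follows (real Hadoop API; the value 100 is an arbitrary example):

```java
// Each input split (and hence each mapper) gets exactly 100 lines.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);
```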
7. What is the difference between an HDFS Block and Input Split?
HDFS Block is the physical division of the data and Input Split is the logical division of the data.
8. What is KeyValueTextInputFormat?
In KeyValueTextInputFormat, each line of the text file is a record. The first occurrence of the separator character (a tab by default) divides each line: everything before the separator is the key and everything after it is the value. Both the key and the value are of type Text.
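The separator can be changed from the default tab in the driver; a sketch using the Hadoop 2.x property name (the ',' separator is just an example):

```java
// Split each line at the first ',' instead of the first tab.
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);
```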
9. Why can't we do aggregation (addition) in the mapper? Why do we need a reducer for that?
We cannot do final aggregation in a mapper because each mapper sees only the records of its own input split; values for the same key are usually spread across many splits, and therefore across many mapper instances, so no single mapper has all of them. Sorting and grouping of records by key happens only on the reducer side, after the shuffle, which is why the reducer is the first place all values for a key are available together. (A combiner can do partial aggregation on the map side, but the final result still requires a reducer.)
10. Can we process different input file directories with different input formats, like some text files and some sequence files in a single MR job?
Yes. We can implement this with the MultipleInputs.addInputPath() method in the job's driver class, giving each input path its own InputFormat and Mapper class.
For example, a Mapper1 class could handle TextInputFormat data while a Mapper2 class handles SequenceFileInputFormat data.
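A driver snippet might look like this (real Hadoop API; the paths and the Mapper1/Mapper2 class names are placeholders):

```java
MultipleInputs.addInputPath(job, new Path("/input/textdata"),
        TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, new Path("/input/seqdata"),
        SequenceFileInputFormat.class, Mapper2.class);
```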
11. What is the need for serialization in MapReduce?
There are two main reasons serialization is needed in Hadoop:
- In a Hadoop cluster, data is stored only in binary stream format; object-structured data cannot be stored directly on Hadoop data nodes.
- Only binary stream data can be transferred between data nodes in a Hadoop cluster, so serialization is needed to convert object-structured data into binary stream format.
12. How do the nodes in a Hadoop cluster communicate with each other?
Inter-process communication between nodes in a Hadoop cluster is implemented using Remote Procedure Calls (RPC). It happens in the following three stages:
- The RPC protocol uses serialization to convert the message on the source node into binary stream data.
- The binary stream data is transferred to the remote destination node.
- The destination node uses deserialization to convert the binary stream back into object-structured data, which it then reads.
13. What is Hadoop's built-in serialization framework?
Writables are Hadoop's own serialization format. They serialize data into a compact binary form and ensure fast transfer across nodes. Writables are written in Java and supported only by Java.
14. What is Writable and what are its methods in the Hadoop library?
Writable is an interface in the Hadoop library, and it provides the following two methods for serializing and deserializing data:
- write(DataOutput out) – writes the object's fields to a DataOutput binary stream.
- readFields(DataInput in) – reads the object's fields from a DataInput binary stream.
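A type implementing these two methods can be round-tripped through a binary stream. The sketch below mimics the two-method contract using only java.io, so it runs without Hadoop on the classpath (the real interface is org.apache.hadoop.io.Writable; the Point-style class and its fields are made up for illustration):

```java
import java.io.*;

// Mimics the Writable contract: write(DataOutput) / readFields(DataInput).
public class PointWritableSketch {
    int x, y;

    public void write(DataOutput out) throws IOException {
        out.writeInt(x);
        out.writeInt(y);
    }

    public void readFields(DataInput in) throws IOException {
        x = in.readInt();
        y = in.readInt();
    }

    public static void main(String[] args) throws IOException {
        PointWritableSketch p = new PointWritableSketch();
        p.x = 3; p.y = 7;

        // Serialize to a binary stream...
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));

        // ...and deserialize into a fresh object.
        PointWritableSketch q = new PointWritableSketch();
        q.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(q.x + "," + q.y); // 3,7
    }
}
```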
15. What is WritableComparable in the Hadoop library?
The WritableComparable interface is a sub-interface of both the Writable and java.lang.Comparable interfaces, so a type implementing it can be both serialized and compared. This is exactly what MapReduce requires of keys, since keys are sorted during the shuffle.
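Adding compareTo() is what makes a type usable as a sortable key. The stdlib sketch below shows only the Comparable half of the contract (the real interface is org.apache.hadoop.io.WritableComparable; write()/readFields() and the YearKey class name are illustrative and omitted or made up to keep the sketch short):

```java
import java.util.Arrays;

// Mimics the Comparable half of WritableComparable for a numeric key.
public class YearKeySketch implements Comparable<YearKeySketch> {
    final int year;
    YearKeySketch(int year) { this.year = year; }

    @Override
    public int compareTo(YearKeySketch other) {
        return Integer.compare(this.year, other.year); // ascending key order
    }

    public static void main(String[] args) {
        YearKeySketch[] keys = {
            new YearKeySketch(2020), new YearKeySketch(1999), new YearKeySketch(2007)
        };
        Arrays.sort(keys); // MapReduce sorts keys the same way during the shuffle
        for (YearKeySketch k : keys) System.out.print(k.year + " "); // 1999 2007 2020
    }
}
```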