16. What is a combiner ?
A combiner is a semi-reducer in the MapReduce framework. It is an optional component and can be specified with the Job.setCombinerClass() method.
Combiner functions are suitable for producing summary information from a large data set. Hadoop makes no guarantee about how many times the combiner function is called for each map output key: it may be called zero, one, or many times. The job must therefore produce the same result in every case.
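Since the combiner may run any number of times, the operation it performs has to tolerate being applied to partial results. The following is a minimal sketch in plain Java, with no Hadoop dependency; the class and method names are hypothetical and only model the idea. It compares a sum, which survives partial aggregation, with a naive mean, which does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CombinerSafetyDemo {
    // Sum is associative and commutative, so partial sums can be summed again.
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    // Plain arithmetic mean of one list of values.
    static double mean(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    // Models "combiner = sum" applied once per chunk, followed by "reducer = sum".
    static long sumWithCombiner(List<List<Long>> chunks) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> chunk : chunks) {
            partials.add(sum(chunk));
        }
        return sum(partials);
    }

    // Models "combiner = mean" per chunk, followed by "reducer = mean".
    // This is NOT equivalent to the mean of all values when chunks differ in size.
    static double meanOfMeans(List<List<Long>> chunks) {
        double total = 0;
        for (List<Long> chunk : chunks) {
            total += mean(chunk);
        }
        return total / chunks.size();
    }

    public static void main(String[] args) {
        List<Long> all = Arrays.asList(1L, 2L, 3L);
        List<List<Long>> chunks = Arrays.asList(Arrays.asList(1L, 2L), Arrays.asList(3L));
        System.out.println(sum(all));                // 6
        System.out.println(sumWithCombiner(chunks)); // 6, same as without a combiner
        System.out.println(mean(all));               // 2.0
        System.out.println(meanOfMeans(chunks));     // 2.25, differs from the true mean
    }
}
```

A mean can still be computed with a combiner if the combiner emits (sum, count) pairs and only the reducer performs the division.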
17. What are the constraints on combiner implementation ?
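The combiner class must extend the Reducer class (or implement the Reducer interface in the older org.apache.hadoop.mapred API) and must provide an implementation of the reduce() method. Because the combiner's output is fed to the reducer, both the input and the output key-value types of the combiner's reduce() method must match the map output key-value types, which are also the reducer's input types.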
18. What are the advantages of a combiner over a reducer, or why do we need a combiner when the same reducer class is used as the combiner class ?
The main purpose of the combiner in the MapReduce framework is to limit the volume of data transferred between map and reduce tasks.
It is common practice to use the same reducer class as the combiner class. In that case, the only benefit of the combiner is to reduce the amount of data sent from the nodes running map tasks to the reduce phase.
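The effect on data volume can be illustrated with a small word-count model. This is a sketch in plain Java rather than Hadoop code: the class name, the SAMPLE data, and the helper methods are all hypothetical, and each inner list stands for the keys emitted by one map task.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShuffleVolumeDemo {
    // Hypothetical map output: each inner list holds the keys emitted by one map task,
    // where every key stands for a (word, 1) record.
    static final List<List<String>> SAMPLE = Arrays.asList(
            Arrays.asList("a", "b", "a", "a"),
            Arrays.asList("b", "b", "c"));

    // Combine step: local aggregation of one map task's output.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : mapOutputKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Records sent to the reducers when no combiner runs.
    static int recordsWithoutCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += output.size();
        }
        return records;
    }

    // Records sent to the reducers when the combiner runs once per map task.
    static int recordsWithCombiner(List<List<String>> mapOutputs) {
        int records = 0;
        for (List<String> output : mapOutputs) {
            records += combine(output).size();
        }
        return records;
    }

    // Reduce step: merges the combined partial counts into the final totals.
    static Map<String, Integer> finalCounts(List<List<String>> mapOutputs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (List<String> output : mapOutputs) {
            for (Map.Entry<String, Integer> e : combine(output).entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(recordsWithoutCombiner(SAMPLE)); // 7
        System.out.println(recordsWithCombiner(SAMPLE));    // 4
        System.out.println(finalCounts(SAMPLE));            // {a=3, b=3, c=1}
    }
}
```

The final counts are the same either way; only the number of records crossing the network changes.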
19. What are the primitive data types in Hadoop ?
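Below is the list of primitive Writable data types available in Hadoop.
- BooleanWritable
- ByteWritable
- ShortWritable
- IntWritable
- VIntWritable
- FloatWritable
- LongWritable
- VLongWritable
- DoubleWritable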
20. What is NullWritable and how does it differ from other Writable data types ?
NullWritable is a special type of Writable representing a null value. No bytes are read or written when a data type is specified as NullWritable. So, in MapReduce, a key or a value can be declared as NullWritable when we don’t need to use that field. It is an immutable singleton, and its instance is obtained with NullWritable.get().
21. What is Text data type in Hadoop and what are the differences from String data type in Java ?
Text is a Writable data type for serialization and de-serialization of string data in Hadoop. It can be treated as a wrapper class for java.lang.String. Java Strings are immutable, whereas Hadoop’s Text Writable is mutable. Text also stores its data in UTF-8 encoding.
22. What are the uses of GenericWritable class ?
One use of GenericWritable is to wrap value instances that belong to different data types.
When a mapper needs to produce multiple value types, the reducer must still receive them as a single data type, because Hadoop reducers do not allow multiple input value types.
In these scenarios a subclass of GenericWritable can be used.
23. How to create multiple value type output from Mapper with IntWritable and Text Writable ?
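Write a class extending the org.apache.hadoop.io.GenericWritable class. Implement the getTypes() method to return an array of the Writable classes to be wrapped, here IntWritable.class and Text.class. Then use this subclass as the map output value type and wrap each IntWritable or Text value in it before emitting.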
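The idea behind such a wrapper can be modelled without Hadoop. The sketch below is a simplified stand-in in plain Java, not the GenericWritable API: TaggedValue and its methods are hypothetical names. It assumes the same approach GenericWritable takes, namely writing a one-byte index into the list of allowed types followed by the wrapped value, so that the reader knows which type to deserialize.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TaggedValue {
    // Plays the role of the array returned by getTypes(): index 0 = Integer, index 1 = String.
    private static final Class<?>[] TYPES = { Integer.class, String.class };

    private final Object value;

    TaggedValue(Object value) {
        if (typeIndex(value) < 0) {
            throw new IllegalArgumentException("unsupported value type");
        }
        this.value = value;
    }

    Object get() {
        return value;
    }

    // Position of the value's class in TYPES, or -1 if the type is not allowed.
    private static int typeIndex(Object value) {
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i].isInstance(value)) {
                return i;
            }
        }
        return -1;
    }

    // Serializes as: one tag byte (the type index), then the payload.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(typeIndex(value));
            if (value instanceof Integer) {
                out.writeInt((Integer) value);
            } else {
                out.writeUTF((String) value);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the tag byte first, then decodes the payload as the matching type.
    static TaggedValue fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            int tag = in.readByte();
            if (tag == 0) {
                return new TaggedValue(Integer.valueOf(in.readInt()));
            }
            if (tag == 1) {
                return new TaggedValue(in.readUTF());
            }
            throw new IllegalArgumentException("unknown type tag " + tag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On the receiving side, the reducer equivalent would call get() and branch on the runtime type of the result.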
24. What is ObjectWritable data type in Hadoop ?
This is a general-purpose generic object wrapper which can be used to achieve the same objective as GenericWritable. The org.apache.hadoop.io.ObjectWritable class can handle Java primitive types, strings, and arrays without the need for a Writable wrapper.
25. How do we create Writable arrays in Hadoop ?
Hadoop provides two Writable data types for arrays: ArrayWritable for one-dimensional arrays and TwoDArrayWritable for two-dimensional arrays.
The elements of these arrays must be other Writable objects such as IntWritable or LongWritable, not Java native data types such as int or float. For example, an array of IntWritables can be implemented as a subclass of ArrayWritable whose constructor passes IntWritable.class to the superclass constructor.
26. What are the MapWritable data types available in Hadoop ?
Hadoop provides the following map Writable data types. MapWritable and SortedMapWritable implement the java.util.Map interface.
- AbstractMapWritable – This is the abstract base class for the other map Writable classes.
- MapWritable – This is a general-purpose map from Writable keys to Writable values.
- SortedMapWritable – This is a sorted counterpart of MapWritable that also implements the SortedMap interface; its keys must be WritableComparable.
27. What is speculative execution in Mapreduce ?
Speculative execution is a mechanism for running multiple copies of the same map or reduce task on different slave nodes to cope with slow individual machines.
In large clusters of hundreds of machines, some machines may not perform as fast as others, and a single slow machine can delay the whole job. To avoid this, Hadoop can launch a duplicate copy of a slow-running task on another node. The result from the first copy to finish is used.
Hadoop then tells the Task Trackers running the other copies to abandon those tasks and discard their outputs.
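The "first copy to finish wins" behaviour has a close analogue in the JDK. The sketch below is only an illustration in plain Java, not how Hadoop schedules tasks: SpeculativeDemo, runSpeculatively, and the delay parameters are hypothetical, and thread sleeps stand in for fast and slow nodes. ExecutorService.invokeAny returns the result of one task that completed successfully and cancels the rest.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class SpeculativeDemo {
    // Starts one attempt of the same task per delay value and returns the result
    // of the first attempt that completes. At least one delay must be given.
    static int runSpeculatively(int input, long... delaysMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(delaysMillis.length);
        try {
            List<Callable<Integer>> attempts = new ArrayList<>();
            for (long delay : delaysMillis) {
                attempts.add(() -> {
                    Thread.sleep(delay);   // simulates a fast or slow machine
                    return input * input;  // every attempt computes the same output
                });
            }
            // Returns one successful result; attempts still running are cancelled.
            return pool.invokeAny(attempts);
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow(); // abandon any remaining attempts
        }
    }

    public static void main(String[] args) {
        // One slow attempt (500 ms) and one fast attempt (10 ms) of the same task.
        System.out.println(runSpeculatively(7, 500L, 10L)); // 49
    }
}
```

As with speculative tasks, this only works because every attempt is deterministic and produces the same output, so it does not matter which copy wins.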
28. What will happen if we run a MapReduce job with an output directory that already exists ?
The job will fail with org.apache.hadoop.mapred.FileAlreadyExistsException. In this case, delete the output directory, or choose a different output path, and re-execute the job.
29. What are the naming conventions for output files from Map phase and Reduce Phase ?
Output files from the map phase are named part-m-xxxxx and output files from the reduce phase are named part-r-xxxxx. The part-m files appear in the job output directory only for map-only jobs, because otherwise map output is intermediate data. Each individual task writes its own part file. Here xxxxx is the partition number, starting from 00000 and increasing sequentially by 1, giving 00001, 00002 and so on.
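The naming pattern can be reproduced with a single format string. This is a small sketch in plain Java; PartFileNames and partFileName are hypothetical helper names, not part of the Hadoop API.

```java
import java.util.Locale;

final class PartFileNames {
    // Builds a name such as part-r-00003: "m" for map output, "r" for reduce output,
    // followed by the partition number zero-padded to five digits.
    static String partFileName(boolean reduceTask, int partition) {
        return String.format(Locale.ROOT, "part-%s-%05d", reduceTask ? "r" : "m", partition);
    }

    public static void main(String[] args) {
        System.out.println(partFileName(true, 0));   // part-r-00000
        System.out.println(partFileName(false, 12)); // part-m-00012
    }
}
```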
30. Where is the output from Map tasks stored ?
The mapper’s output (intermediate data) is stored on the local file system (not on HDFS) of each individual mapper node. This is typically a temporary directory location which can be set with the mapreduce.cluster.local.dir configuration property. The intermediate data is deleted after the Hadoop job completes.