Below are a few more Hadoop interview questions and answers for both freshers and experienced Hadoop developers and administrators.
Hadoop Interview questions and answers
1. What is a Backup Node?
A Backup Node is an extended Checkpoint node that performs checkpointing and also supports online streaming of file system edits.
It maintains an up-to-date, in-memory copy of the file system namespace: it accepts a real-time stream of file system edits from the NameNode and applies them to its own copy of the namespace in main memory.
As a result, it always holds a current backup of the file system namespace.
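The idea of applying a stream of edits to an in-memory namespace can be illustrated with a small sketch. This is a toy model, not Hadoop code: the `Namespace` class, the edit tuples, and the operation names (`mkdir`, `create`, `delete`) are all assumptions made up for the illustration.

```python
class Namespace:
    """Toy in-memory file system namespace (hypothetical, not Hadoop's)."""

    def __init__(self):
        self.paths = {"/"}      # set of known paths
        self.last_txid = 0      # id of the last edit applied

    def apply(self, edit):
        """Apply one edit of the form (txid, op, path)."""
        txid, op, path = edit
        if txid != self.last_txid + 1:
            raise ValueError(
                f"gap in edit stream: expected {self.last_txid + 1}, got {txid}"
            )
        if op in ("mkdir", "create"):
            self.paths.add(path)
        elif op == "delete":
            self.paths.discard(path)
        else:
            raise ValueError(f"unknown op: {op}")
        self.last_txid = txid


# The "NameNode" and the "Backup Node" each hold their own copy.
active = Namespace()
backup = Namespace()

edit_stream = [
    (1, "mkdir", "/data"),
    (2, "create", "/data/a.txt"),
    (3, "create", "/data/b.txt"),
    (4, "delete", "/data/a.txt"),
]

for edit in edit_stream:
    active.apply(edit)   # the NameNode changes its namespace
    backup.apply(edit)   # the same edit is streamed to the backup copy

print(sorted(backup.paths))            # ['/', '/data', '/data/b.txt']
print(backup.paths == active.paths)    # True
```

Because every edit is applied as it arrives, the backup copy never lags behind by more than the edits still in flight.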
2. What are the differences between a Backup Node and a Checkpoint node or Secondary NameNode?
- Multiple Checkpoint nodes can be registered with the NameNode, but only a single Backup Node can be registered with the NameNode at any point in time.
- To create a checkpoint, a Checkpoint node or Secondary NameNode has to download the fsimage and edits files from the active NameNode, apply the edits to the fsimage, and save a copy of the new fsimage as the checkpoint.
A Backup Node does not need to download the fsimage and edits files from the active NameNode, because it already holds an up-to-date copy of the namespace in main memory and receives the online stream of edits from the NameNode. It only has to apply these edits to its in-memory namespace and save a copy to its local file system.
Checkpoint creation on a Backup Node is therefore faster than on a Checkpoint node or Secondary NameNode.
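- The Checkpoint node and the Secondary NameNode do essentially the same job. The Checkpoint node uploads the new fsimage back to the active NameNode after creating a checkpoint. The Secondary NameNode is often described as only storing the new fsimage in its local file system, but in current Hadoop releases it also transfers the merged image back to the NameNode.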
- A Backup Node provides the option of running the NameNode with no persistent storage, whereas a Checkpoint node or Secondary NameNode offers no such option.
- If the NameNode fails, a Checkpoint node or Secondary NameNode will have lost at least the edits made since the last checkpoint, because of the time gap between two checkpoints.
A Backup Node keeps its namespace in sync with the NameNode at all times, so such a loss is not inevitable.
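The checkpoint step described above (load the fsimage, replay the edits, save a new fsimage) can be sketched as follows. The JSON file layout, the file names, and the `create_checkpoint` helper are assumptions chosen for the illustration; the real fsimage and edits files use Hadoop's own formats.

```python
import json
import os
import tempfile


def create_checkpoint(fsimage_path, edits, new_fsimage_path):
    """Merge a saved namespace image with a list of edits (toy model).

    fsimage_path: JSON file holding {"last_txid": int, "paths": [...]}
    edits: list of (txid, op, path) tuples
    Returns the transaction id covered by the new image.
    """
    with open(fsimage_path) as f:
        image = json.load(f)
    paths = set(image["paths"])
    last_txid = image["last_txid"]

    for txid, op, path in sorted(edits):
        if txid <= last_txid:
            continue                      # already contained in the image
        if op == "create":
            paths.add(path)
        elif op == "delete":
            paths.discard(path)
        last_txid = txid

    with open(new_fsimage_path, "w") as f:
        json.dump({"last_txid": last_txid, "paths": sorted(paths)}, f)
    return last_txid


workdir = tempfile.mkdtemp()
old_image = os.path.join(workdir, "fsimage_2")
new_image = os.path.join(workdir, "fsimage_4")

with open(old_image, "w") as f:
    json.dump({"last_txid": 2, "paths": ["/", "/a"]}, f)

edits = [(3, "create", "/b"), (4, "delete", "/a")]
create_checkpoint(old_image, edits, new_image)

with open(new_image) as f:
    print(json.load(f))   # {'last_txid': 4, 'paths': ['/', '/b']}
```

In this model, a Checkpoint node would first have to fetch both inputs over the network, while a Backup Node already has the equivalent of `paths` in memory and only needs the final save.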
3. What is Safe Mode in HDFS?
Safe Mode is a maintenance state of the NameNode during which it does not allow any changes to the file system.
During Safe Mode, the HDFS cluster is read-only and the NameNode does not replicate or delete blocks.
The NameNode automatically enters Safe Mode at startup and stays in it while the DataNodes report their blocks; once a configured fraction of blocks meets the minimum replication requirement, it leaves Safe Mode on its own.
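The startup behaviour can be pictured as a simple threshold check. The function below is a toy model with made-up names, not NameNode code; the only Hadoop detail it borrows is the idea of a threshold fraction, which Hadoop exposes as `dfs.namenode.safemode.threshold-pct` (0.999 by default in recent releases).

```python
def in_safe_mode(reported_replicas, min_replication=1, threshold=0.999):
    """Return True while the toy NameNode should stay in safe mode.

    reported_replicas: dict mapping block id -> number of replicas
                       reported so far by DataNodes
    """
    total = len(reported_replicas)
    if total == 0:
        return False                      # empty file system: nothing to wait for
    safe = sum(1 for n in reported_replicas.values() if n >= min_replication)
    return safe / total < threshold


# At startup no DataNode has reported yet.
blocks = {"blk_1": 0, "blk_2": 0, "blk_3": 0, "blk_4": 0}
print(in_safe_mode(blocks))    # True

# DataNodes send their block reports one by one.
blocks.update({"blk_1": 3, "blk_2": 2, "blk_3": 1})
print(in_safe_mode(blocks))    # True  (3 of 4 blocks is below the threshold)

blocks["blk_4"] = 1
print(in_safe_mode(blocks))    # False (all blocks meet minimum replication)
```

On a real cluster, an administrator can query or change the state by hand with `hdfs dfsadmin -safemode get`, `hdfs dfsadmin -safemode enter`, and `hdfs dfsadmin -safemode leave`.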
4. What is Data Locality in HDFS?
One of the HDFS design ideas is that “moving computation is cheaper than moving data”.
When data sets are huge, running an application on the nodes where the data actually resides is more efficient than moving the data to the nodes where the application is running.
This concept of moving the application to the data is called Data Locality.
It reduces network traffic and speeds up data processing, and because the data does not have to travel across the network, it is not exposed to transfer failures along the way.
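A scheduler that honours data locality can be sketched in a few lines. This is a simplified illustration under assumed names (`pick_node`, the node labels); real Hadoop schedulers weigh more factors than this.

```python
def pick_node(block_locations, free_nodes):
    """Choose where to run a task that reads one block (toy model).

    block_locations: nodes that hold a replica of the block
    free_nodes: nodes that currently have capacity to run a task
    Returns (node, is_local).
    """
    for node in block_locations:
        if node in free_nodes:
            return node, True             # run where the data already is
    if not free_nodes:
        raise RuntimeError("no free node available")
    return sorted(free_nodes)[0], False   # fall back to a remote read


replicas = ["node3", "node7", "node9"]

print(pick_node(replicas, {"node1", "node7"}))   # ('node7', True)
print(pick_node(replicas, {"node1", "node2"}))   # ('node1', False)
```

Only in the second call does the block have to cross the network, which is the case data locality tries to avoid.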
5. What is a rack?
A rack is a physical collection of DataNodes that are housed together at a single location. The DataNodes of a cluster can be spread over racks in different places, and there can be multiple racks in a single location.
6. What is Rack Awareness?
Rack Awareness means that the NameNode maintains the rack ID of every DataNode and uses these rack IDs to choose the closest DataNodes for HDFS file read and write requests.
Choosing the closest DataNodes for read and write requests through the rack awareness policy minimizes the write cost and maximizes read speed.
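"Closest" can be made concrete with a distance function over rack IDs. The sketch below assumes nodes are named by `/rack/node` paths and uses the values 0, 2, and 4 for same node, same rack, and different racks, a convention commonly described for Hadoop's network topology; the function names are hypothetical.

```python
def distance(a, b):
    """Toy network distance between two nodes given as '/rack/node' paths.

    0 = same node, 2 = same rack, 4 = different racks.
    """
    rack_a, _, node_a = a.rpartition("/")
    rack_b, _, node_b = b.rpartition("/")
    if rack_a == rack_b:
        return 0 if node_a == node_b else 2
    return 4


def closest_replica(reader, replicas):
    """Pick the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


reader = "/rack1/node2"
replicas = ["/rack2/node5", "/rack1/node4", "/rack3/node8"]

print(distance(reader, "/rack1/node2"))   # 0
print(distance(reader, "/rack1/node4"))   # 2
print(distance(reader, "/rack2/node5"))   # 4
print(closest_replica(reader, replicas))  # /rack1/node4
```

Without rack IDs, every remote DataNode would look equally far away and the replica on the reader's own rack could not be preferred.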
7. How do file deletes and undeletes work in HDFS?
When trash is enabled and a file is deleted from HDFS, the file is not removed immediately; HDFS first moves it into a trash directory (/trash in the original design, a per-user /user/<username>/.Trash directory in current releases). After the configured time interval, the NameNode deletes the file from the trash directory. Deleting the file releases the blocks associated with it.
The time for which a file remains in the trash directory is configured, in minutes, with the fs.trash.interval property in core-site.xml.
As long as a file remains in the trash directory, it can be undeleted by moving it from there to the required location in HDFS. The default trash interval is 0, which disables trash, so by default HDFS deletes files without storing them in trash.
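The delete, undelete, and expiry rules can be modelled in a few lines. The `ToyTrash` class and its method names are assumptions made for this illustration and do not correspond to a Hadoop API; the only real name is the `fs.trash.interval` setting it imitates.

```python
class ToyTrash:
    """Toy model of delete / undelete with a trash interval (in minutes)."""

    def __init__(self, trash_interval_minutes):
        self.interval = trash_interval_minutes
        self.files = set()
        self.trash = {}                    # path -> minute at which it was deleted

    def delete(self, path, now):
        self.files.remove(path)
        if self.interval > 0:              # 0 means trash is disabled
            self.trash[path] = now

    def undelete(self, path):
        if path not in self.trash:
            raise FileNotFoundError(path)
        del self.trash[path]
        self.files.add(path)

    def expunge(self, now):
        """Permanently drop trash entries older than the interval."""
        expired = [p for p, t in self.trash.items() if now - t >= self.interval]
        for p in expired:
            del self.trash[p]


fs = ToyTrash(trash_interval_minutes=60)
fs.files.update({"/a", "/b"})

fs.delete("/a", now=0)
fs.delete("/b", now=30)
fs.undelete("/a")            # still in trash, so it can be restored
fs.expunge(now=95)           # "/b" has been in trash for 65 minutes

print(sorted(fs.files))      # ['/a']
print(fs.trash)              # {}
```

With `trash_interval_minutes=0` the `delete` call skips the trash entirely, which mirrors the default behaviour described above.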
8. What is a Rebalancer in HDFS?
The Rebalancer is an administration tool in HDFS that balances the distribution of blocks uniformly across all the DataNodes in the cluster.
Rebalancing is done on demand only; it is not triggered automatically.
The HDFS administrator issues the command when the cluster needs to be balanced.
When the Rebalancer is triggered, the entire list of DataNodes is scanned, and:
- If an under-utilized DataNode is found, blocks are moved to it from over-utilized or not-under-utilized DataNodes.
- If an over-utilized DataNode is found, blocks are moved from it to under-utilized or not-over-utilized DataNodes.
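How nodes end up labelled over- or under-utilized can be sketched by comparing each node's utilization with the cluster average. The `classify` helper and the sample numbers are assumptions for illustration; the 10 percent default mirrors the threshold accepted by the `hdfs balancer -threshold` option.

```python
def classify(nodes, threshold=10.0):
    """Split nodes into over- and under-utilized groups (toy model).

    nodes: dict mapping node name -> (used, capacity) in the same unit
    threshold: allowed deviation from the cluster average, in percent
    """
    total_used = sum(u for u, _ in nodes.values())
    total_cap = sum(c for _, c in nodes.values())
    average = 100.0 * total_used / total_cap

    over, under = [], []
    for name, (used, cap) in sorted(nodes.items()):
        utilization = 100.0 * used / cap
        if utilization > average + threshold:
            over.append(name)
        elif utilization < average - threshold:
            under.append(name)
    return average, over, under


cluster = {
    "dn1": (90, 100),
    "dn2": (60, 100),
    "dn3": (55, 100),
    "dn4": (5, 100),    # newly added, almost empty
}

average, over, under = classify(cluster)
print(average)   # 52.5
print(over)      # ['dn1']
print(under)     # ['dn4']
```

Nodes such as `dn2` and `dn3` fall within the threshold band; they correspond to the not-over-utilized and not-under-utilized nodes that can still give or receive blocks.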
9. What is the need for a Rebalancer in HDFS?
Whenever a new DataNode is added to an existing HDFS cluster, or a DataNode is removed from it, some DataNodes end up holding more or fewer blocks than others.
In such an unbalanced cluster, read and write requests concentrate on some DataNodes while others are under-utilized.
In these cases the Hadoop administrator runs the Rebalancer so that space is used uniformly across all the DataNodes.