Below are a few more Hadoop interview questions and answers.
Please refer to the previous posts on this topic for additional questions on HDFS.
Hadoop Interview Questions and Answers
1. What is HDFS?
HDFS is the distributed file system of the Hadoop framework. It is a block-structured file system designed to store vast amounts of data on low-cost commodity hardware while providing high-throughput access to that data.
HDFS stores files across multiple machines and maintains reliability and fault tolerance. It supports parallel processing of data by the MapReduce framework.
2. What are the objectives of HDFS file system?
- Easily store large amounts of data across multiple machines.
- Provide data reliability and fault tolerance by maintaining multiple copies of each block of a file.
- Move computation to the data instead of moving data to the computation server, i.e. process data locally.
- Support parallel processing of data by the MapReduce framework.
3. What are the limitations of HDFS?
- HDFS supports reads, writes, appends and deletes efficiently, but it does not support in-place updates to existing file content.
- HDFS is not suitable for a large number of small files; it works best with large files. The file system namespace is held in the NameNode's main memory, so its size is limited by that memory, and a large number of files also results in a large fsimage file.
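To make the small-files limitation concrete, here is a rough back-of-the-envelope sketch. It assumes a commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure, the function name and the file counts are illustrative assumptions, not values from this post.

```python
# ASSUMPTION: ~150 bytes of NameNode heap per namespace object (file or block).
# This is a widely quoted approximation, not an exact or guaranteed figure.
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(num_files: int, blocks_per_file: int) -> int:
    """Estimate heap use as one object per file plus one object per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


# Roughly the same 1 TB of data, stored two ways (hypothetical numbers).
small_files = namenode_heap_bytes(num_files=10_000_000, blocks_per_file=1)  # ~100 KB each
large_files = namenode_heap_bytes(num_files=8_192, blocks_per_file=1)       # 128 MB each

print(f"small files: {small_files / 1e9:.2f} GB of heap")
print(f"large files: {large_files / 1e6:.2f} MB of heap")
```

The data volume is about the same in both cases, but the namespace for the small-file layout is larger by roughly three orders of magnitude.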
4. What is a block in HDFS and what is its size?
It is a fixed-size chunk of data, 128 MB by default in Hadoop 2.x and later (configurable through dfs.blocksize). It is the unit in which HDFS stores and replicates file data; a file smaller than the block size does not occupy a full block on disk.
HDFS files are broken into these fixed-size chunks, which are distributed across multiple machines in a cluster.
Thus, blocks are the building bricks of an HDFS file. Each block is stored in 3 copies by default, as set by the replication factor (dfs.replication) in the Hadoop configuration, to provide data redundancy and fault tolerance.
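The block arithmetic can be sketched in a few lines. This is a simplified illustration assuming a 128 MB block size and a replication factor of 3; the function name block_layout is made up for this example.

```python
MB = 1024 * 1024


def block_layout(file_size: int, block_size: int = 128 * MB, replication: int = 3):
    """Return (block_count, last_block_bytes, raw_bytes_stored) for one file."""
    if file_size == 0:
        return 0, 0, 0
    # Integer ceiling division: a partial last block still counts as a block.
    block_count = (file_size + block_size - 1) // block_size
    last_block = file_size - (block_count - 1) * block_size
    # Every byte is stored once per replica; the last block is not padded.
    raw = file_size * replication
    return block_count, last_block, raw


blocks, last, raw = block_layout(300 * MB)
print(blocks, last // MB, raw // MB)  # 3 44 900
```

A 300 MB file therefore becomes two full 128 MB blocks plus one 44 MB block, and occupies about 900 MB of raw disk across the cluster.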
5. What are the core components of an HDFS cluster?
- NameNode
- Secondary NameNode
- DataNodes
- Checkpoint Node
- Backup Node
6. What is a NameNode?
The NameNode is a dedicated machine in an HDFS cluster that acts as the master server: it maintains the file system namespace in its main memory and serves file access requests from clients. The namespace is persisted on disk as the fsimage and edits files. The fsimage contains file names, file permissions and the list of blocks that make up each file; the DataNodes holding each block are not persisted there but are learned from DataNode block reports.
In the default HDFS architecture there is only one active NameNode.
7. What is a DataNode?
DataNodes are the slave nodes of the HDFS architecture. They store the blocks of HDFS files, send periodic heartbeat messages to the NameNode, and tell it which blocks they hold through block reports.
DataNodes serve read and write requests from clients on HDFS files and also perform block creation, replication and deletion.
8. Is the NameNode machine the same as a DataNode machine in terms of hardware?
It depends on the cluster being built. The Hadoop daemons can run on the same machine or on different machines. For instance, a single-node cluster has only one machine, whereas in a development or testing environment the NameNode and DataNodes run on different machines. When they are separate, the NameNode mainly needs enough main memory to hold the namespace, while DataNodes mainly need disk capacity for the blocks.
9. What is a Secondary NameNode?
The Secondary NameNode is a helper to the primary NameNode. It is a dedicated node in the HDFS cluster whose main function is to take checkpoints of the file system metadata held by the NameNode. It is not a backup NameNode and does not take over if the primary NameNode fails; it only checkpoints the NameNode's file system namespace.
For in-depth details, please refer to the post here.
10. What is a Checkpoint Node?
It is an enhanced Secondary NameNode whose main function is to take periodic checkpoints of the NameNode's file system metadata, and it replaces the role of the Secondary NameNode. It downloads the fsimage and edits files from the active NameNode, merges them locally, and uploads the resulting fsimage back to the active NameNode.
For in-depth details, please refer to the post here.
11. What is a checkpoint?
During the checkpointing process, the fsimage file is merged with the edits log file and a new fsimage file is created; this new fsimage is usually called a checkpoint.
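The merge can be pictured as replaying a log over a snapshot. The sketch below is a toy model only: the operation names, the tuple layout of an edit record and the metadata dictionaries are invented for illustration and do not reflect the real on-disk formats.

```python
def apply_edits(fsimage: dict, edits: list) -> dict:
    """Replay edit-log records over an fsimage snapshot and return a new snapshot."""
    namespace = dict(fsimage)  # work on a copy; the old fsimage stays untouched
    for op, path, value in edits:
        if op == "create":
            namespace[path] = value                  # value = file metadata
        elif op == "delete":
            namespace.pop(path, None)
        elif op == "rename":
            namespace[value] = namespace.pop(path)   # value = new path
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return namespace


fsimage = {"/logs/a.txt": {"blocks": ["blk_1"]}}
edits = [
    ("create", "/logs/b.txt", {"blocks": ["blk_2"]}),
    ("rename", "/logs/a.txt", "/archive/a.txt"),
    ("delete", "/logs/b.txt", None),
]

checkpoint = apply_edits(fsimage, edits)
print(checkpoint)  # {'/archive/a.txt': {'blocks': ['blk_1']}}
```

Once the checkpoint exists, the edits that went into it no longer need to be replayed, which keeps the edits log short.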
12. What is a daemon?
A daemon is a process or service that runs in the background. The term is generally used in UNIX environments. Hadoop and YARN daemons are Java processes, which can be listed with the jps command.
13. What is a heartbeat in HDFS?
A heartbeat is a periodic signal that a node sends to show that it is alive. A DataNode sends heartbeats to the NameNode, and (in Hadoop 1) a TaskTracker sends heartbeats to the JobTracker. If the NameNode or JobTracker stops receiving heartbeats, it concludes that the DataNode has a problem or that the TaskTracker is unable to perform its assigned tasks.
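A minimal sketch of heartbeat-based failure detection follows. The class name, node ids and the 30-second timeout are hypothetical, and timestamps are passed in explicitly so the example is deterministic; it is not how the NameNode is actually implemented.

```python
class HeartbeatMonitor:
    """Toy master-side tracker: a node is considered dead after a silent period."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node id -> time of the last heartbeat

    def heartbeat(self, node_id: str, now: float) -> None:
        self.last_seen[node_id] = now

    def dead_nodes(self, now: float) -> list:
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30)  # illustrative timeout
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=25)          # dn1 keeps reporting, dn2 goes silent

print(monitor.dead_nodes(now=40))         # ['dn2']
```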
14. Do the NameNode and the ResourceManager run on the same host?
No. In a production environment, the NameNode and the ResourceManager typically run on separate hosts.
15. What is the communication mode between namenode and datanode?
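The NameNode and DataNodes communicate over TCP/IP using Hadoop's RPC mechanism; the DataNodes initiate the calls (for example heartbeats and block reports) and the NameNode replies. SSH is used only by the cluster start and stop scripts to launch daemons on remote hosts, not for NameNode-DataNode communication.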
16. If we want to copy 20 blocks from one machine to another, but the other machine has room for only 18.5 blocks, can a block be split at replication time?
In HDFS, a block cannot be split during replication. Before blocks are copied from one machine to another, the master node (NameNode) works out how much space is required, how many blocks are in use and how much space is available, and allocates whole blocks accordingly.
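The "whole blocks only" rule can be illustrated with a toy allocator. It assumes all blocks are full-size and simply fills nodes in order; the function name, node names and free-space figures are made up, and real HDFS block placement is considerably more involved.

```python
MB = 1024 * 1024


def place_blocks(num_blocks: int, free_by_node: dict, block_size: int) -> dict:
    """Assign whole blocks to nodes; leftover space smaller than a block is unused."""
    placement = {}
    remaining = num_blocks
    for node, free_bytes in free_by_node.items():
        take = min(remaining, free_bytes // block_size)  # whole blocks only
        if take:
            placement[node] = take
        remaining -= take
    if remaining:
        raise RuntimeError(f"no space for {remaining} block(s)")
    return placement


block_size = 128 * MB
free = {
    "dn_b": 2368 * MB,       # room for 18.5 blocks
    "dn_c": 10 * block_size,
}

print(place_blocks(20, free, block_size))  # {'dn_b': 18, 'dn_c': 2}
```

The node with room for 18.5 blocks receives 18, and the remaining 2 blocks go elsewhere rather than being split.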
17. How is indexing done in HDFS?
Hadoop has its own way of indexing. Files are split according to the block size, and the NameNode's metadata records, for each file, the ordered list of its blocks and which DataNodes hold each block. A client locates any part of a file by asking the NameNode for the blocks that cover that range. This mapping is the basis of HDFS.
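One way to picture this lookup is as two maps: file to ordered block list, and block to the DataNodes holding a replica. The sketch below is a toy model with invented paths, block ids and node names; it is not the actual NameNode data structure.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB

# Hypothetical namespace contents.
file_to_blocks = {"/data/events.log": ["blk_1", "blk_2", "blk_3"]}
block_to_datanodes = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn4"],
    "blk_3": ["dn1", "dn3", "dn4"],
}


def locate(path: str, offset: int):
    """Return the block holding byte `offset` of `path` and the nodes storing it."""
    blocks = file_to_blocks[path]
    index = offset // BLOCK_SIZE  # fixed-size blocks make this a simple division
    block_id = blocks[index]
    return block_id, block_to_datanodes[block_id]


print(locate("/data/events.log", 200 * MB))  # ('blk_2', ['dn2', 'dn3', 'dn4'])
```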
18. If a DataNode is full, how is it identified?
When data is stored on a DataNode, the metadata for that data is kept on the NameNode. DataNodes also report their capacity and remaining space to the NameNode along with their heartbeats, so the NameNode can identify when a DataNode is full.
19. If the number of DataNodes increases, do we need to upgrade the NameNode?
When the Hadoop system is installed, the NameNode is sized based on the size of the cluster. Most of the time we do not need to upgrade the NameNode, because it stores only the metadata and not the actual data, so such a requirement rarely arises.
20. Why is reading done in parallel but writing is not in HDFS?
Reads are done in parallel because that gives fast access to the data. Writes to a single file are not done in parallel because data written by one node could be overwritten by another.
For example, if two nodes tried to write to the same file in parallel, the first node would not know what the second had written and vice versa, so it would be unclear which data should be stored and accessed. HDFS therefore allows only one writer per file at a time.
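The single-writer rule can be sketched as a lease table: a client must hold the lease on a file before writing, and only one client can hold it at a time. HDFS does use leases for this purpose, but the class below is a toy with hypothetical names, not the real implementation.

```python
class ToyLeaseTable:
    """Toy single-writer guard: at most one client holds the write lease per file."""

    def __init__(self):
        self.holders = {}  # path -> client currently allowed to write

    def acquire(self, path: str, client: str) -> bool:
        # setdefault records the first requester; later requesters see that holder.
        holder = self.holders.setdefault(path, client)
        return holder == client

    def release(self, path: str, client: str) -> None:
        if self.holders.get(path) == client:
            del self.holders[path]


leases = ToyLeaseTable()
print(leases.acquire("/data/out.txt", "client_a"))  # True: first writer gets the lease
print(leases.acquire("/data/out.txt", "client_b"))  # False: file already has a writer
leases.release("/data/out.txt", "client_a")
print(leases.acquire("/data/out.txt", "client_b"))  # True: lease is free again
```

Readers need no such lease, which is why any number of clients can read the same file at once.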
21. What are the fsimage and the edit log in Hadoop?
The fsimage is a file maintained by the NameNode that contains file names, file permissions and the list of blocks belonging to each file; it serves as the index of files in HDFS. We can think of it as metadata about HDFS files. The fsimage file contains a serialized form of all the directory and file inodes in the file system. The DataNodes that hold each block are not stored in it; the NameNode learns them from DataNode block reports.
The EditLog is a transaction log that records every change made to the file system metadata.
Note: Whenever the NameNode is restarted, the latest state of the FsImage is rebuilt by applying the edits records to the last saved copy of the FsImage.