Below are a few more Hadoop interview questions and answers for freshers and experienced Hadoop developers.
Hadoop Interview Questions and Answers
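1. What is the default block size in HDFS?
In Hadoop 2.x releases (including Hadoop 2.4.0), the default block size in HDFS is 128 MB; in Hadoop 1.x it was 64 MB. The value is controlled by the dfs.blocksize property (dfs.block.size in older releases).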
2. What is the benefit of a large block size in HDFS?
The main benefit of a large block size is that it reduces the cost of disk seeks relative to the time spent transferring data. With large blocks, the time to read a large file made of multiple blocks is dominated by the disk transfer rate rather than by seek time.
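A rough worked example may help here. It assumes a 10 ms average seek and a 100 MB/s transfer rate (illustrative figures, not measurements from any particular cluster) and shows how the seek cost shrinks as a fraction of the total read time when blocks get larger:

```python
def seek_fraction(block_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    """Fraction of the time to read one block that is spent seeking.

    seek_ms and transfer_mb_per_s are assumed, illustrative disk figures.
    """
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mb_per_s
    return seek_s / (seek_s + transfer_s)


# Compare a tiny block with two typical HDFS block sizes.
for size_mb in (1, 64, 128):
    print(f"{size_mb:>4} MB block: {seek_fraction(size_mb):.1%} of read time is seek")
```

Under these assumptions, a 1 MB block spends about half of its read time seeking, while a 128 MB block spends under one percent.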
3. What are the overheads of maintaining too large a block size?
In the MapReduce framework, each map task usually operates on one block at a time. Having too few blocks therefore results in too few map tasks, each running for a longer time with less parallelism, which slows down the overall job.
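The effect on parallelism can be sketched with simple arithmetic. The function below is a simplification that assumes exactly one map task per block (real jobs derive the count from input splits, which only default to the block size); the name num_map_tasks is hypothetical:

```python
MB = 1024 * 1024


def num_map_tasks(file_size_bytes, block_size_bytes):
    """Approximate map-task count, assuming one map task per block."""
    # Integer ceiling division: a partially filled last block still needs a task.
    return -(-file_size_bytes // block_size_bytes)


file_size = 1024 * MB  # a 1 GB input file
for block_mb in (64, 128, 1024):
    tasks = num_map_tasks(file_size, block_mb * MB)
    print(f"block size {block_mb:>4} MB -> {tasks} map task(s)")
```

With a 1 GB block size, the whole file is handled by a single map task, so nothing runs in parallel.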
4. If a 10 MB file is copied to an HDFS with a block size of 256 MB, how much storage will be allocated to the file on HDFS?
Even though the HDFS block size is 256 MB here, a file smaller than a single block does not occupy a full block's worth of storage. In this case, the file occupies just 10 MB per replica, not 256 MB.
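A minimal sketch of how a file maps onto blocks, ignoring replication and checksum metadata (the helper name block_sizes is hypothetical), shows that only the last block of a file can be shorter than the configured block size:

```python
MB = 1024 * 1024


def block_sizes(file_size_bytes, block_size_bytes):
    """Sizes of the blocks a file is split into; only the last may be short."""
    full_blocks, remainder = divmod(file_size_bytes, block_size_bytes)
    sizes = [block_size_bytes] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


small = block_sizes(10 * MB, 256 * MB)
large = block_sizes(300 * MB, 128 * MB)
print([s // MB for s in small])  # one 10 MB block, not a 256 MB one
print([s // MB for s in large])  # two full blocks plus a short final block
```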
5. What are the benefits of the block structure concept in HDFS?
The main benefit is the ability to store very large files, which can be even larger than a single disk (node), because a file is broken into blocks that are distributed across the nodes of the cluster.
Another important advantage is simpler storage management: because blocks have a fixed size, it is easy to calculate how many can be stored on a given disk.
Blocks are also a convenient unit for replication, which provides fault tolerance.
6. What happens if we upgrade to a Hadoop version whose default block size is larger than that of the current version, for example from 64 MB (Hadoop 0.20.2) to 128 MB (Hadoop 2.4.0)?
All existing files keep their 64 MB block size, but any new files copied to the upgraded Hadoop cluster are broken into 128 MB blocks.
7. What is block replication?
Block replication is a way of maintaining multiple copies of the same block on different nodes of the cluster to achieve fault tolerance. Even if one of the data nodes holding a block dies, the block data can still be read from other live data nodes that hold a copy of the same block.
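The idea can be illustrated with a toy simulation. The placement below simply puts the replicas of each block on consecutive nodes; this is an assumption made for illustration and is not HDFS's actual rack-aware placement policy. The node names are hypothetical:

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Toy placement: replicas of block i go on consecutive nodes.

    Illustrative only; not HDFS's rack-aware placement policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: {nodes[(block + r) % len(nodes)] for r in range(replication)}
        for block in range(num_blocks)
    }


def readable_blocks(placement, dead_nodes):
    """Blocks that still have at least one replica on a live node."""
    dead = set(dead_nodes)
    return {block for block, holders in placement.items() if holders - dead}


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical data node names
placement = place_replicas(num_blocks=6, nodes=nodes, replication=3)
# With three replicas per block, losing one data node loses no block.
print(sorted(readable_blocks(placement, dead_nodes=["dn2"])))
```

Rerunning the sketch with replication=1 shows the contrast: every block stored on the dead node becomes unreadable.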
8. What is the default replication factor and how is it configured?
The default replication factor in fully distributed HDFS is 3.
This can be configured cluster-wide with the dfs.replication property in the hdfs-site.xml file.
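The replication factor can also be set for an individual file with the FS shell command: hadoop fs -setrep -w N /filename
In this command, 'N' is the new replication factor for the file "/filename", and the optional -w flag makes the command wait until replication completes.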
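Both settings can be sketched in a few lines. The snippet below reads dfs.replication from a sample hdfs-site.xml document (the sample value 2 is made up for illustration) and builds the argument list for a per-file hadoop fs -setrep call. The helper names are hypothetical, and the command is only constructed here, not executed:

```python
import xml.etree.ElementTree as ET

# Sample site configuration; the value 2 is an illustrative assumption.
SAMPLE_HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>"""


def site_replication(hdfs_site_xml, default=3):
    """Return dfs.replication from the XML text, or the default of 3."""
    root = ET.fromstring(hdfs_site_xml)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == "dfs.replication":
            return int(prop.findtext("value"))
    return default


def setrep_command(path, replication, wait=False):
    """Argument list for 'hadoop fs -setrep [-w] <N> <path>'."""
    cmd = ["hadoop", "fs", "-setrep"]
    if wait:
        cmd.append("-w")  # wait until the target replication is reached
    cmd += [str(replication), path]
    return cmd


print(site_replication(SAMPLE_HDFS_SITE))
print(" ".join(setrep_command("/filename", 2, wait=True)))
```

On a machine with a Hadoop client installed, the returned list could be passed to subprocess.run to apply the change.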
9. What is HDFS distributed copy (distcp)?
distcp is a utility that launches MapReduce jobs to copy large amounts of data within an HDFS cluster or between HDFS clusters.
Its basic syntax is: hadoop distcp <source URI> <destination URI>
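As a small sketch, a distcp invocation can be assembled as an argument list before handing it to a process launcher. The name node hosts nn1 and nn2, the port and the paths are hypothetical; -m (maximum number of simultaneous copies) and -update are existing distcp options, and the helper name distcp_command is made up for illustration:

```python
def distcp_command(source_uri, dest_uri, max_maps=None, update=False):
    """Argument list for 'hadoop distcp [-update] [-m N] <source> <dest>'."""
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # copy only files missing or changed at the target
    if max_maps is not None:
        cmd += ["-m", str(max_maps)]  # cap the number of simultaneous copies
    cmd += [source_uri, dest_uri]
    return cmd


# Hypothetical cluster addresses and paths.
cmd = distcp_command(
    "hdfs://nn1:8020/source",
    "hdfs://nn2:8020/destination",
    max_maps=20,
)
print(" ".join(cmd))
```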
10. What is the use of the fsck command in HDFS?
The HDFS fsck command is useful for getting file and block details of the file system. Its syntax is: hdfs fsck <path> [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]
Below are the command options and their purposes.
|Option|Purpose|
|-move|Move corrupted files to /lost+found.|
|-delete|Delete corrupted files.|
|-openforwrite|Print out files opened for write.|
|-files|Print out files being checked.|
|-blocks|Print out block report.|
|-locations|Print out locations for every block.|
|-racks|Print out network topology for data-node locations.|
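The options above can be combined in a single call. The sketch below builds an hdfs fsck argument list and rejects any option that is not in the table; the helper name fsck_command is hypothetical, and the command is only constructed, not run:

```python
# Options taken from the table above.
FSCK_OPTIONS = {
    "-move", "-delete", "-openforwrite",
    "-files", "-blocks", "-locations", "-racks",
}


def fsck_command(path, *options):
    """Argument list for 'hdfs fsck <path> [options]'."""
    unknown = [opt for opt in options if opt not in FSCK_OPTIONS]
    if unknown:
        raise ValueError(f"unsupported fsck option(s): {unknown}")
    return ["hdfs", "fsck", path, *options]


# Check the whole file system and list files, blocks and block locations.
print(" ".join(fsck_command("/", "-files", "-blocks", "-locations")))
```

Validating the options up front keeps a typo from reaching the cluster; a destructive option such as -delete still deserves a manual review before the command is run.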