Hadoop Interview Questions and Answers Part – 3

10. When is a cluster in a balanced state?

A cluster is balanced when the percentage of space used on each data node is within the threshold of the average percentage of space used across the data nodes. In other words:

The percentage of space used on a data node should not be less than (average % of space used on data nodes - threshold).

The percentage of space used on a data node should not be greater than (average % of space used on data nodes + threshold).

Here the threshold is a configurable percentage (the -threshold option of the HDFS balancer), which is 10% by default.
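The balance condition above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this answer, not the balancer's actual code: the function names, the node names and the threshold value passed in are all made up for the example, and the average is taken as a plain mean of the per-node percentages.

```python
def average_used(used_pct):
    """Plain mean of the per-node used-space percentages."""
    return sum(used_pct.values()) / len(used_pct)


def is_balanced(used_pct, threshold):
    """True when every node is within +/- threshold of the average."""
    avg = average_used(used_pct)
    return all(abs(pct - avg) <= threshold for pct in used_pct.values())


def unbalanced_nodes(used_pct, threshold):
    """Map each out-of-range node to 'over' or 'under' utilized."""
    avg = average_used(used_pct)
    result = {}
    for node, pct in used_pct.items():
        if pct > avg + threshold:
            result[node] = "over"
        elif pct < avg - threshold:
            result[node] = "under"
    return result


# Hypothetical cluster: node name -> % of space used
cluster = {"dn1": 40.0, "dn2": 50.0, "dn3": 60.0}
print(is_balanced(cluster, threshold=10.0))
print(unbalanced_nodes(cluster, threshold=5.0))
```

With a wider threshold more nodes count as balanced, so the balancer has less data to move; a tighter threshold gives a more even cluster at the cost of a longer balancing run.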

11. What is a Delegation Token in Hadoop?

A delegation token is an authentication token used to access a secure Namenode from a non-secure client node. The HDFS fetchdt command can be used to get the delegation token and store it in a file on the local system. Its common syntax is: hdfs fetchdt <opts> <token_file_path>

fetchdt uses either RPC or HTTPS (over Kerberos) to get the token, so Kerberos tickets must be present before fetchdt is run.

Once the token is obtained, the user can run HDFS commands without Kerberos tickets by pointing the HADOOP_TOKEN_FILE_LOCATION environment variable to the delegation token file.
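As a rough sketch of that last step, the Python snippet below builds an environment with HADOOP_TOKEN_FILE_LOCATION set and uses it to launch an HDFS command. The helper names and the token path are hypothetical, and it assumes the hdfs launcher is on the PATH and that the token file was already fetched.

```python
import os
import subprocess


def env_with_token(token_file, base_env=None):
    """Return a copy of the environment that points Hadoop at the token file."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_TOKEN_FILE_LOCATION"] = token_file
    return env


def list_root_with_token(token_file):
    """Run 'hdfs dfs -ls /' using the delegation token instead of Kerberos tickets."""
    return subprocess.run(
        ["hdfs", "dfs", "-ls", "/"],
        env=env_with_token(token_file),
        capture_output=True,
        text=True,
        check=False,
    )


# Example (hypothetical path; needs a Hadoop client installed):
# result = list_root_with_token("/tmp/my.delegation.token")
# print(result.stdout)
```

Copying the environment instead of modifying os.environ keeps the token setting local to the one command being launched.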

12. What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

13. What characteristic of the Streaming API makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, awk, etc.?

Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
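To make the stdin/stdout contract concrete, here is a minimal word-count sketch in Python. The function names and the "map"/"reduce" stage argument are choices made for this example; it relies on the usual Streaming conventions that key and value are separated by a tab and that the reducer sees its input sorted by key.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run(stage, stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write results to stdout for one stage."""
    func = map_lines if stage == "map" else reduce_lines
    for record in func(stdin):
        stdout.write(record + "\n")


# A real script would finish with something like: run(sys.argv[1])
```

The same script could then be passed to the streaming job through its -mapper and -reducer options, once with the "map" argument and once with "reduce". Because the framework only cares about lines on stdin and stdout, the same logic could just as well be written in Perl, Ruby or awk.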

14. What is metadata?

Metadata is information about the data stored in the data nodes, such as the location of a file, the size of a file and so on.

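15. What is the lowest granularity at which you can apply replication factor in HDFS?
– We can choose replication factor per directory
– We can choose replication factor per file in a directory
– We can choose replication factor per block of a file

The replication factor is set per file, so a file is the lowest granularity; all blocks of a file share that file's replication factor.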


16. What is HBase?

A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and random reads.

17. What is ZooKeeper?

A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.

18. What is Chukwa?

A distributed data collection and analysis system. Chukwa runs collectors that store data in HDFS, and it uses MapReduce to produce reports.

19. What is Avro?

A data serialization system for efficient, cross-language RPC, and persistent data storage.

20. What is Sqoop in Hadoop?

Sqoop is a tool for transferring data between relational database management systems (RDBMS) and Hadoop. We can use it to import data from an RDBMS such as MySQL or Oracle into HDFS, Hive or HBase, as well as to export data from HDFS, Hive or HBase back to an RDBMS.

Sqoop reads the table row by row, and the import is performed in parallel, so the output may be spread across multiple files.
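The link between a parallel import and multiple output files can be pictured with a small Python sketch. This is a simplified model rather than Sqoop's own splitting code: it assumes a numeric key column with a known minimum and maximum, divides that range evenly among the map tasks, and pairs each range with an output file named in the part-m-NNNNN style. The function names and the key range are invented for the example.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive key range [lo, hi] into contiguous, near-equal ranges."""
    total = hi - lo + 1
    size, extra = divmod(total, num_mappers)
    ranges = []
    start = lo
    for i in range(num_mappers):
        # The first 'extra' ranges take one additional key each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # fewer keys than mappers: remaining mappers get nothing
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def plan_import(lo, hi, num_mappers):
    """Map each output file name to the key range one map task would read."""
    return {
        f"part-m-{i:05d}": key_range
        for i, key_range in enumerate(split_ranges(lo, hi, num_mappers))
    }


# Hypothetical table with ids 1..10 imported by 4 parallel map tasks
for name, key_range in plan_import(1, 10, 4).items():
    print(name, key_range)
```

Each map task writes its own file, which is why a single imported table usually shows up in the target directory as several part files rather than one.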

We will cover a few more Hadoop interview questions and answers in upcoming posts.

About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
