100 Interview Questions on Hadoop 2

1. What does commodity hardware mean in the Hadoop world? (D)

a) Very cheap hardware

b) Industry standard hardware

c) Discarded hardware

d) Low specifications Industry grade hardware

2. Which of the following are NOT big data problem(s)? ( D)

a) Parsing 5 MB XML file every 5 minutes

b) Processing IPL tweet sentiments

c) Processing online bank transactions

d) both (a) and (c)

3. What does “Velocity” in Big Data mean? (D)

a) Speed of input data generation

b) Speed of individual machine processors

c) Speed of ONLY storing data

d) Speed of storing and processing data

4. The term Big Data first originated from: ( C )

a) Stock Markets Domain

b) Banking and Finance Domain

c) Genomics and Astronomy Domain

d) Social Media Domain

5. Which of the following batch processing instances is NOT an example of Big Data batch processing? (D)

a) Processing 10 GB sales data every 6 hours

b) Processing flights sensor data

c) Web crawling app

d) Trending topic analysis of tweets for last 15 minutes

6. Which of the following are example(s) of Real Time Big Data Processing? ( D)

a) Complex Event Processing (CEP) platforms

b) Stock market data analysis

c) Bank fraud transactions detection

d) both (a) and (c)

7. Sliding window operations typically fall in the category of __________. (C)

a) OLTP Transactions

b) Big Data Batch Processing

c) Big Data Real Time Processing

d) Small Batch Processing
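
To make the sliding-window idea concrete, here is a minimal Python sketch. It is not tied to any particular streaming engine; the function name and the sample readings are made up for illustration. Each new value updates the window and immediately yields a fresh result, which is why such operations belong to real-time rather than batch processing.

```python
from collections import deque


def sliding_window_averages(values, window_size):
    """Yield the average of the most recent `window_size` values.

    A result is emitted as soon as each new value arrives, once the
    window is full (hypothetical helper, for illustration only).
    """
    window = deque(maxlen=window_size)  # old values fall out automatically
    for value in values:
        window.append(value)
        if len(window) == window_size:
            yield sum(window) / window_size


readings = [10, 20, 30, 40, 50]  # made-up sensor readings
print(list(sliding_window_averages(readings, 3)))  # [20.0, 30.0, 40.0]
```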

8. What is HBase used as? (A )

a) Tool for Random and Fast Read/Write operations in Hadoop

b) Faster Read only query engine in Hadoop

c) MapReduce alternative in Hadoop

d) Fast MapReduce layer in Hadoop

9. What is Hive used as? (D )

a) Hadoop query engine

b) MapReduce wrapper

c) Hadoop SQL interface

d) All of the above

10. Which of the following are NOT true for Hadoop? (D)

a) It’s a tool for Big Data analysis

b) It supports structured and unstructured data analysis

c) It aims for vertical scaling out/in scenarios

d) Both (a) and (c)

11. Which of the following are the core components of Hadoop? (D)

a) HDFS

b) Map Reduce

c) HBase

d) Both (a) and (b)

12. Hadoop is open source. ( B)

a) ALWAYS True

b) True only for Apache Hadoop

c) True only for Apache and Cloudera Hadoop

d) ALWAYS False

13. Hive can be used for real time queries. (B)

a) True

b) False

c) True if data set is small

d) True for some distributions

14. What is the default HDFS block size? ( D )

a) 32 MB

b) 64 KB

c) 128 KB

d) 64 MB

15. What is the default HDFS replication factor? ( C)

a) 4

b) 1

c) 3

d) 2
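
The arithmetic behind block size and replication factor can be sketched in a few lines of Python. The defaults below (64 MB, replication 3) are the answers given in this quiz; Hadoop 2.x releases ship with a 128 MB default block size (`dfs.blocksize`), so treat both values as configurable. The function name is hypothetical.

```python
import math


def hdfs_block_usage(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block replicas, raw MB stored) for one file.

    Simplified model: the last block holds only the leftover bytes,
    so raw storage is file size times replication.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


# A 1000 MB file with 64 MB blocks and replication 3:
print(hdfs_block_usage(1000))                      # (16, 48, 3000)
# The same file with a 128 MB block size:
print(hdfs_block_usage(1000, block_size_mb=128))   # (8, 24, 3000)
```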

16. Which of the following is NOT a type of metadata in NameNode? ( C)

a) List of files

b) Block locations of files

c) No. of file records

d) File access control information

17. Which of the following is/are correct? (D )

a) NameNode is the SPOF in Hadoop 1.x

b) NameNode is the SPOF in Hadoop 2.x

c) NameNode keeps the image of the file system also

d) Both (a) and (c)

18. The mechanism used to create replicas in HDFS is __________. (C)

a) Gossip protocol

b) Replicate protocol

c) HDFS protocol

d) Store and Forward protocol

19. NameNode tries to keep the first copy of data nearest to the client machine. ( C)

a) ALWAYS true

b) ALWAYS False

c) True if the client machine is the part of the cluster

d) True if the client machine is not the part of the cluster

20. HDFS data blocks can be read in parallel. (A)

a) True

b) False

21. Where is HDFS replication factor controlled? ( D)

a) mapred-site.xml

b) yarn-site.xml

c) core-site.xml

d) hdfs-site.xml

22. Read the statement and select the correct option: ( B)

It is necessary to define all the properties in Hadoop config files.

a) True

b) False

23. Which of the following Hadoop config files is used to define the heap size? (C )

a) hdfs-site.xml

b) core-site.xml

c) hadoop-env.sh

d) Slaves

24. Which of the following is not a valid Hadoop config file? ( B)

a) mapred-site.xml

b) hadoop-site.xml

c) core-site.xml

d) Masters

25. Read the statement:

NameNodes are usually high storage machines in the clusters. ( B)

a) True

b) False

c) Depends on cluster size

d) True if co-located with Job tracker

26. From the options listed below, select the suitable data sources for Flume. (D)

a) Publicly open web sites

b) Local data folders

c) Remote web servers

d) Both (a) and (c)

27. Read the statement and select the correct options: ( A)

distcp command ALWAYS needs fully qualified hdfs paths.

a) True

b) False

c) True, if source and destination are in same cluster

d) False, if source and destination are in same cluster

28. Which of the following statement(s) are true about the distcp command? (A)

a) It invokes MapReduce in background

b) It invokes MapReduce if source and destination are in same cluster

c) It can’t copy data from local folder to hdfs folder

d) You can’t overwrite the files through distcp command

29. Which of the following is NOT a component of Flume? (B)

a) Sink

b) Database

c) Source

d) Channel

30. Which of the following is the correct sequence of MapReduce flow? ( C )

a) Map → Reduce → Combine

b) Combine → Reduce → Map

c) Map → Combine → Reduce

d) Reduce → Combine → Map
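
The Map, Combine, Reduce ordering can be illustrated with a small in-memory word count. This is a plain Python simulation, not Hadoop code; the function names and the two sample splits are assumptions made for the example. The combiner runs on each mapper's local output before the shuffle, and the reducer sees the merged partial counts.

```python
from collections import Counter, defaultdict


def map_phase(split):
    """Emit (word, 1) for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def reduce_phase(pairs):
    """Group all combined pairs by key and sum the partial counts."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


splits = [["a b a"], ["b b c"]]                      # two made-up input splits
combined = [combine(map_phase(s)) for s in splits]   # Map, then Combine
result = reduce_phase(p for chunk in combined for p in chunk)  # Reduce
print(result)  # {'a': 2, 'b': 3, 'c': 1}
```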

31. Which of the following can be used to control the number of part files in a MapReduce program's output directory? (B)

a) Number of Mappers

b) Number of Reducers

c) Counter

d) Partitioner
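
A rough Python simulation shows why the reducer count fixes the number of part files: every reducer writes one file, and the partitioner only decides which of those files a key lands in. The CRC32 hash below is a stand-in for Hadoop's default HashPartitioner, and the helper names are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer for a key (stand-in for Hadoop's HashPartitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def simulate_output_files(pairs, num_reducers):
    """Return a dict of part-file name -> records, one file per reducer."""
    files = {f"part-r-{i:05d}": [] for i in range(num_reducers)}
    for key, value in pairs:
        name = f"part-r-{partition(key, num_reducers):05d}"
        files[name].append((key, value))
    return files


pairs = [("apple", 1), ("banana", 2), ("cherry", 3), ("apple", 4)]
files = simulate_output_files(pairs, num_reducers=4)
print(sorted(files))  # four part files, even if some stay empty
```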

32. Which of the following operations can’t use the Reducer as a combiner as well? (D)

a) Group by Minimum

b) Group by Maximum

c) Group by Count

d) Group by Average
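
A short numeric check shows why average is the odd one out: an average of partial averages is generally not the overall average, while a maximum of partial maxima is still the maximum. The sample values are made up. The usual workaround, sketched at the end, is a separate combiner that emits (sum, count) pairs.

```python
def mean(values):
    return sum(values) / len(values)


# Values for one key, as produced by two different mappers (made-up data).
map_outputs = [[1, 2, 3], [10]]
all_values = [v for chunk in map_outputs for v in chunk]

# Maximum survives pre-aggregation on each mapper.
assert max(max(chunk) for chunk in map_outputs) == max(all_values)

# Average does not: reusing the reducer as a combiner gives the wrong answer.
true_mean = mean(all_values)                               # 16 / 4 = 4.0
naive_mean = mean([mean(chunk) for chunk in map_outputs])  # (2.0 + 10.0) / 2 = 6.0

# Workaround: the combiner emits (sum, count); the reducer divides at the end.
partials = [(sum(chunk), len(chunk)) for chunk in map_outputs]
total, count = (sum(column) for column in zip(*partials))
fixed_mean = total / count                                 # 4.0

print(true_mean, naive_mean, fixed_mean)
```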

33. Which of the following is/are true about combiners? (D)

a) Combiners can be used for mapper only job

b) Combiners can be used for any Map Reduce operation

c) Mappers can be used as a combiner class

d) Combiners are primarily aimed to improve Map Reduce performance

e) Combiners can’t be applied for associative operations

34. Reduce side join is useful for (A)

a) Very large datasets

b) Very small data sets

c) One small and other big data sets

d) One big and other small datasets

35. Distributed Cache can be used in (D)

a) Mapper phase only

b) Reducer phase only

c) In either phase, but not on both sides simultaneously

d) In either phase

36. Counters persist the data on hard disk. (B)

a) True

b) False

37. What is the optimal size of a file for distributed cache? (C)

a) <=10 MB

b) >=250 MB

c) <=100 MB

d) <=35 MB

38. Number of mappers is decided by the (D)

a) Mappers specified by the programmer

b) Available Mapper slots

c) Available heap memory

d) Input Splits

e) Input Format
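
As a simplified illustration of how input splits drive the mapper count, the sketch below assumes one split per block-sized chunk and that splits never span files. Real input formats apply additional rules (minimum and maximum split sizes, non-splittable compression), so this is an approximation, and the function name is hypothetical.

```python
import math


def count_map_tasks(file_sizes_mb, split_size_mb=64):
    """Estimate the number of mappers: one per input split, per file."""
    return sum(math.ceil(size / split_size_mb) for size in file_sizes_mb)


# Three made-up input files of 200 MB, 10 MB and 64 MB:
print(count_map_tasks([200, 10, 64]))  # 4 + 1 + 1 = 6
```

Note that many small files produce many mappers even when the total data volume is small, because each file contributes at least one split in this model.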

39. Which of the following type of joins can be performed in Reduce side join operation? (E)

a) Equi Join

b) Left Outer Join

c) Right Outer Join

d) Full Outer Join

e) All of the above
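
All four join types are possible on the reduce side because the reducer sees every record for a key from both inputs and can decide what to do when one side is empty. The following is an in-memory Python sketch of that idea, not Hadoop code; the function name, the `how` parameter, and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def reduce_side_join(left, right, how="inner"):
    """Join two lists of (key, value) pairs the way a reducer would."""
    # "Map" and shuffle: tag each record with its source and group by key.
    grouped = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        grouped[key]["L"].append(value)
    for key, value in right:
        grouped[key]["R"].append(value)

    # "Reduce": for each key, pair up the two sides, padding with None
    # when an outer join needs to keep unmatched records.
    joined = []
    for key in sorted(grouped):
        lefts, rights = grouped[key]["L"], grouped[key]["R"]
        if not rights and how in ("left", "full"):
            rights = [None]
        if not lefts and how in ("right", "full"):
            lefts = [None]
        for l_value in lefts:
            for r_value in rights:
                joined.append((key, l_value, r_value))
    return joined


users = [(1, "ann"), (2, "bob")]       # made-up sample data
orders = [(2, "book"), (3, "pen")]
print(reduce_side_join(users, orders, "inner"))  # [(2, 'bob', 'book')]
print(reduce_side_join(users, orders, "full"))
```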

40. What should be an upper limit for counters of a Map Reduce job? (D)

a) ~5

b) ~15

c) ~150

d) ~50

41. Which of the following classes is responsible for converting inputs to key-value pairs of MapReduce? (C)

a) FileInputFormat

b) InputSplit

c) RecordReader

d) Mapper
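
For plain text input, the record reader hands the mapper one pair per line: the byte offset of the line as the key and the line text as the value. The sketch below mimics that behaviour in Python on an in-memory byte string; it is an illustration of the idea, not the Hadoop class, and the function name is hypothetical.

```python
def read_records(data):
    """Yield (byte offset, line text) pairs from raw bytes."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # Key: where the line starts; value: the line without its newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


data = b"alpha\nbeta\ngamma"   # made-up file contents
print(list(read_records(data)))  # [(0, 'alpha'), (6, 'beta'), (11, 'gamma')]
```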

42. Which of the following Writables can be used when a mapper/reducer has no value to emit? (C)

a) Text

b) IntWritable

c) NullWritable

d) String

43. Distributed cache files can’t be accessed in Reducer. (B)

a) True

b) False

44. Only one distributed cache file can be used in a Map Reduce job. (B)

a) True

b) False

45. A Map reduce job can be written in: (D)

a) Java

b) Ruby

c) Python

d) Any Language which can read from input stream
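
Non-Java jobs typically run through Hadoop Streaming, where the mapper is any program that reads lines from standard input and writes tab-separated key/value lines to standard output. Below is a minimal word-count mapper sketched in Python. The function names are hypothetical, and the demo feeds an in-memory stream so it can run without a cluster; under Streaming you would pass `sys.stdin` and `sys.stdout` instead.

```python
import io


def map_line(line):
    """Turn one input line into "word<TAB>1" records."""
    return [f"{word.lower()}\t1" for word in line.split()]


def run_mapper(stream, out):
    """Read lines from `stream` and write mapper records to `out`."""
    for line in stream:
        for record in map_line(line):
            out.write(record + "\n")


# Local demo with made-up input.
demo_in = io.StringIO("Hadoop streams lines\nto any language\n")
demo_out = io.StringIO()
run_mapper(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```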

46. Pig is a: (B)

a) Programming Language

b) Data Flow Language

c) Query Language

d) Database

47. Pig is good for: (E)

a) Data Factory operations

b) Data Warehouse operations

c) Implementing complex SQLs

d) Creating multiple datasets from a single large dataset

e) Both (a) and (d)

48. Pig can be used for real-time data updates. (B)

a) True

b) False

49. Pig jobs have the same run time as the native Map Reduce jobs. (B)

a) True

b) False

50. Which of the following is the correct representation to access ‘Skill’ from the bag below? (C)

Bag {‘Skills’, 55, (‘Skill’, ‘Speed’), {2, (‘San’, ‘Mateo’)}}

a) $3.$1

b) $3.$0

c) $2.$0

d) $2.$1
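
Pig's positional notation is zero-based: `$0` is the first field, and `$2.$0` is the first field inside the third field. The Python sketch below models the record from the question as nested sequences to check each option. This is only an analogy for the positional lookup, not Pig semantics in full, and the `field` helper is hypothetical.

```python
# The record from the question, modelled as nested Python sequences.
record = ("Skills", 55, ("Skill", "Speed"), [2, ("San", "Mateo")])


def field(value, *positions):
    """Follow a chain of zero-based positions, like $2.$0 in Pig."""
    for position in positions:
        value = value[position]
    return value


print(field(record, 2, 0))  # Skill           ($2.$0)
print(field(record, 2, 1))  # Speed           ($2.$1)
print(field(record, 3, 0))  # 2               ($3.$0)
print(field(record, 3, 1))  # ('San', 'Mateo') ($3.$1)
```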

51. Replicated joins are useful for dealing with data skew. (B)

a) True

b) False

52. Maximum size allowed for small dataset in replicated join is: (C)

a) 10KB

b) 10 MB

c) 100 MB

d) 500 MB
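
A replicated join works because the small dataset fits in memory on every mapper, so the large dataset can be joined while it streams through the map phase with no reduce step. A minimal in-memory Python sketch of that idea follows; the function name and sample data are assumptions for illustration.

```python
def replicated_join(big_records, small_records):
    """Join a large stream against a small, fully loaded lookup table."""
    # Load the small dataset once; in a real job every mapper holds a copy.
    lookup = {}
    for key, value in small_records:
        lookup.setdefault(key, []).append(value)

    # Stream the large dataset and emit matches as they are found.
    for key, value in big_records:
        for match in lookup.get(key, ()):
            yield key, value, match


countries = [("IN", "India"), ("US", "United States")]   # small side
clicks = [("US", "click-1"), ("IN", "click-2"), ("FR", "click-3")]  # big side
print(list(replicated_join(clicks, countries)))
```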

53. Parameters could be passed to Pig scripts from: (E)

a) Parent Pig Scripts

b) Shell Script

c) Command Line

d) Configuration File

e) All the above except (a)

54. The schema of a relation can be examined through: (B)





55. DUMP Statement writes the output in a file. (B)

a) True

b) False

56. Data can be supplied to PigUnit tests from: (C)

a) HDFS Location

b) Within Program

c) Both (a) and (b)

d) None of the above

57. Which of the following constructs are valid Pig Control Structures? (D)

a) If-else

b) For Loop

c) Until Loop

d) None of the above

58. Which of the following is the return data type of a Filter UDF? (C)

a) String

b) Integer

c) Boolean

d) None of the above

59. UDFs can be applied only in FOREACH statements in Pig. (B)

a) True

b) False

60. Which of the following are not possible in Hive? (E)

a) Creating Tables

b) Creating Indexes

c) Creating Synonym

d) Writing Update Statements

e) Both (c) and (d)

61. Who will initiate the mapper? (A)

a) Task tracker

b) Job tracker

c) Combiner

d) Reducer

62. Categorize each of the following by data type:

a) JSON files – Semi-structured

b) Word docs, PDF files, text files – Unstructured

c) Email body – Unstructured

d) Data from enterprise systems (DB, CRM) – Structured

63. Which of the following are the Big Data Solutions Candidates? (E)

a) Processing 1.5 TB data everyday

b) Processing 30 minutes Flight sensor data

c) Interconnecting 50K data points (approx. 1 MB input file)

d) Processing User clicks on a website

e) All of the above

64. Hadoop is a framework that allows the distributed processing of: (C)

a) Small Data Sets

b) Semi-Large Data Sets

c) Large Data Sets

d) Large and Small Data sets

65. Where does Sqoop ingest data from? (B) & (D)

a) Linux File Directory

b) Oracle

c) HBase

d) MySQL

e) MongoDB

66. Identify the batch processing scenarios from following: (C) & (E)

a) Sliding Window Averages Job

b) Facebook Comments Processing Job

c) Inventory Dynamic Pricing Job

d) Fraudulent Transaction Identification Job

e) Financial Forecasting Job

67. Which of the following are not true about Name Node? (B) & (C) & (D)

a) It is the Master Machine of the Cluster

b) It is Name Node that can store user data

c) Name Node is a storage heavy machine

d) Name Node can be replaced by any Data Node Machine

68. Which of the following are NOT metadata items? (E)

a) List of HDFS files

b) HDFS block locations

c) Replication factor of files

d) Access Rights

e) File Records distribution

69. What decides the number of Mappers for a MapReduce job? (D)

a) File Location

b) mapred.map.tasks parameter

c) Input file size

d) Input Splits

70. Name Node monitors block replication process. (B)

a) True

b) False

c) Depends on file type

71. Which of the following are true for Hadoop Pseudo Distributed Mode? (C)

a) It runs on multiple machines

b) Runs on multiple machines without any daemons

c) Runs on Single Machine with all daemons

d) Runs on Single Machine without all daemons

72. Which of the following statement(s) are correct? (C)