100 Hadoop Certification Dump Questions



1. Which of the following describes how a client reads a file from HDFS?  ( 1 )

  1. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly from the DataNode(s).
  2. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly from the DataNode.
  3. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly from the DataNode.
  4. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

2. One question asks about a website: how can you get the website's file into HDFS so it can be worked on?

Choose the option that converts the XML file to JSON.

3. Can Mappers communicate with each other while running? (Or: is there any way to make them communicate while running?)

No.

4. You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array (see the sketch after the options). ( 3 )

  1. Map
  2. Combine
  3. Setup
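
A minimal sketch of option 3: load the cached file once in the new-API setup() method, then use the in-memory map inside map(). The file layout (tab-separated key/value) and the class name are assumptions for illustration, not taken from the question.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Runs once per task, before any calls to map(): load the cached file.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    if (cached != null && cached.length > 0) {
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      try {
        String line;
        while ((line = reader.readLine()) != null) {
          String[] parts = line.split("\t", 2);   // assumed key<TAB>value layout
          if (parts.length == 2) {
            lookup.put(parts[0], parts[1]);
          }
        }
      } finally {
        reader.close();
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Join each incoming record against the in-memory table loaded in setup().
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) {
      return;                                      // skip malformed lines
    }
    String joined = lookup.get(fields[0]);
    if (joined != null) {
      context.write(new Text(fields[0]), new Text(fields[1] + "\t" + joined));
    }
  }
}
```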

5. You have just executed a Map Reduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?  ( 3 )

  1. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
  2. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
  3. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
  4. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
  5. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

6. You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis? ( 1 )

  1. Ingest the Server web logs into HDFS using Flume
  2. Write a Map Reduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
  3. Import all user’s clicks from your OLTP databases into Hadoop, using Sqoop.
  4. Channel these clickstreams into Hadoop using Hadoop Streaming.
  5. Sample the weblogs from the web servers, copying them into Hadoop using curl.

7. Which best describes how TextInputFormat processes input files and line breaks? ( 1 )

  1. Input file splits may cross line breaks. A line that crosses file splits is read by the Record Reader of the split that contains the beginning of the broken line.
  2. Input file splits may cross line breaks. A line that crosses file splits is read by the Record Readers of both splits containing the broken line.
  3. The input file is split exactly at the line breaks, so each Record Reader will read a series of complete lines.
  4. Input file splits may cross line breaks. A line that crosses file splits is ignored.
  5. Input file splits may cross line breaks. A line that crosses file splits is read by the Record Reader of the split that contains the end of the broken line.

8. For each input key-value pair, mappers can emit: ( 5 )

  1. As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  2. As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
  3. One intermediate key-value pair, of a different type.
  4. One intermediate key-value pair, but of the same type.
  5. As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.

9. You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records? ( 3 )

  1. HDFS command
  2. Pig LOAD command
  3. Sqoop import
  4. Hive LOAD DATA command
  5. Ingest with Flume agents
  6. Ingest with Hadoop Streaming

10. Given a directory of files with the following structure: line number, tab character, string

Example:

1  abialkjfjkaoasdfjksdlkjhqweroij

2 kadfjhuwqounahagtnbvaswslmnbfgy

3  kjfteiomndscxeqalkzhtopedkfsikj

11. You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class);? ( 3 )

  1. SequenceFileAsTextInputFormat
  2. SequenceFileInputFormat
  3. KeyValueTextInputFormat
  4. BDBInputFormat

12. You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:

       output.collect(new Text("Apple"), new Text("Red"));

       output.collect(new Text("Banana"), new Text("Yellow"));

       output.collect(new Text("Apple"), new Text("Yellow"));

       output.collect(new Text("Cherry"), new Text("Red"));

       output.collect(new Text("apple"), new Text("Green"));

     How many times will the Reducer’s reduce method be invoked? ( 5 )

  1. 6
  2. 3
  3. 1
  4. 0
  5. 4

(The answer is 4 because Text keys are compared case-sensitively, so the distinct keys are "Apple", "Banana", "Cherry", and "apple".)

13. To process input key-value pairs, your mapper needs to read a 512 MB data file in memory. What is the best way to accomplish this? ( 2)

1. Serialize the data file, insert it into the JobConf object, and read the data into memory in the configure method of the mapper.

2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.

3. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.

This question came up again, but without an option for the configure method of the mapper; in that version I chose the setup method of the mapper.

14. To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this? ( 3 )

1. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.

2. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.

3. Place the data file in the DistributedCache and read the data into memory in the setup method of the mapper.

4. Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.

15. In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values? ( 2 )

1. The values are in sorted order.

2. The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.

3. The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.

4. Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.

16. For each intermediate key, each reducer task can emit: ( 3 )

  1. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
  2. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
  3. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
  4. One final key-value pair per value associated with the key; no restrictions on the type.
  5. One final key-value pair per key; no restrictions on the type.

17. You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks? ( 2 )

  1. Processor and network I/O
  2. Disk I/O and network I/O
  3. Processor and RAM
  4. Processor and disk I/O

18. Analyze each scenario below and identify which best describes the behavior of the default partitioner (see the sketch after the options). ( 4 )

  1. The default partitioner assigns key-value pairs to reducers based on an internal random number generator.
  2. The default partitioner implements a round-robin strategy, shuffling the key-value pairs to each reducer in turn. This ensures an even partition of the key space.
  3. The default partitioner computes the hash of the key. Hash values between specific ranges are associated with different buckets, and each bucket is assigned to a specific reducer.
  4. The default partitioner computes the hash of the key and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
  5. The default partitioner computes the hash of the value and divides that value modulo the number of reducers. The result determines the reducer assigned to process the key-value pair.
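
For reference, option 4 mirrors what Hadoop's default HashPartitioner does; a rough sketch of that logic:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of the default partitioner's behavior: hash the key, mask off the
// sign bit, and take the result modulo the number of reduce tasks.
public class DefaultStylePartitioner<K, V> extends Partitioner<K, V> {

  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Only the key participates; the value never influences the partition.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```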

19. One question tells you that they want to change all the customer names to upper case and change all the dates into the same format. How can you do it?

Write a MapReduce job in which the mapper converts the names to upper case and the dates to a common format. A reducer is not needed; everything can be done in the mapper alone (a map-only job).
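
A minimal map-only sketch of that idea, assuming comma-separated records with the customer name in field 0 and a dd/MM/yyyy date in field 1 (both assumptions, not from the question); setting zero reduce tasks makes it a mapper-only job.

```java
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NormalizeMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

  private final SimpleDateFormat in = new SimpleDateFormat("dd/MM/yyyy");
  private final SimpleDateFormat out = new SimpleDateFormat("yyyy-MM-dd");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) {
      return;                                       // skip malformed records
    }
    fields[0] = fields[0].toUpperCase();            // customer name to upper case
    try {
      fields[1] = out.format(in.parse(fields[1]));  // normalize the date format
    } catch (ParseException e) {
      // leave unparseable dates unchanged
    }
    context.write(NullWritable.get(), new Text(String.join(",", fields)));
  }
}
// In the driver: job.setNumReduceTasks(0);  // no reducer is needed
```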

20. If no output directory has been created and we try to copy data from local to HDFS, will the program run and put the output in HDFS, or will it fail?

It executes and creates the new directory.

      Coding:

21. There will be two questions asking for the number of reducers, based on two different code samples.

To answer, look at the configuration (driver) part of the code. In one sample no reducer is declared: you can find the call that sets the Mapper class (the name of your Mapper), but no call that sets a Reducer class. Therefore, for that code, the number of reducers is zero.

In the other sample there is something like setReducerClass(MyReducer), but the number of reducers is never set (there is no setNumReduceTasks(2) call), so choose 1, because 1 is the default number of reducers.
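
A hypothetical new-API driver fragment illustrating both cases; the identity Mapper and Reducer classes below stand in for whatever classes the exam code names.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReducerCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reducer-count-example");
    job.setJarByClass(ReducerCountDriver.class);

    // The identity Mapper stands in for the Mapper named in the exam code.
    job.setMapperClass(Mapper.class);

    // Case 1 (first code sample): a map-only job -- zero reducers.
    // job.setNumReduceTasks(0);

    // Case 2 (second code sample): a Reducer class is set but the number of
    // reduce tasks is not, so the framework uses its default of 1 reducer.
    job.setReducerClass(Reducer.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```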

Setup for questions 22 and 23: one question gives you code with a Driver, a Mapper, and a Reducer. The Mapper emits its key-value pair using something like context.write(new Text(word), new IntWritable(frequency(word))). I do not remember it exactly, but there is definitely a frequency term.

The question then gives you another Mapper class. The new Mapper emits its key-value pair with context.write(new Text(word), new IntWritable(1)), similar to WordCount.

22. The question asks: if we replace the original Mapper with this new one, what happens?

Ans – more network bandwidth would be used (more intermediate data is shuffled).

23. Another question on the same code asks how many mappers will run.

Ans: 4 (there are four files; the first three are each smaller than a block, and the last one is empty but still counts).

24. A simple code sample is given and you are asked how to get the input file name.

Ans: ((FileSplit) context.getInputSplit()).getPath().getName() gives the file name.
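
Written out in a new-API Mapper, that call looks roughly like this (the surrounding class is an assumed example):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {

  private String fileName;

  @Override
  protected void setup(Context context) {
    // Cast the InputSplit to FileSplit to reach the Path of the file being read.
    FileSplit split = (FileSplit) context.getInputSplit();
    fileName = split.getPath().getName();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Tag every record with the name of the file it came from.
    context.write(new Text(fileName), value);
  }
}
```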

25. If we do not set any reducers, how many reducers will run by default?

Ans: 1 reducer (the framework's default).

26. Find the output for the given code when run on the file below: ( 1 )

j j g o h

j k g h o

j j g j j h

  1. j 7
  2. g 3
  3. g 5
  4. g 15
  5. job fails

27. How many mappers will run if file1 is 256 MB, file2 is 10 GB, and the block size is 128 MB?

Ans – 102

28. You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array?

  1. combine
  2. map
  3. init
  4. configure

  Answer: D

29. How to delete table CUSTOMER? ( 1 )

  1. hive -e "DROP TABLE customer;"
  2. hive -e "DROP CUSTOMER;"

30. You have data (from a relational database) in HDFS. Now you want to move it to the local file system. How can you do it?

Ans: hadoop fs -get (copies the file from HDFS to the local file system).

31. You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of the file, what would another user see when trying to access this file? ( 1 )

  1. They would see no content until the whole file is written and closed.
  2. They would see the content of the file through the last completed block.
  3. They would see the current state of the file, up to the last bit written by the command.
  4. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.

32. You need to move a file titled “weblogs” into HDFS. When you try to copy the file, you can’t. You know you have ample space on your DataNodes. Which action should you take to relieve  this situation and store more files in HDFS?  ( 4 )

  1. Increase the block size on all current files in HDFS.
  2. Increase the block size on your remaining files.
  3. Decrease the block size on your remaining files.
  4. Increase the amount of memory for the NameNode.
  5. Increase the number of disks (or size) for the NameNode.
  6. Decrease the block size on all current files in HDFS.

33. You have user employee records in your OLTP database that you want to join with project details records in another OLTP database, which you have already put into the Hadoop file system. How will you obtain these user records?

  1. HDFS command
  2. Pig LOAD command
  3. Sqoop import
  4. Hive LOAD DATA command
  5. Flume agents

34. What is the difference between a failed task attempt and a killed task attempt?

  1. A failed task attempt is a task attempt that did not generate any key-value pairs. A killed task attempt is a task attempt that threw an exception and was thus killed by the execution framework.
  2. A failed task attempt is a task attempt that threw a RuntimeException (i.e., the task fails). A killed task attempt is a task attempt that threw any other type of exception (e.g., IOException); the execution framework catches these exceptions and reports them as killed.
  3. A failed task attempt is a task attempt that threw an unhandled exception. A killed task attempt is one that was terminated by the JobTracker.
  4. A failed task attempt is a task attempt that completed, but with an unexpected status value. A killed task attempt is a duplicate copy of a task attempt that was started as part of speculative execution.

35. In an Oozie workflow, can all the MapReduce jobs run only in sequence? True or false.

False (a workflow can fork so that actions run in parallel).

36. You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method?

  1. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
  2. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
  3. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
  4. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer
  5. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

 Answer: C

37. Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage? 

  • ApplicationMaster
  • ResourceManager
  • NodeManager
  • TaskTracker
  • JobTracker

38. For each intermediate key, each reducer task can emit: 

a) As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type
b) One final key-value pair per value associated with the key; no restrictions on the type
c) One final key-value pair per key; no restrictions on the type
d) As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs
e) As many final key-value pairs as desired. There are no restrictions on the types of those key value pairs (i.e., they can be heterogeneous)

39. I have read that Pig provides an additional capability for controlling the flow of multiple MapReduce jobs, but I have also seen that it doesn’t provide that additional capability and only provides an interpreter.
So which one should I consider?

40. You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis? 

A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

41. Which describes how a client reads a file from HDFS?

A. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).

B. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.

C. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.

D. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

42. You use the hadoop fs -put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file? (my answer D)

A. They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.
B. They would see the current state of the file, up to the last bit written by the command.
C. They would see the content of the file through the last completed block.
D. They would see no content until the whole file is written and closed.

43. You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array? (my answer D)

A. combine
B. map
C. init
D. configure

44. You’ve written a MapReduce job that will process 500 million input records and generate 500 million key-value pairs. The data is not uniformly distributed. Your MapReduce job will create a significant amount of intermediate data that it needs to transfer between mappers and reducers, which is a potential bottleneck. A custom implementation of which interface is most likely to reduce the amount of intermediate data transferred across the network? (See the combiner sketch after the answer.)

  1. Partitioner
  2. OutputFormat
  3. WritableComparable
  4. Writable
  5. InputFormat
  6. Combiner

 Answer: F
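
A sketch of why option F helps in a WordCount-style job: registering the summing Reducer as a combiner computes partial sums on the map side, so less intermediate data crosses the network. Class names are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();       // summing is commutative and associative,
    }                       // so the same class is safe to use as a combiner
    context.write(key, new IntWritable(sum));
  }
}

// In the driver:
//   job.setReducerClass(SumReducer.class);
//   job.setCombinerClass(SumReducer.class);   // cuts intermediate data before the shuffle
```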

45. You have just executed a MapReduce job. Where is intermediate data written to after being emitted from the Mapper’s map method? (my answer C)

A. Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
B. Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
C. Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
D. Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
E. Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.

46. You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis? (my answer A)

A. Ingest the server web logs into HDFS using Flume.
B. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C. Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
D. Channel these clickstreams into Hadoop using Hadoop Streaming.
E. Sample the weblogs from the web servers, copying them into Hadoop using curl.

47. Which best describes how TextInputFormat processes input files and line breaks? (my answer A)

A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

48. Given a directory of files with the following structure: 

line number, tab character, string:
Example:
1 abialkjfjkaoasdfjksdlkjhqweroij
2 kadfjhuwqounahagtnbvaswslmnbfgy
3 kjfteiomndscxeqalkzhtopedkfsikj
You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class);? (I think there is some issue with the options below; the right answer would probably be KeyValueTextInputFormat — see the driver sketch after the options.)

A.SequenceFileAsTextInputFormat
B.SequenceFileInputFormat
C.KeyValueFileInputFormat
D. BDBInputFormat
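
Assuming the intended answer really is KeyValueTextInputFormat, an old-API driver would set it roughly like this:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class LinePerRecordDriver {
  public static void main(String[] args) {
    JobConf conf = new JobConf(LinePerRecordDriver.class);
    conf.setJobName("line-per-record");

    // Each line becomes one record; the text before the first tab is the key
    // (the line number in this data set) and the rest of the line is the value.
    conf.setInputFormat(KeyValueTextInputFormat.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    // ... set the mapper and the input/output paths, then submit with JobClient.runJob(conf)
  }
}
```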

49. When is the earliest point at which the reduce method of a given Reducer can be called? 

  1. As soon as at least one mapper has finished processing its input split.
  2. As soon as a mapper has emitted at least one record.
  3. Not until all mappers have finished processing all records.
  4. It depends on the InputFormat used for the job.

     Answer 3

50. You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?

  1. Ingest the server web logs into HDFS using Flume.
  2. Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
  3. Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
  4. Channel these clickstreams into Hadoop using Hadoop Streaming.
  5. Sample the weblogs from the web servers, copying them into Hadoop using curl.

  Answer: A

51. Which process describes the lifecycle of a Mapper? (my answer B)

A. The JobTracker calls the TaskTracker’s configure() method, then its map() method and finally its close() method.
B. The TaskTracker spawns a new Mapper to process all records in a single input split.
C. The TaskTracker spawns a new Mapper to process each key-value pair.
D. The JobTracker spawns a new Mapper to process all records in a single file.

52. Determine which best describes when the reduce method is first called in a MapReduce job? (my answer B)

A. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The programmer can configure in the job what percentage of the intermediate data should arrive before the reduce method begins.
B. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called only after all intermediate data has been copied and sorted.
C. Reduce methods and map methods all start at the beginning of a job, in order to provide optimal performance for map-only or reduce-only jobs.
D. Reducers start copying intermediate key-value pairs from each Mapper as soon as it has completed. The reduce method is called as soon as the intermediate key-value pairs start to arrive.

53. Which describes how a client reads a file from HDFS?

  1. The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
  2. The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
  3. The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
  4. The client contacts the NameNode for the block location(s). The NameNode contacts the DataNode that holds the requested data block. Data is transferred from the DataNode to the NameNode, and then from the NameNode to the client.

   Answer: A

54. To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?

A. Serialize the data file, insert it into the JobConf object, and read the data into memory in the configure method of the mapper.
B. Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
C. Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
D. Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.

55. Assuming default settings, which best describes the order of data provided to a reducer’s reduce method:

  1. The keys given to a reducer aren’t in a predictable order, but the values associated with those keys always are.
  2. Both the keys and values passed to a reducer always appear in sorted order.
  3. Neither keys nor values are in any predictable order.
  4. The keys given to a reducer are in sorted order but the values associated with each key are in no predictable order.

   Answer: D

56. You wrote a map function that throws a runtime exception when it encounters a control character in input data. The input supplied to your mapper contains twelve such characters in total, spread across five file splits. The first four file splits each have two control characters and the last split has four control characters. Identify the number of failed task attempts you can expect when you run the job with mapred.max.map.attempts set to 4:

  1. You will have forty-eight failed task attempts
  2. You will have seventeen failed task attempts
  3. You will have five failed task attempts
  4. You will have twelve failed task attempts
  5. You will have twenty failed task attempts

  Answer: E (each of the five map tasks fails and is retried up to mapred.max.map.attempts = 4 times, giving 5 × 4 = 20 failed attempts)

57. You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text). Identify what determines the data types used by the Mapper for a given job.

  1. The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods
  2. The data types specified in HADOOP_MAP_DATATYPES environment variable
  3. The mapper-specification.xml file submitted with the job determines the mapper’s input key and value types.
  4. The InputFormat used by the job determines the mapper’s input key and value types.

  Answer: D

58. You need to run the same job many times with minor variations. Rather than hardcoding all job configuration options in your driver code, you’ve decided to have your Driver subclass org.apache.hadoop.conf.Configured and implement the org.apache.hadoop.util.Tool interface. Identify which invocation correctly passes mapred.job.name with a value of Example to Hadoop? (A Tool-based driver sketch follows the answer.)

  1. hadoop “mapred.job.name=Example” MyDriver input output
  2. hadoop MyDriver mapred.job.name=Example input output
  3. hadoop MyDriver -D mapred.job.name=Example input output
  4. hadoop setproperty mapred.job.name=Example MyDriver input output
  5. hadoop setproperty (“mapred.job.name=Example”) MyDriver input output

 Answer: C
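
For reference, a minimal Tool-based driver; ToolRunner's GenericOptionsParser is what consumes the -D mapred.job.name=Example argument in option 3 before run() sees the remaining arguments. The class body here is only a sketch.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // -D properties have already been applied to the Configuration by ToolRunner.
    Configuration conf = getConf();
    System.out.println("mapred.job.name = " + conf.get("mapred.job.name"));
    // ... build and submit the job here, using args[0] as input and args[1] as output
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Invoked as: hadoop MyDriver -D mapred.job.name=Example input output
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}
```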

59. MapReduce v2 (MRv2/YARN) is designed to address which two issues?

  1. Resource pressure on the JobTracker.
  2. Single point of failure in the NameNode.
  3. HDFS latency.
  4. Reduce complexity of the MapReduce APIs.
  5. Ability to run frameworks other than MapReduce, such as MPI.
  6. Standardize on a single MapReduce API.

 Answer: A & E

60. Can you use MapReduce to perform a relational join on two large tables sharing a key? Assume that the two tables are formatted as comma-separated files in HDFS.

  1. No, MapReduce cannot perform relational operations.
  2. Yes, but only if one of the tables fits into memory
  3. Yes, so long as both tables fit into memory.
  4. Yes.
  5. No, but it can be done with either Pig or Hive.

 Answer: D
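
A rough reduce-side join sketch for two comma-separated tables whose first column is the join key; the file-name check in setup() (files starting with "customers") is an assumption made up for illustration.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceSideJoin {

  // Tag each record with its source ("L" or "R") and emit it under the join key.
  public static class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    @Override
    protected void setup(Context context) {
      String file = ((FileSplit) context.getInputSplit()).getPath().getName();
      tag = file.startsWith("customers") ? "L" : "R";   // assumed file naming
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      int comma = line.indexOf(',');
      if (comma < 0) {
        return;                                         // skip malformed lines
      }
      String joinKey = line.substring(0, comma);
      context.write(new Text(joinKey), new Text(tag + "," + line.substring(comma + 1)));
    }
  }

  // For each key, pair every left-side record with every right-side record
  // (an inner join; both sides for one key are buffered in memory).
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> left = new ArrayList<String>();
      List<String> right = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("L,")) {
          left.add(s.substring(2));
        } else {
          right.add(s.substring(2));
        }
      }
      for (String l : left) {
        for (String r : right) {
          context.write(key, new Text(l + "," + r));
        }
      }
    }
  }
}
```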

61. What is a block, and what is the difference between a file system block and an HDFS block?

A chunk of data stored in a physical file is called a block: a fixed amount of data that is read or written as a unit. A local file system's default block size is small (traditionally 512 bytes), while the HDFS default block size is 64 MB. If a disk block is only partially filled, it still occupies the whole block on disk, whereas a file smaller than an HDFS block does not occupy a full block's worth of storage.

62. What is the use of the cluster ID?

A cluster ID is added to identify all the nodes in a cluster. After formatting the NameNode, this cluster ID helps identify which nodes belong to it. The namespace also generates block IDs to identify each block's information.

63. When and where do we use the serialization process?

Serialization and deserialization occur most frequently in distributed data processing. Hadoop's RPC protocol uses serialization and deserialization when transmitting data between different nodes.

64. Can you tell me the desirable properties of an RPC serialization format?

  • Compact – makes the best use of network bandwidth.
  • Fast – suitable for inter-process communication; a distributed system needs to read and write terabytes of data in seconds.
  • Extensible – protocols change over time to meet new requirements.
  • Interoperable – supports clients written in different languages from the server.

65. Why does Hadoop use its own serialization format instead of other RPC formats?

The Writable interface is central to Hadoop's common key and value types: it serializes and deserializes the data. It is compact and fast, but not easy to extend. Instead of using other serialization frameworks, Hadoop uses its own Writable interface to serialize and deserialize data.
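
A small example of the Writable contract described above: a hypothetical (id, amount) value type that Hadoop can serialize with write() and deserialize with readFields().

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A hypothetical value type: an (orderId, amount) pair serialized the Writable way.
public class OrderWritable implements Writable {

  private long orderId;
  private double amount;

  public OrderWritable() {                 // no-arg constructor required by the framework
  }

  public OrderWritable(long orderId, double amount) {
    this.orderId = orderId;
    this.amount = amount;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeLong(orderId);                // serialize fields in a fixed order
    out.writeDouble(amount);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    orderId = in.readLong();               // deserialize in exactly the same order
    amount = in.readDouble();
  }

  public long getOrderId()  { return orderId; }
  public double getAmount() { return amount; }
}
```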

66. When do you use safemode?

By default Hadoop automatically enters and leaves safemode when the cluster is started, but an admin can also enter and leave safemode manually.

When we are upgrading the Hadoop version, or making complex changes on the NameNode, the admin manually enables safemode. To save the metadata manually to disk and reset the edit log, the NameNode should be in safemode; then save the namespace with the commands below:

hadoop dfsadmin -safemode enter

hadoop dfsadmin -saveNamespace

hadoop dfsadmin -upgradeProgress status

hadoop dfsadmin -safemode leave

67. What is HFTP?

HFTP is a read-only Hadoop file system that lets us read data from a remote HDFS cluster. It does not allow writing or modifying the file system state. If we are moving data from one Hadoop version to another, use HFTP: it is wire-compatible between different versions of HDFS.

Example: hadoop distcp -i hftp://source_hostname:50070/path hdfs://dest_hostname:50070/path

68. Why does Hadoop sometimes show a “Connection refused” error?

Below are some reasons:

  • The NameNode service might be down.
  • SSH is not installed.
  • Hostname mismatch.

69)    Identify the MapReduce v2 (MRv2 / YARN) daemon responsible for launching application containers and monitoring application resource usage?

  • ApplicationMaster
  • ResourceManager
  • NodeManager
  • ApplicationMasterService
  • TaskTracker
  • JobTracker

70)    MapReduce v2 (MRv2/YARN) is designed to address which two issues?

  • Reduce complexity of the MapReduce APIs.
  • Resource pressure on the JobTracker
  • Ability to run frameworks other than MapReduce, such as MPI.
  • Single point of failure in the NameNode.
  • Standardize on a single MapReduce API.
  • HDFS latency.

71)    You have a large dataset of key-value pairs, where the keys are strings and the values are integers. For each unique key, you want to identify the largest integer. In writing a MapReduce program to accomplish this, can you take advantage of a combiner? (See the sketch after the answer.)

Yes
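
Because taking a maximum is commutative and associative, the same Reducer class can safely be registered as the combiner. A brief sketch (class name illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int max = Integer.MIN_VALUE;
    for (IntWritable v : values) {
      max = Math.max(max, v.get());   // the max of partial maxima equals the overall max
    }
    context.write(key, new IntWritable(max));
  }
}

// Driver: job.setCombinerClass(MaxReducer.class);  job.setReducerClass(MaxReducer.class);
```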

72)    You want to perform analysis on a large collection of images. You want to store this data in HDFS and process it with MapReduce but you also want to give your data analysts and data scientists the ability to process the data directly from HDFS with an interpreted high-level programming language like Python. Which format should you use to store this data in HDFS?

  • SequenceFiles
  • Avro
  • HTML
  • XML
  • JSON
  • CSV

73)    What is the difference between a failed task attempt and a killed task attempt?

  • A failed task attempt is a task attempt that did not generate any key value pairs. A killed task attempt is a task attempt that threw an exception, and thus killed by the execution framework.
  • A failed task attempt is a task attempt that threw a RuntimeException (i.e., the task fails). A killed task attempt is a task attempt that threw any other type of exception (e.g., IOException); the execution framework catches these exceptions and reports them as killed.
  • A failed task attempt is a task attempt that threw an unhandled exception. A killed task attempt is one that was terminated by the JobTracker.
  • A failed task attempt is a task attempt that completed, but with an unexpected status value. A killed task attempt is a duplicate copy of a task attempt that was started as part of speculative execution.

74)    You have user employee records in your OLTP database that you want to join with project details records in another OLTP database. You have already put this into the Hadoop file system. How will you obtain these user records?

  • HDFS command
  • Pig LOAD command
  • Sqoop command
  • Hive LOAD DATA command
  • Flume agents

75)    For each intermediate key, each reducer task can emit:

  • One final key value pair per key; no restrictions on the type
  • One final key-value pair per value associated with the key; no restrictions on the type
  • As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type
  • As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs

76)    You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?

A.    Ingest the server web logs into HDFS using Flume.
B.    Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
C.     Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
D.     Channel these clickstreams into Hadoop using Hadoop Streaming.
E.     Sample the weblogs from the web servers, copying them into Hadoop using curl.

77)    Which type of algorithms are difficult to express as MapReduce?

  • Large-scale graph algorithms that require one-step link traversal
  • Algorithms that require global, shared state
  • Relational operations on large amounts of structured and semi structured data
  • Text analysis algorithms on large collections of unstructured text (e.g., Web crawls)
  • Algorithms that require applying the same mathematical function to large numbers of individual binary records

78)    To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?

  • Place the datafile in the DataCache and read the data into memory in the configure method of the mapper.
  • Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
  • Place the data file in the DistributedCache and read the data into memory in the configure/setup method of the mapper.
  • Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.

79)    What is the disadvantage of using multiple reducers with the default HashPartitioner and distributing your workload across your cluster?

  • You will no longer be able to take advantage of a Combiner.
  • By using multiple reducers with the default HashPartitioner, output files may not be in globally sorted order.
  • You will not be able to compress the intermediate data.
  • There are no concerns with this approach. It is always advisable to use multiple reducers.

80)    Which three distcp features can you utilize on a Hadoop cluster?

A. Use distcp to copy files only between two clusters or more. You cannot use distcp to copy data between directories inside the same cluster.
B. Use distcp to copy HBase table files.
C. Use distcp to copy physical blocks from the source to the target destination in your cluster.
D. Use distcp to copy data between directories inside the same cluster.
E. Use distcp to run an internal MapReduce job to copy files.

81)    Which best describes how TextInputFormat processes input files and line breaks?

  • The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
  • Input file splits may cross line breaks. A line that crosses file splits is ignored.
  • Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
  • Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line
  • Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.

82)    Which two steps must you take if you are running a Hadoop cluster with a single NameNode and six DataNodes, and you want to change a configuration parameter so that it affects all six DataNodes?

A. You must restart the NameNode daemon to apply the changes to the cluster.
B. You must restart all six DataNode daemons to apply the changes to the cluster.
C. You don’t need to restart any daemon, as they will pick up changes automatically.
D. You must modify the configuration files on each of the six DataNode machines.
E. You must modify the configuration files on only one of the DataNode machines.
F. You must modify the configuration files on the NameNode only. DataNodes read their configuration from the master nodes.

83) You use the hadoop fs –put command to write a 300 MB file using an HDFS block size of 64 MB. Just after this command has finished writing 200 MB of this file, what would another user see when trying to access this file?

  • They would see the current state of the file, up to the last bit written by the command.
  • They would see the content of the file through the last completed block.
  • They would see no content until the whole file is written and closed
  • They would see Hadoop throw a ConcurrentFileAccessException when they try to access this file.

84) Your cluster has 10 DataNodes, each with a single 1 TB hard drive. You utilize all your disk capacity for HDFS, reserving none for MapReduce. You implement default replication settings. What is the storage capacity of your Hadoop cluster (assuming no compression)?

  • about 11 TB
  • about 3 TB
  • about 5 TB
  • about 10 TB

85) You are developing a MapReduce job for sales reporting. The mapper will process input keys representing the year (IntWritable) and input values representing product identifiers (Text).
Identify what determines the data types used by the Mapper for a given job.

  • The InputFormat used by the job determines the mapper’s input key and value types.
  • The data types specified in HADOOP_MAP_DATATYPES environment variable
  • The mapper-specification.xml file submitted with the job determines the mapper’s input key and value types.
  • The key and value types specified in the JobConf.setMapInputKeyClass and JobConf.setMapInputValuesClass methods

86) When can a reduce class also serve as a combiner without affecting the output of a MapReduce program?

  • When the types of the reduce operation’s input key and input value match the types of the reducer’s output key and output value and when the reduce operation is both commutative and associative.
  • When the signature of the reduce method matches the signature of the combine method
  • Always. Code can be reused in Java since it is a polymorphic object-oriented programming language
  • Always. The point of a combiner is to serve as a mini-reducer directly after the map phase to increase performance.
  • Never. Combiners and reducers must be implemented separately because they serve different purposes.

