June 6, 2015 at 8:23 pm #3562
Interview Questions from my interviews
1. Hive – Where do you use Internal or Managed table? What scenarios?
2. In your resume, what do you mean by, “monitoring & managing MapReduce jobs”? Explain?
3. Interviewer’s Project: How to modify the RDBMS’ nested SQL queries into the Hadoop framework using Pig.
4. Sqoop: Need to know this very well. Some of the current projects are importing data from other RDBMS sources into HDFS.
5. Can you join or transform tables/columns when importing using Sqoop?
6. Can you do the above with different RDBMSs (not clear)?
7. How do you transfer flat files from Unix systems?
8. What is your Pig/Hive programming level (1- 10)? (Almost all interviewers asked this.)
9. Learn Scala! – Interviewer repeatedly told me.
Other Interview Questions:
1. Hive – Internal vs. External tables. How do you save your files in Hive?
2. Sqoop – Incremental append vs. last-modified; relate this to your project.
3. Sqoop – How to check if RDBMS Table Columns added/removed and how to incorporate
these changes into the import job.
4. What are the challenges you’ve faced in your project? Give 2 examples.
5. How do you check data integrity? (log files)
6. How to improve performance in your script (PIG)?
7. Tell me about your project work.
8. How do you use Partitioning/Bucketing in your project? (Examples from your project)
9. Where do you look for answers? (user groups, Apache Web, stack overflow)
10. NOSQL- HBase – Unstructured data storage?
11. How to debug Production issue? Give example. (logs, script counters, JVM)
12. Data Ingestion
13. What is the file size you’ve used?
14. Does Hive support indexing? (How does this relate to Partition and Bucketing)
15. Does Pig support conditional loops?
16. Hive – What type of data is stored?
17. Recruiter: In your experience, how big is the jump from DB developer to Hadoop without Java?
More Technical Interview Questions:
10. What functions did you use in PIG?
11. Filter – What did you filter out?
12. Join – What did you join?
13. What is your cluster size?
14. What is the file size for production environment?
15. How long does it take to run your script in Production cluster?
16. Are you planning for anything to improve the performance?
17. What size of file do you use for Development?
18. What did you work on in HBase?
19. Why Hadoop? (Compare to RDBMS.)
20. Hive – What did you do to increase the performance?
21. PIG – What did you do to increase the performance?
22. What Java UDF did you write?
23. What scenario do you think you can use Java for?
24. You can process log files in RDBMS too. Why Hadoop?
25. Hive partitioning – your project example? Why?
26. Hive – What file format do you use in your work? (Avro, Parquet, Sequence file)
27. Hadoop – What is the challenge or difficulty you’ve faced?
28. PIG – What is the challenge or difficulty you’ve faced?
29. Flume – What is the challenge or difficulty you’ve faced?
30. Sqoop – What is the challenge or difficulty you’ve faced? (he didn’t ask this question)
31. How experienced are you in Linux?
32. What shell type do you use?
33. How about your experience in Cloudera Manager?
35. Do you use Impala? (I compared it with Hive and explained in more details)
36. How do you select the Eco system tools for your project?
InfoSys – Interview Questions:
As you can see, questions are mostly based on theory.
1. Why Hadoop? (Compare to RDBMS)
2. What would happen if NameNode failed? How do you bring it up?
3. What details are in the “fsimage” file?
4. What is SecondaryNameNode?
5. Explain the MapReduce processing framework? (start to end)
6. What is Combiner? Where does it fit and give an example? Preferably from your project.
7. What is Partitioner? Why do you need it and give an example? Preferably from your project.
8. Oozie – What are the nodes?
9. What are the actions in Action Node?
10. Explain your Pig project?
11. What log file loaders did you use in Pig?
12. Hive Joining? What did you join?
13. Explain Partitioning & Bucketing (based on your project)?
14. Why do we need bucketing?
15. Did you write any Hive UDFs?
16. Filter – What did you filter out?
21. Impala? Explain the use of Impala?
22. Cassandra? What do you know about Cassandra?
24. What is your cluster size?
25. What are the DataNode configurations? (RAM, CPU cores, disk size)
26. What are the NameNode configurations? (RAM, CPU cores, disk size)
27. How many map slots & reducer slots are configured in each DataNode? (he didn’t ask this)
28. How do you copy file from cluster to cluster?
29. What commands do you use to check system health, jobs, etc.?
30. Do you use Cloudera Manager to monitor and manage the jobs, cluster, etc.?
31. What is Speculative execution?
32. What do you know about Scala? (The interviewer asked about the skills I listed in my resume.)

Certification-style multiple-choice questions:

1) Which best describes how TextInputFormat processes input files and line breaks?
A. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
B. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
C. The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
D. Input file splits may cross line breaks. A line that crosses file splits is ignored.
E. Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.

2) For each intermediate key, each reducer task can emit:
A. As many final key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
B. As many final key-value pairs as desired, but they must have the same type as the intermediate key-value pairs.
C. As many final key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
D. One final key-value pair per value associated with the key; no restrictions on the type.
E. One final key-value pair per key; no restrictions on the type.
2) Which describes how a client reads a file from HDFS?
- The client queries the NameNode for the block location(s). The NameNode returns the block location(s) to the client. The client reads the data directly off the DataNode(s).
- The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly off the DataNode.
- The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond to the NameNode, and the NameNode redirects the client to the DataNode that holds the requested data block(s). The client then reads the data directly off the DataNode.
- The client contacts the NameNode for the block location(s). The NameNode contacts the
DataNode that holds the requested data block. Data is transferred from the DataNode to the
NameNode, and then from the NameNode to the client.
3) You want to populate an associative array in order to perform a map-side join. You’ve decided to put this information in a text file, place that file into the DistributedCache, and read it in your Mapper before any records are processed. Identify which method in the Mapper you should use to implement code for reading the file and populating the associative array.
4) You have just executed a MapReduce job. Where is intermediate data written to after being
emitted from the Mapper’s map method?
- Intermediate data is streamed across the network from the Mapper to the Reducer and is never written to disk.
- Into in-memory buffers on the TaskTracker node running the Mapper that spill over and are written into HDFS.
- Into in-memory buffers that spill over to the local file system of the TaskTracker node running the Mapper.
- Into in-memory buffers that spill over to the local file system (outside HDFS) of the TaskTracker node running the Reducer.
- Into in-memory buffers on the TaskTracker node running the Reducer that spill over and are written into HDFS.
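For this question, the expected answer is that map output goes into in-memory buffers that spill to the local file system (outside HDFS) of the node running the Mapper. As a toy illustration of that spill behavior (plain Python, not Hadoop code; the `SpillingBuffer` class and its tiny 3-record limit are invented for this sketch, whereas real Hadoop sizes the buffer via configuration such as `io.sort.mb`):

```python
# Toy sketch of the map-side spill: outputs accumulate in an in-memory buffer
# that, when full, is spilled to the *local* filesystem (not HDFS).
import os
import tempfile


class SpillingBuffer:
    def __init__(self, limit=3):
        self.limit = limit          # invented tiny limit for illustration
        self.buffer = []
        self.spill_files = []

    def collect(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) >= self.limit:
            self.spill()

    def spill(self):
        if not self.buffer:
            return
        # Spill to a local temp file, i.e. local disk outside HDFS.
        fd, path = tempfile.mkstemp(prefix="spill_")
        with os.fdopen(fd, "w") as f:
            for k, v in sorted(self.buffer):   # spills are written sorted by key
                f.write(f"{k}\t{v}\n")
        self.spill_files.append(path)
        self.buffer = []


buf = SpillingBuffer(limit=3)
for pair in [("b", 1), ("a", 1), ("c", 1), ("a", 2)]:
    buf.collect(*pair)
buf.spill()  # final flush at end of the map task
print(len(buf.spill_files))  # 2 spill files on local disk
```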
5) You want to understand more about how users browse your public website, such as which pages they visit prior to placing an order. You have a farm of 200 web servers hosting your website. How will you gather this data for your analysis?
- Ingest the server web logs into HDFS using Flume.
- Write a MapReduce job, with the web servers for mappers, and the Hadoop cluster nodes for reducers.
- Import all users’ clicks from your OLTP databases into Hadoop, using Sqoop.
- Channel these clickstreams into Hadoop using Hadoop Streaming.
- Sample the weblogs from the web servers, copying them into Hadoop using curl.
6) Which best describes how TextInputFormat processes input files and line breaks?
- Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the beginning of the broken line.
- Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReaders of both splits containing the broken line.
- The input file is split exactly at the line breaks, so each RecordReader will read a series of complete lines.
- Input file splits may cross line breaks. A line that crosses file splits is ignored.
- Input file splits may cross line breaks. A line that crosses file splits is read by the RecordReader of the split that contains the end of the broken line.
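For this TextInputFormat question, the first option matches Hadoop's actual behavior: a line crossing a split boundary is read in full by the RecordReader of the split containing the line's start, and the next split's reader skips the partial line at its front. A toy Python sketch of that rule (not Hadoop's Java code; `read_split` and the sample bytes are made up for illustration):

```python
# Toy simulation of LineRecordReader's split handling: a reader skips the first
# (partial) line unless its split starts at byte 0, and reads past its split's
# end to finish the last line it started.
def read_split(data: bytes, start: int, end: int):
    pos = start
    if start != 0:
        # Skip the partial line; the previous split's reader owns it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    lines = []
    while pos < end:  # only lines *starting* inside [start, end)
        nl = data.find(b"\n", pos)
        if nl == -1:
            lines.append(data[pos:])
            break
        lines.append(data[pos:nl])  # may read past `end` to finish the line
        pos = nl + 1
    return lines


data = b"alpha\nbravo crosses the boundary\ncharlie\n"
# Split the file at byte 10, which falls inside the second line.
print(read_split(data, 0, 10))          # [b'alpha', b'bravo crosses the boundary']
print(read_split(data, 10, len(data)))  # [b'charlie']
```

The second line straddles the byte-10 boundary, yet only the first split's reader emits it, so no line is duplicated or lost.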
7) For each input key-value pair, mappers can emit:
- As many intermediate key-value pairs as desired. There are no restrictions on the types of those key-value pairs (i.e., they can be heterogeneous).
- As many intermediate key-value pairs as desired, but they cannot be of the same type as the input key-value pair.
- One intermediate key-value pair, of a different type.
- One intermediate key-value pair, but of the same type.
- As many intermediate key-value pairs as desired, as long as all the keys have the same type and all the values have the same type.
8) You have user profile records in your OLTP database that you want to join with web logs you have already ingested into the Hadoop file system. How will you obtain these user records?
- HDFS command
- Pig LOAD command
- Sqoop import
- Hive LOAD DATA command
- Ingest with Flume agents
- Ingest with Hadoop Streaming
9) Given a directory of files with the following structure: line number, tab character, string. You want to send each line as one record to your Mapper. Which InputFormat should you use to complete the line: conf.setInputFormat(____.class); ?
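The expected answer here is KeyValueTextInputFormat, which splits each line at the first tab: the text before it becomes the key and the remainder becomes the value. A toy Python sketch of that splitting rule (`key_value_records` is a made-up helper for illustration, not a Hadoop API):

```python
# Toy sketch of KeyValueTextInputFormat semantics: split each line at the
# FIRST tab only; any later tabs stay inside the value.
def key_value_records(lines):
    records = []
    for line in lines:
        key, sep, value = line.partition("\t")
        records.append((key, value if sep else ""))
    return records


lines = ["1\tAlpha", "2\tBeta gamma", "3\tDelta\twith embedded tab"]
print(key_value_records(lines))
```

Note that a line with no tab at all yields the whole line as the key and an empty value, mirroring the Hadoop behavior.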
10) You have written a Mapper which invokes the following five calls to the OutputCollector.collect method:
output.collect(new Text("Apple"), new Text("Red"));
output.collect(new Text("Banana"), new Text("Yellow"));
output.collect(new Text("Apple"), new Text("Yellow"));
output.collect(new Text("Cherry"), new Text("Red"));
output.collect(new Text("Apple"), new Text("Green"));
How many times will the Reducer’s reduce method be invoked?
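The answer is 3: the shuffle groups values by key, and reduce() is invoked once per unique key (Apple, Banana, Cherry). A toy Python simulation of that grouping (plain Python, not Hadoop code):

```python
# Toy sketch of the shuffle: values are grouped by key, and reduce() runs
# once per unique key, so five collect() calls yield three invocations.
from collections import defaultdict

mapper_output = [
    ("Apple", "Red"), ("Banana", "Yellow"), ("Apple", "Yellow"),
    ("Cherry", "Red"), ("Apple", "Green"),
]

groups = defaultdict(list)
for key, value in mapper_output:
    groups[key].append(value)

reduce_calls = 0
for key in sorted(groups):   # keys arrive at the reducer in sorted order
    reduce_calls += 1        # one reduce() invocation per key
print(reduce_calls)          # 3
```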
11) To process input key-value pairs, your mapper needs to load a 512 MB data file in memory. What is the best way to accomplish this?
- Serialize the data file, insert it in the JobConf object, and read the data into memory in the configure method of the mapper.
- Place the data file in the DistributedCache and read the data into memory in the map method of the mapper.
- Place the data file in the DataCache and read the data into memory in the configure method of the mapper.
(Note: the following option was not in the exam: Place the data file in the DistributedCache and read the data into memory in the configure method of the mapper.)
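The map-side-join questions above both point at the same pattern: put the small file in the DistributedCache and load it once in configure() (setup() in the new API), before any map() calls are made. A toy Python sketch of the resulting join (the `JoinMapper` class and tab-separated lookup format are invented for illustration, not a Hadoop API):

```python
# Toy sketch of a map-side join: the lookup file (as if read from the
# DistributedCache) is loaded once in a setup step, then every record
# is joined against it in map() without any shuffle.
class JoinMapper:
    def setup(self, cache_lines):
        # Runs once before any records, like configure()/setup() in Hadoop.
        self.lookup = dict(line.split("\t", 1) for line in cache_lines)

    def map(self, key, value):
        # Enrich each record from the in-memory lookup table.
        return (key, value, self.lookup.get(key, "UNKNOWN"))


cache = ["u1\tAlice", "u2\tBob"]   # small side table
mapper = JoinMapper()
mapper.setup(cache)
print(mapper.map("u1", "login"))   # ('u1', 'login', 'Alice')
print(mapper.map("u3", "logout"))  # ('u3', 'logout', 'UNKNOWN')
```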
12) In a MapReduce job, the reducer receives all values associated with the same key. Which statement best describes the ordering of these values?
- The values are in sorted order.
- The values are arbitrarily ordered, and the ordering may vary from run to run of the same MapReduce job.
- The values are arbitrarily ordered, but multiple runs of the same MapReduce job will always have the same ordering.
- Since the values come from mapper outputs, the reducers will receive contiguous sections of sorted values.
13) You need to create a job that does frequency analysis on input data. You will do this by writing a Mapper that uses TextInputFormat and splits each value (a line of text from an input file) into individual characters. For each of these characters, you will emit the character as a key and an IntWritable as the value. As this will produce proportionally more intermediate data than input data, which two resources should you expect to be bottlenecks?
- Processor and network I/O
- Disk I/O and network I/O
- Processor and RAM
- Processor and disk I/O
14) The last 12 questions were based on programming.
- Inverted index: what will be the output without putting any condition in the reducer?
Input:
Doc1: j t b
Doc2: j j t
Doc3: t t b
What will be the output?
Option A:
j: doc1, doc2
t: doc1, doc2, doc3
b: doc1, doc3
Option B:
j: doc1, doc2, doc2
t: doc1, doc2, doc3, doc3
b: doc1, doc3
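Without any deduplication in the reducer, the second option is the output: a term that occurs twice in one document lists that document twice. A toy Python simulation of the job (plain Python, not a real MapReduce program):

```python
# Toy sketch of the inverted-index job: the mapper emits (term, doc_id) for
# every token, and a reducer with no conditions simply concatenates the
# doc ids, so repeated terms produce repeated ids.
from collections import defaultdict

docs = {"doc1": "j t b", "doc2": "j j t", "doc3": "t t b"}

index = defaultdict(list)
for doc_id in sorted(docs):
    for term in docs[doc_id].split():   # one (term, doc_id) pair per token
        index[term].append(doc_id)      # no dedup in the "reducer"

print(dict(index))
# j appears once in doc1 and twice in doc2 -> ['doc1', 'doc2', 'doc2']
```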
- How many Mappers? (Neither the block size nor the input file size was mentioned.)
I answered that we can’t determine this, as we don’t have enough information.
- How many Reducers if we didn’t set the number of reducers?
I answered 3, as there are three unique terms.
- This program was based on file input and output; it gave two paths for two input files and one path for the output file.
What will be the output in output.txt?
- How many mappers if file1 is 256 MB, file2 is 10 GB, and the block size is 128 MB?
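With the default of one mapper per input split and splits falling on HDFS block boundaries, the usual answer is 256/128 = 2 splits for file1 plus 10240/128 = 80 splits for file2, i.e. 82 mappers (assuming both files are splittable). The arithmetic in Python:

```python
# Split-count arithmetic: each file contributes ceil(size / block size)
# splits, and by default each split gets one mapper.
import math

BLOCK_MB = 128
files_mb = [256, 10 * 1024]  # file1 = 256 MB, file2 = 10 GB

mappers = sum(math.ceil(size / BLOCK_MB) for size in files_mb)
print(mappers)  # 2 + 80 = 82
```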
- This question replaced the code of the map function: before the map task it used an ArrayList, and the new code to be placed in had only the split logic.
- This program was based on finding the word with the maximum frequency, along with its count (the condition in the input uses > (greater than)).
The options were very confusing: only the second-highest-frequency word appeared, with a different frequency for each word; the highest-frequency word was not mentioned.
E.g. G 6, g 3, g 4, E 5 (in place of the other words, “g” is written, but the counts for the others are correct).
- Sqoop: the table “customer” needs to be made into a Hive table.
Create a Hive table in interactive mode using e.
Delete the table (DROP will be used).
Export data from HDFS to an external database (is it required to use an update field, like the given user id, or not?).
The topic ‘Hadoop Interview Questions from Various Candidate Interviews’ is closed to new replies.