1. As the data is replicated thrice in HDFS, does it mean that any calculation done on one node will also be replicated on the other two?
No. When a MapReduce job is submitted, the calculation for each block of data is done only once, not once per replica. The master node knows exactly which nodes hold each block and schedules the task on one of them. Only if that node stops responding, and is therefore assumed to have failed, is the required calculation re-run on a node holding another replica.
2. What happens if you get a ‘connection refused java exception’ when you type hadoop fs -ls /?
It could mean that the NameNode daemon is not running on our Hadoop cluster.
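A quick diagnostic sketch, assuming shell access to the NameNode host: the jps tool that ships with the JDK lists the running Hadoop daemons (the start-dfs.sh script name may vary by distribution and version).

```shell
# List running Hadoop JVM daemons; if NameNode is absent from the output,
# HDFS commands such as 'hadoop fs -ls /' will get connection refused.
jps
# If the NameNode is down, restart HDFS (script name varies by version):
start-dfs.sh
```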
3. Can we have multiple entries in the master files?
Yes, we can have multiple entries in the Master files.
4. Why do we need a password-less SSH in Fully Distributed environment?
We need password-less SSH in a Fully Distributed environment because when the cluster is live and running, communication is very frequent: the ResourceManager/NameNode should be able to send a task to the NodeManagers/DataNodes quickly, without waiting on interactive password prompts.
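Password-less SSH is typically set up by generating a key pair on the master node and copying the public key to each slave. A minimal sketch (the user and hostname `hadoop@slave1` are placeholders):

```shell
# On the master node: generate an RSA key pair with an empty passphrase.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Copy the public key to each slave node in the cluster.
ssh-copy-id hadoop@slave1
# Verify that login now works without a password prompt.
ssh hadoop@slave1 hostname
```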
5. Does this lead to security issues?
No, not at all. A Hadoop cluster is an isolated cluster, and generally it has nothing to do with the internet; it has a different kind of configuration. We needn't worry about that kind of security breach, for instance someone hacking in through the internet. Hadoop also supports Kerberos as a strong-security way to authenticate when connecting to other machines to fetch and process data.
6. Which port does SSH work on?
SSH works on Port 22, though it can be configured. 22 is the default Port number.
7. If we only had 32 MB of memory, how could we sort one terabyte of data?
Use an external merge sort: read a chunk of data small enough to fit in memory, sort it, and write the sorted run to disk; repeat until the whole data set has been partitioned into sorted runs, then merge those sorted runs into one ordered output while writing the result back to disk.
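The idea can be sketched in a few lines of Python. This is illustrative only: the function name, the in-memory input list, and the tiny chunk size are assumptions for the demo; a real implementation would stream the input from disk and tune the chunk size to available memory.

```python
# External merge sort sketch: sort data larger than memory by sorting
# fixed-size chunks in memory, spilling each sorted run to a temp file,
# then streaming a k-way merge of the runs back to disk.
import heapq
import os
import tempfile

def external_sort(values, chunk_size, out_path):
    run_paths = []
    # Phase 1: sort chunks that fit in memory and spill them as runs.
    for i in range(0, len(values), chunk_size):
        chunk = sorted(values[i:i + chunk_size])
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
    # Phase 2: k-way merge of the sorted runs, streaming to the output
    # file so only one line per run is held in memory at a time.
    runs = [open(p) for p in run_paths]
    try:
        with open(out_path, "w") as out:
            for line in heapq.merge(*runs, key=int):
                out.write(line)
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

This mirrors what Hadoop itself does during the shuffle phase: map outputs are sorted in memory, spilled to disk, and merged.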
8. How can we create an empty file in HDFS ?
We can create an empty file with the hadoop fs -touchz command, which creates a zero-byte file.
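For example (the path is illustrative):

```shell
# Create a zero-byte file in HDFS.
hadoop fs -touchz /user/hadoop/empty.txt
# Confirm it exists and its size is 0.
hadoop fs -ls /user/hadoop/empty.txt
```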
9. How can we see the contents of a Snappy compressed file or SequenceFile via command line?
With the help of the $ hadoop fs -text /user/hadoop/filename command we can see the contents of a SequenceFile, or of any compressed file in a supported format, as text.
10. How can we check the existence of a file in HDFS ?
The hadoop fs -test command is used for file test operations. The syntax is: hadoop fs -test -[ezd] <path>
Here -e checks the existence of a file, -z checks whether the file is zero length, and -d checks whether the path is a directory. The test command returns 0 on success and 1 otherwise.
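A typical usage sketch, checking the exit code from a shell script (the path is illustrative):

```shell
# -e: exit code 0 if the path exists, 1 otherwise.
hadoop fs -test -e /user/hadoop/file.txt
if [ $? -eq 0 ]; then
  echo "file exists"
else
  echo "file does not exist"
fi
```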
11. How can we set the replication factor of directory via command line?
hadoop fs -setrep is used to change the replication factor of a file. The syntax is: hadoop fs -setrep [-R] [-w] <rep> <path>. Use the -R option to change the replication factor recursively for everything under a directory.
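For example, to set the replication factor of every file under a directory to 2 (the path is illustrative):

```shell
# -R applies the change recursively; -w waits until replication completes.
hadoop fs -setrep -R -w 2 /user/hadoop/dir
```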
12. How can we apply the permission to all the files under a HDFS directory recursively?
Using $ hadoop fs -chmod -R 755 /user/hadoop/dir command we can set the permissions to a directory recursively.
13. What is hadoop fs -stat command used for?
hadoop fs -stat returns status information about a file; by default it prints the last modification date and time. The syntax is: hadoop fs -stat [format] <path>
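The stat command also accepts a format string; for example (the path is illustrative):

```shell
# Default output: the last modification time of the path.
hadoop fs -stat /user/hadoop/file.txt
# %y = modification time, %r = replication factor, %b = size in bytes.
hadoop fs -stat "%y %r %b" /user/hadoop/file.txt
```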
14. What is expunge command in HDFS and why is it used for ?
The hadoop fs -expunge command is used to empty the trash in HDFS (files removed with hadoop fs -rm are first moved to the trash directory). The syntax is: hadoop fs -expunge
15. How can we see the output of a MR job as a single file if the reducer might have created multiple part-r-0000* files ?
We can use the hadoop fs -getmerge command to combine all the part-r-0000* files into a single local file, which can then be browsed to view the entire result of the MR job at once. The syntax is: hadoop fs -getmerge <src> <localdst> [addnl]
The addnl option adds a newline character at the end of each file.
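For example, merging a job's output directory into one local file (the paths are illustrative):

```shell
# Concatenate all part files under the job output directory into a
# single file on the local filesystem, in order of part file name.
hadoop fs -getmerge /user/hadoop/job-output /tmp/job-output.txt
```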
16. Which of the following is most important when selecting hardware for our new Hadoop cluster?
1. The number of CPU cores and their speed.
2. The amount of physical memory.
3. The amount of storage.
4. The speed of the storage.
5. It depends on the most likely workload.
Answer 5 – Though some general guidelines are possible, and we may need to generalize when our cluster will be running a variety of jobs, the best fit depends on the anticipated workload.
17. Why would you likely not want to use network storage in your cluster?
1. Because it may introduce a new single point of failure.
2. Because it most likely has approaches to redundancy and fault-tolerance that may be unnecessary given Hadoop's fault tolerance.
3. Because such a single device may have inferior performance to Hadoop's use of multiple local disks simultaneously.
4. All of the above.
Answer 4: Network storage comes in many flavors, but in many cases we may find a large Hadoop cluster of hundreds of hosts reliant on a single (or usually a pair of) storage devices. This adds a new failure scenario to the cluster, and one that is not particularly unlikely. Where storage technology does look to address failure mitigation, it is usually through disk-level redundancy.
18. We will be processing 10 TB of data on our cluster. Our main MapReduce job processes financial transactions, using them to produce statistical models of behavior and future forecasts. Which of the following hardware choices would be our first choice for the cluster?
1. 20 hosts each with fast dual-core processors, 4 GB memory, and one 500 GB disk drive.
2. 30 hosts each with fast dual-core processors, 8 GB memory, and two 500 GB disk drives.
3. 30 hosts each with fast quad-core processors, 8 GB memory, and one 1 TB disk drive.
4. 40 hosts each with 16 GB memory, fast quad-core processors, and four 1 TB disk drives.
Answer 3. Probably! We would suggest avoiding the first configuration: though it has just enough raw storage and is far from underpowered, there is a good chance the setup will provide little room for growth. An increase in data volumes would immediately require new hosts, and additional complexity in the MapReduce job could require additional processor power or memory.
Configurations 2 and 3 both look good, as they have surplus storage for growth and provide similar headroom for both processor and memory. Configuration 2 will have the higher disk I/O and configuration 3 the better CPU performance. Since the primary job is involved in financial modelling and forecasting, we expect each task to be reasonably heavyweight in terms of CPU and memory needs. Configuration 2 may have higher I/O, but if the processors are running at 100 percent utilization it is likely the extra disk throughput will not be used, so the hosts with greater processor power are likely the better fit.
Configuration 4 is more than adequate for the task, and we don't choose it for that very reason: why buy more capacity than we know we need?