This post assumes that the reader already knows how to install and configure Hadoop on a single-node cluster. If not, it is better to first go through the post Installing Hadoop on single node cluster.
In this post we will briefly discuss installing and configuring Hadoop 2.3.0 on a multi-node cluster. For simplicity, we will consider a small cluster of three nodes, each meeting the minimum configuration below.
Install Hadoop on Multi Node Cluster:
- All three machines have the latest 64-bit Ubuntu OS installed. At the time of writing this post, Ubuntu 14.04 is the latest version available.
- All three machines must have Java version 1.6 or higher installed. If not, follow the instructions in the post Installing Java on Ubuntu on all three machines and set up the JAVA_HOME and PATH environment variables appropriately.
- All three machines must have SSH (Secure Shell) installed. If it is not already installed, please follow the instructions in the post Installing SSH on Ubuntu.
- Let's consider the IP addresses and hostnames of these three machines, as listed in the /etc/hosts file of each machine, as shown below.
We plan to use the master node to run the name node and resource manager along with a data node and node manager, while slave-1 and slave-2 will run only the data node and node manager daemons.
- Create a separate user, hduser, on all three machines with the command shown below.
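The user-creation step can be sketched as follows. This is only a minimal example, assuming the stock Ubuntu `adduser` tool; the helper name `ensure_user` is made up for illustration, and the existence check is an addition so the step can be repeated safely.

```shell
# Hypothetical helper: create the given user unless it already exists.
ensure_user() {
    if id -u "$1" >/dev/null 2>&1; then
        echo "user $1 already exists"
    else
        # adduser prompts interactively for a password and user details
        sudo adduser "$1"
    fi
}

# Run on master, slave-1 and slave-2:
# ensure_user hduser
```

Using the same user name on every node keeps the later SSH and Hadoop steps identical across machines.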
Configure /etc/hosts file on each machine
By default on Ubuntu, the /etc/hosts file on each machine contains the IP address and hostname as shown below.
For the three machines to recognize each other, we need to update the /etc/hosts file on each machine with the IP addresses and hostnames of all three machines. Only a superuser (via sudo) can edit this file. The format for specifying the hostname and IP address is shown below.
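As an illustration of that format, the sketch below prints four entries: localhost plus the three cluster nodes. The 192.168.1.x addresses are placeholders assumed for this example, not values from the original setup, and the function name `cluster_hosts` is hypothetical. Substitute the real IP address of each machine.

```shell
# Prints example /etc/hosts entries in "IP-address hostname" format.
# The 192.168.1.x addresses are placeholders; use your machines' real IPs.
cluster_hosts() {
cat <<'EOF'
127.0.0.1      localhost
192.168.1.100  master
192.168.1.101  slave-1
192.168.1.102  slave-2
EOF
}

# Review the entries, then edit /etc/hosts with superuser rights, for example:
# cluster_hosts
# sudo nano /etc/hosts
```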
Copy the above four lines into the /etc/hosts file on all three machines. With this setting, each machine can reach the others by hostname instead of specifying the IP address every time.
SSH Configuration for Cluster setup
Before installing Hadoop on the three machines, we need to set them up as a cluster in which the master node can connect to the slave nodes without a password. The master should also be able to connect to itself without any authentication or password.
For this, as per our prerequisites, SSH must already be installed on all three machines.
- Create a new RSA key pair on the master node with the command below.
Press Enter at the first prompt to accept the default file for saving the key. For the passphrase, leave the field empty and press Enter, so that no password is needed to log in.
- Copy the generated public key into the list of authorized keys on the master node with the command below.
- Now we will be able to log in to master through SSH without giving any password.
- Next, copy the same public key generated on the master node (in the $HOME/.ssh folder) into the SSH authorized keys file on each slave node. We can do this with the commands below.
Note: The above two commands must be issued from the master node, not from the slave nodes.
Here hduser is the username, and slave-1 and slave-2 are the hostnames.
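The SSH steps above could look roughly like the sketch below, run as hduser on the master node. It assumes the standard OpenSSH tools `ssh-keygen` and `ssh-copy-id` are available. The function names are made up for illustration, and passing `-N ""` is a non-interactive equivalent of pressing Enter at the passphrase prompt.

```shell
# Default key location offered by ssh-keygen
KEY="$HOME/.ssh/id_rsa"

# Step 1: create an RSA key pair with an empty passphrase.
generate_key() {
    ssh-keygen -t rsa -N "" -f "$KEY"
}

# Step 2: let master log in to itself without a password.
authorize_on_master() {
    cat "$KEY.pub" >> "$HOME/.ssh/authorized_keys"
    chmod 600 "$HOME/.ssh/authorized_keys"
}

# Step 3: append the public key to authorized_keys on each slave.
# ssh-copy-id asks for hduser's password on the slave once.
copy_key_to_slaves() {
    for host in slave-1 slave-2; do
        ssh-copy-id -i "$KEY.pub" "hduser@$host"
    done
}

# On master only:
# generate_key && authorize_on_master && copy_key_to_slaves
```

Afterwards, `ssh master`, `ssh slave-1` and `ssh slave-2` from the master node should each log in without asking for a password.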
Install Hadoop on each Machine
Now we need to download the latest stable version of Hadoop and install it on each node, usually in the /usr/lib/hadoop location. This mainly includes the three activities below on each node.
- Download the gzipped binary tarball of the latest stable Hadoop release into the preferred installation location, /usr/lib/hadoop, on each machine.
- Unarchive the downloaded file.
- Set the HADOOP_HOME, HADOOP_CONF_DIR and PATH environment variables to the appropriate Hadoop installation directory locations.
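The three activities above might be scripted as in the sketch below. Several details are assumptions rather than part of the original instructions: the Apache archive download URL, the use of `wget`, the /tmp staging directory, handing ownership of the install directory to hduser, and the function names. The etc/hadoop configuration directory follows the usual Hadoop 2.x layout.

```shell
HADOOP_VERSION=2.3.0
INSTALL_DIR=/usr/lib/hadoop
TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
# Assumed download location; any Apache mirror carrying this release works.
URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/${TARBALL}"

# Download the release and unpack it under INSTALL_DIR.
install_hadoop() {
    sudo mkdir -p "$INSTALL_DIR" &&
    wget -P /tmp "$URL" &&
    sudo tar -xzf "/tmp/$TARBALL" -C "$INSTALL_DIR" &&
    sudo chown -R hduser "$INSTALL_DIR"   # assumption: hduser runs the daemons
}

# Print the environment settings to add to hduser's ~/.bashrc.
hadoop_env() {
cat <<EOF
export HADOOP_HOME=$INSTALL_DIR/hadoop-$HADOOP_VERSION
export HADOOP_CONF_DIR=\$HADOOP_HOME/etc/hadoop
export PATH=\$PATH:\$HADOOP_HOME/bin:\$HADOOP_HOME/sbin
EOF
}

# On each node:
# install_hadoop
# hadoop_env >> ~/.bashrc && . ~/.bashrc
```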
To check whether Hadoop is installed properly on all machines, issue the command below from a terminal.
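The usual way to run this check is `hadoop version`, which prints the installed release. The wrapper below is only a sketch: the function name `check_hadoop` and the PATH check with its hint message are additions for illustration.

```shell
# Report the installed Hadoop version, or explain why the check failed.
check_hadoop() {
    if command -v hadoop >/dev/null 2>&1; then
        hadoop version
    else
        echo "hadoop not found on PATH; re-check HADOOP_HOME and PATH" >&2
        return 1
    fi
}

# On each machine:
# check_hadoop
```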