Install Hadoop on a Multi-Node Cluster


This post is written under the assumption that a user reading it already has an idea of installing and configuring Hadoop on a single node cluster. If not, it is better to first go through the post Installing Hadoop on single node cluster.

In this post we will briefly discuss installing and configuring Hadoop 2.3.0 on a multi-node cluster. For simplicity, we will consider a small cluster of 3 nodes, each with the minimum configuration described below.

Install Hadoop on a Multi-Node Cluster:

Prerequisites

  • All three machines have the latest 64-bit Ubuntu OS installed. At the time of writing this post, Ubuntu 14.04 is the latest version available.
  • All three machines must have Java version 1.6 or higher installed. If not, follow the instructions in the post Installing Java on Ubuntu on all three machines and set up the JAVA_HOME & PATH environment variables appropriately.
  • All three machines must have SSH (Secure Shell) installed. If not already installed, please follow the instructions from the post Installing SSH on Ubuntu.
  • Let's consider three machines with the hostnames master, slave-1 and slave-2, whose IP addresses and hostnames are mapped in the /etc/hosts file of each machine as shown below.

We plan to use the master node to run the NameNode and ResourceManager daemons along with a DataNode and NodeManager, while slave-1 and slave-2 will be set up only to run the DataNode and NodeManager daemons.

  • Create a separate user hduser on all three machines with a command such as the one shown below.
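
On Ubuntu, a minimal way to create this user is with adduser; run it on each of the three machines (the command prompts for a password and basic user details):

    sudo adduser hduser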

Configure /etc/hosts file on each machine

By default on an Ubuntu system, the /etc/hosts file on each machine will contain IP address and hostname entries as shown below.
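
On a fresh installation the file typically looks similar to the following (here master is used as the machine's own hostname for illustration):

    127.0.0.1    localhost
    127.0.1.1    master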

In order for all three machines to recognize each other, we need to update the /etc/hosts file on each machine with the IP addresses and hostnames of all three machines. Only a superuser (sudo) has access to edit this file. The format for specifying the hostnames and IP addresses is shown below.
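
The IP addresses below are only illustrative; replace them with the actual addresses of your machines:

    192.168.1.100    master
    192.168.1.101    slave-1
    192.168.1.102    slave-2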

Copy these host entries into the /etc/hosts file of all three machines. With this setting, each machine can reach the others just by hostname instead of specifying the IP address every time.

SSH Configuration for Cluster setup

Before installing Hadoop on the three machines, we need to set them up as a cluster in which the master node can connect to the slave nodes without requiring a password, and it (the master) should also be able to connect to itself without requiring any authentication/password.

For this, as per our prerequisites, SSH must already be installed on all three machines.

  • Create a new RSA key pair on the master node with a command like the one below.
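
As the hduser on the master node, the key pair can be generated with, for example:

    ssh-keygen -t rsa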

Press Enter at the first prompt to accept the default file in which to save the key, and leave the passphrase empty by pressing Enter, to make sure no password is needed to log in.

  • We need to copy the public key generated above into the list of authorized keys on the master node with a command like the one below.
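
Assuming the key was saved in the default location, it can be appended to the master node's authorized keys as follows:

    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys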

  • Now we will be able to log in to the master node through SSH without giving any password.
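
This can be verified from the master node itself; the first connection may ask to confirm the host fingerprint, but it should not ask for a password:

    ssh master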

  • Now, we need to copy the same public key generated on the master node (in the $HOME/.ssh folder on the master node) into the corresponding SSH authorized keys files on all slave nodes. We can do this with the help of the commands below.
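
One convenient way is ssh-copy-id, using the username and hostnames from our /etc/hosts setup:

    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave-2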

Note: The above two commands need to be issued from the master node, not from the slave nodes.

Here, hduser is the username and slave-1 and slave-2 are the hostnames of the slave nodes.

Install Hadoop on each Machine

Now, we need to download the latest stable version of Hadoop and install it on each node, usually in the /usr/lib/hadoop location. This mainly includes the three activities below on each node (example commands follow the list).

  1. Download the binary gzipped tarball of the latest stable Hadoop release into the preferred installation location /usr/lib/hadoop on each machine.
  2. Unarchive the downloaded tarball.
  3. Set up the HADOOP_HOME, HADOOP_CONF_DIR and PATH environment variables with the appropriate Hadoop installation directory locations.
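
A sketch of these steps, assuming Hadoop 2.3.0 downloaded from the Apache archive (the URL, ownership and shell profile file are illustrative; adjust them for your setup):

    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.3.0/hadoop-2.3.0.tar.gz
    sudo mkdir -p /usr/lib/hadoop
    sudo tar -xzf hadoop-2.3.0.tar.gz -C /usr/lib/hadoop --strip-components=1
    sudo chown -R hduser:hduser /usr/lib/hadoop

    # append to hduser's ~/.bashrc, then open a new shell or source the file
    export HADOOP_HOME=/usr/lib/hadoop
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin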

To check whether Hadoop is installed properly on all machines, issue the command below from a terminal.
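
Printing the installed version is a quick check; it reports the Hadoop version followed by build details:

    hadoop version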

If the installation is successful, we should see the Hadoop version and build information in the output; if not, review your installation once again.

Configure Master node

Now, on the master node, we need to update the hdfs-site.xml configuration file in the HADOOP_CONF_DIR location with the properties below.
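
A minimal sketch of what this file might contain; the replication factor and storage directories are illustrative choices, and the last two properties are the hostname-related flags discussed in the note further below:

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///usr/lib/hadoop/hadoop_data/namenode</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///usr/lib/hadoop/hadoop_data/datanode</value>
      </property>
      <property>
        <name>dfs.datanode.use.datanode.hostname</name>
        <value>true</value>
      </property>
      <property>
        <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
        <value>true</value>
      </property>
    </configuration>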

Update slaves file:

Update the slaves file on the master node with the hostnames of all the nodes that should run the slave daemons. Since we are planning to run a DataNode on all three machines, we will mention all three hostnames in the slaves file.
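
With the hostnames used in this post, the slaves file (in the HADOOP_CONF_DIR location on the master node) would simply contain:

    master
    slave-1
    slave-2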

Note: In the above hdfs-site.xml listing, we can set the two flags dfs.datanode.use.datanode.hostname and dfs.namenode.datanode.registration.ip-hostname-check to false if we do not want to map IP addresses to hostnames in the /etc/hosts file. If these flags are false, then we need to list the IP addresses of the slave nodes in the slaves file instead of hostnames.

Configure Slave nodes

The configurations below are the same for all slave nodes, i.e. the same settings have to be applied on each slave node machine.

Update core-site.xml:

We need to update core-site.xml on each slave node to specify the NameNode address as shown below.
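
A minimal sketch, assuming the NameNode on master listens on port 9000 (a common choice in tutorials; use whatever port is configured on your master node):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>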

Format NameNode

Now, we need to format the NameNode by issuing the command below from the master machine.
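
For example, run as hduser on the master node:

    hdfs namenode -format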

Start/stop Daemons

Now the cluster is ready to run the DFS and YARN daemons, i.e. we can issue the start-dfs.sh and start-yarn.sh commands from the master machine itself to run the daemons on the entire cluster.

Note: We do not need to start the daemons separately on the slave nodes. Just issuing the start-dfs.sh or start-yarn.sh commands on the master machine will itself trigger the DataNode and NodeManager daemons on the slave nodes as well.
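
For example, as hduser on the master node:

    start-dfs.sh     # starts the HDFS daemons: NameNode (and SecondaryNameNode) on master, DataNodes on the hosts in the slaves file
    start-yarn.sh    # starts the YARN daemons: ResourceManager on master, NodeManagers on the hosts in the slaves file

    # and later, to stop the cluster
    stop-yarn.sh
    stop-dfs.sh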

The output of the jps command from the master node will be as shown below.
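
Something similar to the following is expected (the process IDs are illustrative and will differ on your machine):

    2964 NameNode
    3324 SecondaryNameNode
    3128 DataNode
    3674 ResourceManager
    3819 NodeManager
    3981 Jps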

But the jps command from a slave node will result in output like the below:
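
Again with illustrative process IDs:

    2210 DataNode
    2370 NodeManager
    2511 Jps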

Now, Hadoop is installed on your cluster successfully.


