In this post, we discuss installing Hadoop in the cloud. Although many posts on this topic are available across the internet, we document the procedure for installing Cloudera Manager on Amazon EC2 instances, along with our practical observations and some tips for avoiding common issues. This post also gives a basic introduction to using Amazon AWS cloud services.
Creation of Amazon EC2 Instances:
First, we need to create the necessary EC2 instances on Amazon AWS with an appropriate AMI (Amazon Machine Image). In this post, we use Ubuntu 14.04 as the AMI for installing Cloudera Manager 5 along with the CDH 5.2 release. We are going to set up a 4-node cluster with 1 Namenode and 3 Datanodes, which is the minimum needed for a Cloudera Hadoop cluster to come up without any error messages or warnings.
Private and Public IP Addresses:
Creating EC2 instances in AWS assigns each instance one private IP address (used within AWS to reach the machine) and one public IP address (used to reach the machine from outside AWS, for example from the internet).
EC2 instance usage is billed on an hourly basis. Whenever we are not using an EC2 instance, we can stop it and later start it again with the same AMI configuration to save cost. In this case we are charged only for the hours during which the instance was running.
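To make the saving concrete, here is a small sketch of the hourly billing arithmetic. The hourly rate and the usage pattern are hypothetical values chosen only for illustration (they are not actual AWS prices), and the sketch counts instance hours only.

```python
import math

def instance_cost(hourly_rate, hours_running):
    """Cost of one instance when every started hour is billed in full."""
    return hourly_rate * math.ceil(hours_running)

# Hypothetical rate in USD per hour; check the current AWS price list.
HOURLY_RATE = 0.50
HOURS_IN_MONTH = 24 * 30

# Instance left running for the whole month.
always_on = instance_cost(HOURLY_RATE, HOURS_IN_MONTH)
# Instance started only for 8 hours on each of 22 working days.
part_time = instance_cost(HOURLY_RATE, 8 * 22)

print(f"always on: ${always_on:.2f}, part time: ${part_time:.2f}")
```

Under these assumed numbers, stopping the instance outside working hours cuts the instance charge to roughly a quarter.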
The only disadvantage of stopping and starting instances is that each time we start an instance, it is assigned a new, dynamically created private and public IP address pair, and we can no longer reach the instance at its previous addresses.
If we install Hadoop directly on such EC2 instances, we must either keep all of them running forever, so that their private and public IP addresses do not change after the Hadoop installation, or terminate the instances, re-create them and re-install Hadoop after every stop/start. Neither option is ideal for maintaining a Hadoop cluster.
So, to keep the cluster cost effective (so that we can stop and start the instances whenever needed), we can use Amazon VPC (Virtual Private Cloud) and Elastic IP addresses. Instances launched into a VPC keep their private IP addresses across stop and start, and an Elastic IP gives an instance a fixed public address, so together these two AWS services give us static private and public IP addresses for the EC2 instances we create. Keep in mind that these two additional services come at extra cost, but they give us the flexibility to stop the instances whenever we do not need them running.
In this post, we use the Amazon VPC, Elastic IP and EC2 services to set up a private cloud network and maintain static IP addresses.
Creation of VPC, Launching EC2 Instances and Assigning Elastic IP addresses:
After logging in to the AWS console, first select the VPC service. This opens the VPC dashboard as shown below.
Click Start VPC Wizard and select VPC with a single public subnet as shown below. Provide the VPC name and leave all other properties at their default values. Create the VPC as HDP-VPC.
Now select Services –> EC2 to open the EC2 Dashboard and launch instances into the VPC.
Namenode Instance Configuration:
Select m3.2xlarge as the instance type for the Namenode.
Click Launch Instance, choose Ubuntu 14.04 as the AMI, and follow the steps shown in the screens below in the same order.
Select 1 instance for the Namenode, select HDP-VPC (the VPC created above) as the Network, and leave the remaining properties at their default values.
Now add at least 80 GB of storage to install Cloudera Manager.
Under Tag Instance, give the instance the name CL_NN, and create a new security group as shown below.
Creation of Security Group:
Add inbound rules as shown above for TCP ports 7180, 7182, 7183 and 7432 and for SSH port 22. It is better to keep all the other rules shown in the above screen as well. To access this EC2 instance from any machine outside AWS, we need to select Anywhere as the Source.
If this is not set up properly, we can’t reach the Cloudera Manager server admin login or the PostgreSQL login.
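As a quick sanity check on the rule list, the sketch below verifies that every port named above is covered by some inbound TCP rule. The rule layout (`protocol`, `from_port`, `to_port`, `source`) is a hypothetical format invented for this example and is not the AWS API representation.

```python
# Ports named in this post: SSH plus the Cloudera Manager and PostgreSQL ports.
REQUIRED_TCP_PORTS = {22, 7180, 7182, 7183, 7432}

def missing_ports(inbound_rules, required=REQUIRED_TCP_PORTS):
    """Return the required ports that no inbound TCP rule covers, sorted."""
    missing = set()
    for port in required:
        covered = any(
            rule["protocol"] == "tcp"
            and rule["from_port"] <= port <= rule["to_port"]
            for rule in inbound_rules
        )
        if not covered:
            missing.add(port)
    return sorted(missing)

# Example rule set (hypothetical) that forgets the PostgreSQL port.
rules = [
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "source": "0.0.0.0/0"},
    {"protocol": "tcp", "from_port": 7180, "to_port": 7183, "source": "0.0.0.0/0"},
]
print(missing_ports(rules))  # prints [7432]
```

An empty list means all five ports are open in the rule set being checked.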
Now review the configuration and launch the instance:
On this page, click the Launch button. We will be asked to create a key pair and to Download Key Pair. This is the only place where we can save the private key, and without it we can’t connect to these EC2 instances from outside. Give the key pair the name HDPCluster1.
Now we can see the instance running under the Instances tab.
Creation of Elastic IP Address:
Go to Elastic IPs –> Allocate New Address. After the new IP address is allocated, open Associate Address and select the instance just created.
This gives the Namenode instance a static private and public IP address pair.
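The same two console steps can also be scripted. The following is a minimal sketch, assuming `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) with credentials already configured. The helper name `attach_elastic_ip` is our own and the instance ID in the usage comment is a placeholder.

```python
def attach_elastic_ip(ec2, instance_id):
    """Allocate a VPC Elastic IP and associate it with one instance.

    `ec2` is expected to behave like a boto3 EC2 client.
    Returns the newly allocated public IP address.
    """
    # Step 1: allocate a new Elastic IP for use inside a VPC.
    allocation = ec2.allocate_address(Domain="vpc")
    # Step 2: associate that address with the given instance.
    ec2.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation["AllocationId"],
    )
    return allocation["PublicIp"]

# Usage (assumption: boto3 is installed and configured):
#   import boto3
#   public_ip = attach_elastic_ip(boto3.client("ec2"), "<instance-id>")
```

Passing the client in as a parameter keeps the helper easy to try out against a stub before pointing it at a real account.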
Create DataNode EC2 Instances:
Similar to the Namenode instance creation shown above, create 3 instances under HDP-VPC, each with 100 GB of storage, and assign all of them to the security group created above. This time, choose Ubuntu 14.04 as the AMI and m3.xlarge as the Instance Type, and select the instance configuration as shown below.
Review the configuration and launch the instances. Then allocate three new Elastic IP addresses and associate them with the Datanode instances. Below is the list of the four instances:
Install Cloudera Manager on NameNode Instance:
Now connect to the Namenode instance from a terminal on our local Ubuntu machine over SSH (port 22). The commands needed to connect to an EC2 instance are shown in the screen below.
After changing the permissions on the HDPCluster1.pem file, we can use the ssh command below to connect to the EC2 instance.
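For readers who prefer to script this step, here is a minimal sketch of the two actions involved: restricting the key file permissions and building the ssh command. It assumes owner-read-only permissions (the equivalent of `chmod 400`) and the default `ubuntu` user of the Ubuntu AMI. The IP address is a placeholder from the documentation range, not a real instance.

```python
import os
import stat

def restrict_key_permissions(key_path):
    """Make the key readable by its owner only (equivalent to chmod 400)."""
    os.chmod(key_path, stat.S_IRUSR)

def ssh_command(key_path, host, user="ubuntu"):
    """Build the ssh argument list for connecting with a private key file."""
    return ["ssh", "-i", key_path, f"{user}@{host}"]

# 203.0.113.10 stands in for the Elastic IP of the Namenode instance.
print(" ".join(ssh_command("HDPCluster1.pem", "203.0.113.10")))
```

The returned list can be passed directly to `subprocess.run` once the placeholder address is replaced with the real Elastic IP.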
After connecting to the EC2 instance, run the commands below in sequence.
This will start the Cloudera Manager installer as shown below.
Follow the directions shown by the installer. After successful completion, the screen instructs us to log in on port 7180 of the Namenode hostname to reach the Cloudera Manager Admin page and continue with the CDH 5.2 installation.
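If the admin page does not load, it helps to check whether port 7180 is reachable at all before suspecting the installer. The sketch below builds the admin URL and tests the TCP port. The function names are our own, and the host address is a placeholder.

```python
import socket

def cm_admin_url(host, port=7180):
    """URL of the Cloudera Manager admin page on the given host."""
    return f"http://{host}:{port}"

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# 203.0.113.10 is a placeholder for the Namenode's public address.
print(cm_admin_url("203.0.113.10"))
```

If `port_is_open` returns False for the Namenode address, the security group inbound rules are the first thing to re-check.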
Log in to the admin page with admin as both username and password.
Continue with the steps shown in the screens below, in sequence.
Here, in the search box, provide private IP addresses or private hostnames to avoid unnecessary error messages when starting the Cloudera Agent services later.
Public IP addresses also seem to work, but sometimes we may receive error messages as shown below:
Perform the cluster installation using parcels to install CDH 5.2.
In the SSH Login Credentials screen below, we need to select ubuntu as the username rather than root. We should not select root here. We have to provide HDPCluster1.pem as the private key file and select All hosts accept same private key.
If we don’t get any error messages, the installation succeeds as shown below.
If we do get error messages as shown below, add the private IP addresses and private DNS names of all nodes to the /etc/hosts file on every node of the cluster being installed.
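The /etc/hosts entries can be generated rather than typed by hand. Below is a minimal sketch that formats one line per node. The IP addresses and DNS names are placeholders, so substitute the private values shown for each instance in the EC2 console.

```python
def hosts_entries(nodes):
    """Format (private_ip, private_dns) pairs as /etc/hosts lines.

    Each line lists the IP, the full private DNS name, and the short
    host name as an alias.
    """
    lines = []
    for private_ip, private_dns in nodes:
        short_name = private_dns.split(".")[0]
        lines.append(f"{private_ip} {private_dns} {short_name}")
    return "\n".join(lines)

# Placeholder values for the 1 Namenode and 3 Datanodes of this cluster.
nodes = [
    ("10.0.0.10", "ip-10-0-0-10.ec2.internal"),
    ("10.0.0.11", "ip-10-0-0-11.ec2.internal"),
    ("10.0.0.12", "ip-10-0-0-12.ec2.internal"),
    ("10.0.0.13", "ip-10-0-0-13.ec2.internal"),
]
print(hosts_entries(nodes))
```

The printed lines can then be appended to /etc/hosts on each of the four nodes.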
In the next steps, save the PostgreSQL login username and password somewhere, so that we can log in to PostgreSQL manually in case of any issues in creating the metastore tables.
Next, select Continue with the default settings for the cluster configuration and follow along until the first run of all the requested services. Once all services start successfully, the cluster shows good health for all of them as shown below.
As all the services show green status, the Hadoop cluster is successfully installed and configured, and all the services are running without any warning messages.