In this post, we will discuss Hadoop installation in the cloud. Though there are a number of posts available across the internet on this topic, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, along with our practical observations and some tips and hints to avoid common issues. This post also gives a basic introduction to the usage of Amazon AWS cloud services.
Creation of Amazon EC2 Instances:
First, we need to create the necessary EC2 instances on Amazon AWS with an appropriate AMI (Amazon Machine Image). In this post, we are using Ubuntu 14.04 as the AMI for the Cloudera Manager 5 installation along with the CDH 5.2 release. We are going to set up a 4-node cluster with 1 NameNode and 3 DataNodes, which is the minimum size for a Cloudera Hadoop cluster setup that completes without error messages or warnings.
Private and Public IP Addresses:
When EC2 instances are created in AWS, each instance is assigned one private IP address (used within AWS to access the machine) and one public IP address (used to access the machine from outside AWS, i.e. from the internet).
Pricing Mode:
EC2 instance usage is billed on an hourly basis. Whenever we are not using an EC2 instance, we can stop it and start it again later with the same AMI configuration to save cost. In this case, we are billed only for the hours during which the instances were running.
The only disadvantage of stopping and starting instances is that every time we start an instance, it is assigned a new, dynamically created private and public IP address pair, and we can no longer access it with the previous addresses.
Hint:
If we install Hadoop on EC2 instances directly, then we either need to keep all the EC2 instances running forever so that their private and public IP addresses do not change after the Hadoop installation, or we need to terminate the instances, re-create them and re-install Hadoop after every stop/start. Neither of these options is ideal for maintaining a Hadoop cluster.
So, in order to keep the cluster cost-effective (so that we can stop and start the instances whenever needed), we can make use of the Amazon VPC (Virtual Private Cloud) service and Elastic IP addresses. With these two AWS services, we can achieve static private and public IP addresses for the EC2 instances being created. Keep in mind that these two additional services come at extra cost, but they provide the flexibility to stop the EC2 instances whenever we do not need them running, which saves cost overall.
In this post, we will make use of the Amazon VPC, Elastic IP and EC2 instance services to set up a private cloud network and maintain static IP addresses.
Creation of VPC, Launching EC2 Instances and Assigning Elastic IP addresses:
After logging into the AWS console, first select the VPC cloud service; it will open the VPC dashboard as shown below.
Click on Start VPC Wizard and select VPC with a Single Public Subnet as shown below. Provide a VPC name, leave all the remaining properties at their default values, and create the VPC as HDP-VPC.
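For reference, the VPC wizard corresponds roughly to the AWS CLI calls below. This is only a hedged sketch, assuming the AWS CLI is installed and configured; the CIDR blocks and resource IDs are illustrative placeholders, and the console wizard additionally sets up the route table and tagging for you.

# Approximate CLI equivalent of the "VPC with a Single Public Subnet" wizard (IDs and CIDRs are placeholders)
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxxxxxxx --cidr-block 10.0.0.0/24
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id vpc-xxxxxxxx --internet-gateway-id igw-xxxxxxxx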
Now select Services –> EC2 to open the EC2 dashboard and launch instances into the VPC.
Namenode Instance Configuration:
For the NameNode, we will use the m3.2xlarge instance type.
Click on Launch Instance, choose Ubuntu 14.04 as the AMI, and follow the steps shown in the screens below in the same order.
Select 1 instance for the NameNode, choose HDP-VPC (the VPC created above) as the network, and leave the remaining properties at their default values.
Now add storage of at least 80 GB to install Cloudera Manager.
Give the instance the name CL_NN in the Tag Instance step and create a new security group as shown below.
Creation of Security Group:
Add inbound rules as shown above for TCP ports 7180, 7182, 7183 and 7432 and SSH port 22; it is better to keep all the other rules shown in the screen above as well. In order to access this EC2 instance from any outside machine, we need to select Anywhere in the Source column.
If this is not set up properly, we can't access the Cloudera Manager admin login or the PostgreSQL login.
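If you prefer the command line, the same inbound rules can be added with the AWS CLI roughly as below. This is a hedged sketch: the security group ID is a placeholder, and the 0.0.0.0/0 CIDR mirrors the Anywhere source selected in the console.

# Open the ports Cloudera Manager needs plus SSH (security group ID is a placeholder)
for port in 22 7180 7182 7183 7432; do
  aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx --protocol tcp --port "$port" --cidr 0.0.0.0/0
done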
Now review the configuration and launch the instance:
After this page, click on the Launch button; we will be asked to create a key pair and download it. This is the only opportunity to save the private key pair, and without it we can't connect to these EC2 instances from outside. Give the key pair the name HDPCluster1.
Now we can see the instance running under Instances tab.
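For reference, the console launch above corresponds roughly to the single AWS CLI call below. This is only a hedged sketch: the AMI, subnet and security group IDs are placeholders, the HDPCluster1 key pair is assumed to already exist, and the 80 GB root volume is set through the block device mapping.

# Approximate CLI equivalent of launching the NameNode instance (all IDs are placeholders).
# The AMI ID should be an Ubuntu 14.04 image for your region; the subnet is the public subnet inside HDP-VPC.
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m3.2xlarge --count 1 --key-name HDPCluster1 --subnet-id subnet-xxxxxxxx --security-group-ids sg-xxxxxxxx --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":80}}]'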
Creation of Elastic IP Address:
Go to Elastic IPs –> Allocate New Address; after the new IP address is allocated, open Associate Address and select the instance we just created.
This will associate a static private and public IP address pair with the NameNode instance.
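The same allocation and association can be done with the AWS CLI roughly as follows (a hedged sketch; the instance and allocation IDs are placeholders).

# Allocate an Elastic IP inside the VPC, then attach it to the NameNode instance
aws ec2 allocate-address --domain vpc
# Note the eipalloc-... ID returned above, then associate it (IDs below are placeholders)
aws ec2 associate-address --instance-id i-xxxxxxxx --allocation-id eipalloc-xxxxxxxx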
Create DataNode EC2 Instances:
Similar to the NameNode EC2 instance creation shown above, create 3 instances under HDP-VPC, each with 100 GB of storage and all assigned to the same security group created above. This time, choose Ubuntu 14.04 as the AMI and m3.xlarge as the instance type, and select the instance configuration as shown below.
Review the configuration and launch the instances, then allocate three new Elastic IP addresses and associate them with the DataNode instances. Below is the list of the four instances:
Install Cloudera Manager on NameNode Instance:
Now connect to the NameNode instance from a terminal on our local Ubuntu machine over SSH port 22. The commands needed to connect to an EC2 instance are shown in the screen below.
After changing the permissions on the HDPCluster1.pem file, we can use the ssh command below to connect to the EC2 instance.
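The connection commands look roughly like this (a sketch; the address is a placeholder for the NameNode's Elastic public IP).

# Restrict permissions on the downloaded key; ssh refuses to use a world-readable key
chmod 400 HDPCluster1.pem
# Connect as the default 'ubuntu' user of the Ubuntu 14.04 AMI (replace with your Elastic IP)
ssh -i HDPCluster1.pem ubuntu@<namenode-public-ip>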
After connecting to the EC2 instance, perform the commands below in sequence.
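The sequence is roughly the following (a hedged sketch; the download URL is the one Cloudera published for CM 5 at the time of writing and may have changed since).

# Refresh the package lists on the fresh Ubuntu instance
sudo apt-get update
# Download the Cloudera Manager 5 installer binary
wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin
# Make it executable and run it
chmod u+x cloudera-manager-installer.bin
sudo ./cloudera-manager-installer.bin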
This will start the Cloudera Manager installer as shown below.
Follow the directions shown by the installer. After successful completion, the screen will instruct us to log in on port 7180 of the NameNode hostname to reach the Cloudera Manager Admin page, from where the CDH 5.2 installation continues.
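Before opening the admin page at http://<namenode-public-ip>:7180 (placeholder address), it can help to confirm from the NameNode shell that the Cloudera Manager server is up and listening; a couple of hedged checks are shown below (the server can take a few minutes to come up after the installer finishes).

# Check that the Cloudera Manager server service is running
sudo service cloudera-scm-server status
# Check that something is listening on port 7180
sudo netstat -tlnp | grep 7180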
Log in to the admin page with admin as both the username and the password.
Continue with the steps as shown in the screens below, in sequence.
Here, in the search box, provide private IP addresses or private hostnames to avoid unnecessary error messages while starting the Cloudera agent services later.
Even public IP addresses seem to work fine, but sometimes we may receive error messages as shown below:
Perform the cluster installation using parcels to install CDH 5.2.
In the SSH Login Credentials screen below, we need to select the username as ubuntu rather than root; we should not select root here. We have to provide HDPCluster1.pem as the private key file and select the option for all hosts to accept the same private key.
If we don't get any error messages, the installation will complete successfully as shown below.
Or, if we get any error messages as shown below:
In this case, provide the private IP addresses and private DNS names in the /etc/hosts file of every node of the cluster being installed.
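A hedged example of what such an /etc/hosts file might look like on every node is given below; the IP addresses and hostnames are placeholders in the typical EC2 private-DNS style (the internal DNS suffix varies by region), so substitute the actual values from your instances.

# /etc/hosts (same content on every node; addresses and hostnames below are placeholders)
127.0.0.1     localhost
172.31.0.10   ip-172-31-0-10.ec2.internal   ip-172-31-0-10   # NameNode
172.31.0.11   ip-172-31-0-11.ec2.internal   ip-172-31-0-11   # DataNode 1
172.31.0.12   ip-172-31-0-12.ec2.internal   ip-172-31-0-12   # DataNode 2
172.31.0.13   ip-172-31-0-13.ec2.internal   ip-172-31-0-13   # DataNode 3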
In the next steps, save the PostgreSQL login username and password somewhere so that we can log in to PostgreSQL manually in case of any issues in creating the metastore tables.
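If a manual login is ever needed, the embedded Cloudera Manager database is a PostgreSQL instance listening on port 7432 (one of the ports opened in the security group earlier). A hedged example of connecting with psql is shown below, assuming the psql client is available on the host; substitute the username and password that the wizard displays.

# Connect to the embedded PostgreSQL used by Cloudera Manager; replace 'scm' with the username
# shown on the database setup screen (you will be prompted for the password)
psql -h localhost -p 7432 -U scm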
Next, select Continue with the default settings for the cluster configuration and follow through until the first run of all the requested services. On successful start of all services, the cluster shows good health for every service, as shown below.
As all the services show green status, the Hadoop cluster is successfully installed and configured, and all the services are running without any warning messages.
Hi,
Thanks for the tutorial. It is helping me, a lot.
I have been unsuccessful, so far. I am having a problem with reverse DNS lookup. How did you configure your networking so that it will work?
I had to specify the internal names during my Cloudera configuration. Any advice you can offer is greatly appreciated.
Thanks,
Jeff
Hi Jeff,
Sometimes it will succeed without any error messages if we provide public IP addresses on the page where we specify hosts for the Cloudera installation. If this fails, try providing your public hostnames (instead of IP addresses).
If this also fails, try giving private IP addresses/hostnames. If you still get errors during installation, then you need to change the /etc/hosts file on each node of the cluster being installed.
You need to copy the IP addresses of all the nodes into the /etc/hosts file of each node in the format below:
private-ip-address-namenode private-hostname-namenode
private-ip-address-dn1 private-hostname-dn1
private-ip-address-dn2 private-hostname-dn2
Copy these lines (of course, you need to provide the actual IP addresses and hostnames) into the /etc/hosts file of each node, replacing the 127.0.0.1 localhost lines on each machine.
If you still get any error messages, please post the error details/screenshots in the hadoop discussion forum (http://hadooptutorial.info/forums/forum/hadoop-discussion-forum/); we will definitely help you resolve the issue.
Hi
I am trying the CDH automatic installation on AWS EC2 using the Cloudera Manager bin. I have created one Ubuntu Precise 12.04 LTS micro instance.
I followed the on-screen instructions as per this tutorial: ” http://www.cloudera.com/content/cloudera/en/documentation/cloudera-manager/v4-7-1/Cloudera-Manager-I…
1) This is my /etc/hosts file:
127.0.0.1 localhost
172.31.13.46 master
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
2) I downloaded cloudera-manager-installer.bin and changed its permissions.
3) After running sudo ./cloudera-manager-installer.bin, Cloudera Manager installed.
4) But I could not access the Cloudera Manager web console using ” http://172.31.13.46:7180 ”. I opened port 7180 while creating the instance, but I am still not able to access the web console.
5) My Cloudera Manager DB and Cloudera Manager server are both running.
6) Port 7180 is also not listening on my Ubuntu server. I used the following command, ” sudo ufw allow 7180 ″, but no use.
7) I checked $ sudo ufw status and the result is inactive
8) When I check $ sudo service cloudera-scm-agent status on 172.31.13.46, it comes up as an unrecognized service.
I am struggling with this part. Could you please let me know where I went wrong in installing Cloudera in a clustered environment? It would be very helpful for me.
Thanks in advance,
Regards,
Bharath
Hi Siva,
Thank you for this helpful tutorial. I am trying to evaluate whether I should be using EMR + S3 or if I should be using EC2+ Cloudera Enterprise Hub.
From your experience, will you be willing to provide your thoughts on Pros and Cons of (EMR + S3) Vs (EC2 + Cloudera Enterprise Hub)? Thank you very much for sharing all the good work.
Best,
CK
I'm getting the error below, as mentioned by you:
Installation Failed. Failed to Receive Heartbeat from Agent
Ensure that the host’s hostname is configured properly.
Ensure that port 7182 is accessible on the Cloudera Manager server (check firewall rules).
Ensure that ports 9000 and 9001 are free on the host being added.
Check agent logs in /var/log/cloudera-scm-agent/ on the host being added (some of the logs can be found in the installation details).
Could you please help?
My hosts file looks like below. Mine is a single-node cluster with CDH5 on a local system with Ubuntu 12.04 LTS.
127.0.0.1 localhost
127.0.1.1 ubuntu
Edit the Amazon EC2 Security Groups -> Inbound and Outbound rules.
Select Type All traffic.
The /etc/hosts file is very critical in this installation.
I have finished all the other steps, but I always get an error that the hostname is not properly configured.
According to the Cloudera docs, /etc/hosts should look like this:
127.0.0.1 localhost.localdomain localhost
192.168.1.1 cluster01.example.com cluster01
….
But it's not working, so I guess we would like to see the author's /etc/hosts file content to see how it is set up to work.
Please share your /etc/hosts file.
thanks,
Robin
OK, I noticed the tutorial said that in this case (if the Cloudera Manager installation fails / the host's hostname is not configured properly), we should provide the private IPs and private DNS names in /etc/hosts on all nodes.
I wonder if /etc/hosts has to be configured in two steps to get Cloudera Manager installed.
Shall we list all the public IPs, public DNS names, private IPs and private DNS names in the /etc/hosts file before installing Cloudera Manager?
thanks,
Robin
I posted my question a few days ago and haven't heard from anyone yet.
I simply would like a sample /etc/hosts file to fix the host's hostname error from the Cloudera Manager installation. Can anyone help?
thanks,
Hi Robin, sorry for the delayed response. I have currently shut down my AWS cluster since it is chargeable and am maintaining an offline cluster, but anyway, to your question below:
First, try listing your public IP addresses and DNS names in /etc/hosts and check if this works.
If this is not working, then you can try the private IPs and DNS names in the /etc/hosts file.
Make sure that these entries are the same across all the nodes in your cluster.
For example, if you have a 4-node cluster,
then there should be entries for all 4 machines' private IPs and DNS names in every /etc/hosts file across those 4 machines.
I hope this will be helpful for you.
Siva Sr.
Thank you so much for taking the time to answer my question. Sorry to take up your precious time.
Your tutorial is the simplest one for the CDH 5 install on AWS/EC2. I have learned so much from it.
Thank you for doing this.
I always get the error: Ensure that the host's hostname is configured properly…. and
no matter how I modify my /etc/hosts file, this error stays with me like a cancer cell….
My hosts file looks like this now:
127.0.0.1 localhost.localdomain localhost
52.88.118.48 ec2-52-88-118-48.us-west-2.compute.amazonaws.com 172.31.0.146 ip-172-31-0-146
52.26.227.159 ec2-52-26-227-159.us-west-2.compute.amazonaws.com 172.31.0.147 ip-172-31-0-147
52.27.9.129 ec2-52-27-9-129.us-west-2.compute.amazonaws.com 172.31.0.149 ip-172-31-0-149
Also, the output on my master server is:
ubuntu@ip-172-31-0-146:~$ hostname
ip-172-31-0-146
ubuntu@ip-172-31-0-146:~$ hostname -f
ip-172-31-0-146
ubuntu@ip-172-31-0-146:~$ hostname -A
ip-172-31-0-146
ubuntu@ip-172-31-0-146:~$ sudo ifconfig
…. addr:172.31.0.146 ……….
please help, thank you so much.
Robin
so any suggestions?
Yes, I did have the same /etc/hosts files across all my slave nodes. One of them has died now, so only 3 for now.
With this new /etc/hosts file, I still get the same error. It is like a cancer cell to me :(. I want to fix it so badly.
thanks Siva,
Robin
Hi Robin,
When you get the error, click on the back arrow at the bottom of the page until you reach the home page, then click Continue; you will be able to get the installation going.
However, I have a problem: once the cluster is shut down and I try to bring it back up, HDFS fails to start the NameNode, among other things.
Any suggestions, Siva?
Hi Odie,
Your problem is due to the dynamic IP addresses allocated by AWS to your machines. To resolve this issue, go with static private and public IP addresses.
Hi Siva,
I used Elastic IPs for all nodes but still had problems with HDFS and HBase, and then my NameNode went into safe mode.
Hi, I’m having the same problem… the services are not starting… it is stuck at the very last moment
I am pretty new to the whole concept of Big Data and Cloudera in general. I have recently registered for Amazon services and got free 1-year usage. So will it charge me if I try to do this Cloudera setup on an instance for learning purposes?