Hadoop can be installed on a cluster of many machines in fully distributed mode or on a single machine in pseudo-distributed mode.
Apart from these two modes, Hadoop can also run in standalone (local) mode. In standalone mode no daemons run and everything executes in a single JVM, which makes it easy to run and test MapReduce programs during development.
In pseudo-distributed mode, all the Hadoop daemons run on the local machine, simulating a cluster on a small scale. It is easy to install on a single machine and is sufficient to run all the components of Hadoop.
So, we will learn how to install a stable Hadoop release from the Apache Software Foundation as a single-node cluster. The installation below is done on an Ubuntu machine.
Hadoop Installation Prerequisites:
- As Hadoop is written in Java, JDK 1.6 or later is required to install Hadoop.
If Java is not already installed on Ubuntu, please refer to the post on How to Install Java on Ubuntu for installation instructions.
- Download the latest stable version of Hadoop from the Apache release mirrors. In this example installation, version 2.3.0 is used. Download the gzipped binary tarball, hadoop-2.3.0.tar.gz.
- Copy the gzipped binary file into your preferred directory for the Hadoop installation, generally /usr/lib/hadoop. If this directory does not exist, create it, then copy the file into it and extract it.
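The create, copy, and extract steps can be sketched as a small shell function. This is a minimal sketch, not the original commands: the function name `stage_hadoop` and its optional third argument are made up for illustration, and it hands the directory to the current user with `chown` instead of loosening the directory's mode, which is an assumption about how the hadoop user is given permission.

```shell
# Sketch: create the install directory, copy the tarball in, and unpack it.
# stage_hadoop is a hypothetical helper name, not part of Hadoop.
# Usage: stage_hadoop [tarball] [install_dir] [sudo]
stage_hadoop() {
  tarball="${1:-$HOME/Downloads/hadoop-2.3.0.tar.gz}"
  install_dir="${2:-/usr/lib/hadoop}"
  as_root="${3:-}"   # pass "sudo" when install_dir lives under /usr/lib

  # Create the directory (no error if it already exists).
  $as_root mkdir -p "$install_dir" || return 1
  # Assumption: make the current user the owner so it can edit the directory.
  $as_root chown "$(id -un)" "$install_dir" || return 1
  # Copy the downloaded tarball into the install directory.
  cp "$tarball" "$install_dir"/ || return 1
  # Extract it in place; the 2.3.0 tarball unpacks into a hadoop-2.3.0 folder.
  tar -xzf "$install_dir/$(basename "$tarball")" -C "$install_dir"
}
```

With the locations used in this post, the call would be `stage_hadoop "$HOME/Downloads/hadoop-2.3.0.tar.gz" /usr/lib/hadoop sudo`.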
These steps create the directory /usr/lib/hadoop and give the hadoop user permission to edit it, copy the downloaded file from $HOME/Downloads (the default download directory in Ubuntu) into /usr/lib/hadoop, and extract the contents of the gzipped file.
Browse through the newly created directory, hadoop-2.3.0, under the /usr/lib/hadoop folder. Below are the details of each directory under hadoop-2.3.0.
bin, sbin: These two folders contain the executable scripts, many of them stored as ‘*.sh’ files. So, these folders need to be added to the list of directories in the PATH environment variable.
etc/hadoop: Configuration directory. It contains all the config files that need to be modified for our installation.
include: This folder contains the ‘.h’ and ‘.hh’ header files needed for the C and C++ APIs.
lib, libexec: These two are library folders that contain the necessary library files.
share: This folder contains the jar files, documentation, and source code for the current Hadoop release.
Now, we need to set up the Hadoop environment variables in the .bashrc profile file, so that they are picked up automatically whenever a terminal is started.
In the .bashrc file, add the required lines at the bottom, adjusted to your Hadoop installation directory.
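The lines added to .bashrc would look roughly like the following. This is a sketch that assumes Hadoop was extracted to /usr/lib/hadoop/hadoop-2.3.0 as described earlier. HADOOP_HOME is a conventional variable name chosen here rather than one confirmed by this post, so adjust the names and path to your own installation.

```shell
# Assumed install location from the earlier steps; change it to match your system.
export HADOOP_HOME=/usr/lib/hadoop/hadoop-2.3.0
# Configuration directory that ships inside the release.
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
# Make the scripts in bin and sbin available from any directory.
export PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
```

Appending to PATH, rather than prepending, keeps the system's own commands ahead of anything in the Hadoop folders.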
Include both the bin and sbin directories in the PATH environment variable. With this setting, we can run all the Hadoop scripts from any terminal instead of navigating to the Hadoop directory every time. Save and close the .bashrc file.
To verify the installation, close the terminal, open a new one, and enter the command below.
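A common check, assumed here because the original command is not shown, is the `hadoop version` subcommand. The sketch below wraps it in a hypothetical helper, `check_hadoop`, that first confirms the `hadoop` launcher is reachable through PATH.

```shell
# Sketch: confirm the hadoop launcher is on PATH, then print its version.
# check_hadoop is a hypothetical helper name, not part of Hadoop.
check_hadoop() {
  if command -v hadoop >/dev/null 2>&1; then
    hadoop version
  else
    echo "hadoop not found on PATH; re-check the .bashrc entries" >&2
    return 1
  fi
}
```

Running `check_hadoop` in the new terminal should print the release details, which for this installation would be expected to name version 2.3.0.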
If you get a message similar to the one above, your installation is successful.
- Configuring SSH: Since Hadoop runs multiple processes on one or more machines, we need to ensure that the hadoop user can connect to each host without entering a password. This can be set up with passwordless secure shell (SSH) login. Please refer to this post for the SSH setup for Hadoop.
- Set the JAVA_HOME environment variable in the ‘hadoop-env.sh’ file in Hadoop’s configuration directory, etc/hadoop. The remaining environment variables can stay as they are.
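Setting JAVA_HOME in hadoop-env.sh can be done by hand in an editor, or scripted as in the sketch below. The helper name `set_hadoop_java_home` and the JDK path in the usage comment are illustrative assumptions, so substitute the path where your JDK is actually installed. The sketch relies on GNU sed's in-place flag, which is available on Ubuntu.

```shell
# Sketch: point hadoop-env.sh at a specific JDK.
# set_hadoop_java_home is a hypothetical helper name, not part of Hadoop.
# Usage (example JDK path, adjust to your system):
#   set_hadoop_java_home /usr/lib/hadoop/hadoop-2.3.0/etc/hadoop/hadoop-env.sh \
#       /usr/lib/jvm/java-7-openjdk-amd64
set_hadoop_java_home() {
  env_file="$1"
  java_home="$2"
  if grep -q '^export JAVA_HOME=' "$env_file"; then
    # Rewrite the existing export line in place; "|" is the sed delimiter,
    # so the JDK path must not contain "|" or "&".
    sed -i "s|^export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file"
  else
    # No export line yet: append one at the end of the file.
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
  fi
}
```

Only the JAVA_HOME line is touched, so the other settings in the file keep their shipped values.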
Most of the Hadoop properties are configured through the four XML files below:
core-site.xml: Properties common to HDFS, YARN, MapReduce, and so on are stored in this file.
hdfs-site.xml: HDFS-specific properties are stored in this file.
mapred-site.xml: MapReduce-specific properties are stored in this file. If this file is not present in HADOOP_CONF_DIR, create it by renaming the ‘mapred-site.xml.template’ file to mapred-site.xml.
yarn-site.xml: YARN properties are stored in this file.
All of the above hold site-specific properties, which apply only to a single site, but Hadoop provides default configurations as well. These are present in core-default.xml, hdfs-default.xml, mapred-default.xml, and yarn-default.xml, which can be found under the share directory.
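Creating mapred-site.xml from its template, as described in the mapred-site.xml entry, can be sketched as follows. The helper name `ensure_mapred_site` is hypothetical, and the sketch copies the template rather than renaming it, so the original stays available for reference.

```shell
# Sketch: create mapred-site.xml from the shipped template if it is missing.
# ensure_mapred_site is a hypothetical helper name, not part of Hadoop.
# Usage: ensure_mapred_site [conf_dir]   (defaults to $HADOOP_CONF_DIR)
ensure_mapred_site() {
  conf_dir="${1:-$HADOOP_CONF_DIR}"
  target="$conf_dir/mapred-site.xml"
  # Leave an existing file alone so earlier edits are not overwritten.
  if [ -f "$target" ]; then
    return 0
  fi
  cp "$conf_dir/mapred-site.xml.template" "$target"
}
```

Because an existing mapred-site.xml is never replaced, the function is safe to call more than once.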
We need to set the properties below in the core-site.xml configuration file.