In this post, we will discuss Hive installation on Ubuntu in pseudo-distributed mode. We are installing the latest release of Hive at the time of writing, which is version hive-0.13.1.
HCatalog is included with Hive starting with release 0.11.0, so we can optionally set up the configuration required for HCatalog in our Hive 0.13.1 installation as well. WebHCat is also installed with Hive starting from release 0.11.0.
Hive Installation on Ubuntu
Prerequisites:
- JDK 1.6 or a later version of Java installed on our Ubuntu machine.
- Hadoop 1 or Hadoop 2 installed and configured properly, with the HADOOP_HOME environment variable set to Hadoop's installation directory.
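The prerequisites above can be checked with a small shell sketch (the variable names and messages here are our own, not part of Hive or Hadoop):

```shell
# Hedged prerequisite check: reports whether java is on the PATH and
# whether HADOOP_HOME points at an existing directory.
JAVA_OK=no
HADOOP_OK=no
command -v java >/dev/null 2>&1 && JAVA_OK=yes
[ -n "$HADOOP_HOME" ] && [ -d "$HADOOP_HOME" ] && HADOOP_OK=yes
echo "java on PATH: $JAVA_OK"
echo "HADOOP_HOME set and present: $HADOOP_OK"
```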
Hive Installation Procedure
- Download the latest stable release of Hive that matches your existing Hadoop version. Generally, Hive works with the latest release of Hadoop. Hive binary tarballs can be downloaded from the Apache download mirrors.
- Copy the apache-hive-0.13.1-bin.tar.gz tarball to our preferred Hive installation directory, usually /usr/lib/hive, and unpack it there. Below are the commands to perform these steps.
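The copy-and-unpack steps might look like the following sketch. HIVE_PREFIX and TARBALL_DIR are our own illustrative variables, and the default prefix is a home-directory path so the snippet runs without sudo; for the post's /usr/lib/hive location, prefix the mkdir, cp, and tar commands with sudo.

```shell
# Sketch of copying and unpacking the Hive tarball; paths are assumptions.
HIVE_PREFIX="${HIVE_PREFIX:-$HOME/hive-install}"   # e.g. /usr/lib/hive
TARBALL_DIR="${TARBALL_DIR:-$HOME/Downloads}"      # where the tarball was saved
TARBALL="apache-hive-0.13.1-bin.tar.gz"

mkdir -p "$HIVE_PREFIX"
if [ -f "$TARBALL_DIR/$TARBALL" ]; then
  cp "$TARBALL_DIR/$TARBALL" "$HIVE_PREFIX/"
  tar -xzf "$HIVE_PREFIX/$TARBALL" -C "$HIVE_PREFIX"
else
  echo "tarball not found in $TARBALL_DIR; download it from a mirror first" >&2
fi
```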
And below is a screenshot from the installation terminal.
- Set the HIVE_HOME and HIVE_CONF_DIR environment variables in the .bashrc file as shown below, and add the Hive bin directory to the PATH environment variable.
- Optionally, we can set environment variables for HCatalog and WebHCat in the .bashrc file as well. Below is a snapshot of the .bashrc file after setting the above environment variables.
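As a sketch, the .bashrc additions could look like this; the HCatalog paths assume the layout of the apache-hive-0.13.1-bin tarball, which ships HCatalog under an hcatalog subdirectory:

```shell
# Hive environment variables; the install path matches the steps above.
export HIVE_HOME=/usr/lib/hive/apache-hive-0.13.1-bin
export HIVE_CONF_DIR=$HIVE_HOME/conf
export PATH=$PATH:$HIVE_HOME/bin

# Optional HCatalog / WebHCat variables (layout assumed from the tarball).
export HCAT_HOME=$HIVE_HOME/hcatalog
export PATH=$PATH:$HCAT_HOME/bin:$HCAT_HOME/sbin
```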
With the above installation instructions we can run the Hive service, but optionally we can set the configuration parameters below. All of these configuration changes are recommended but not mandatory for running a simple Hive service.
- Create a hive-site.xml file with the below properties under the HIVE_CONF_DIR ($HIVE_HOME/conf) directory, if it is not already present.
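A minimal hive-site.xml could look like the following sketch; the scratch-dir and warehouse values shown are the stock Hive 0.13 defaults, and the write path falls back to /tmp when HIVE_CONF_DIR is unset so the snippet runs anywhere:

```shell
# Sketch: write a minimal hive-site.xml with the three properties
# discussed in this section.
cat > "${HIVE_CONF_DIR:-/tmp}/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>-1</value>
  </property>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/tmp/hive-${user.name}</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
EOF
```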
Below are detailed descriptions of the above properties.
mapred.reduce.tasks : Typically set to a prime number close to the number of available hosts. Ignored when mapred.job.tracker is "local". Hadoop sets this to 1 by default, whereas Hive uses -1 as its default value. By setting this property to -1, Hive automatically figures out the appropriate number of reducers.
hive.exec.scratchdir : Hive stores per-query temporary/intermediate data sets under this directory; they are normally cleaned up by the Hive client when the query finishes.
When writing data to a table or partition, Hive will first write to a temporary location on the target table’s filesystem (using hive.exec.scratchdir as the temporary location) and then move the data to the target table.
hive.metastore.warehouse.dir : The HDFS directory location for storing managed tables under Hive's control.
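Before running the first queries, the warehouse directory and a shared /tmp directory are expected to exist in HDFS and be group writable (per the Hive getting-started guide). The DRY_RUN guard below is our own addition so the sketch is a no-op on machines without a Hadoop cluster:

```shell
# Create the HDFS directories Hive expects; dry-run by default, which
# only prints the commands instead of executing them.
DRY_RUN="${DRY_RUN:-1}"
for dir in /tmp /user/hive/warehouse; do
  cmd_mkdir="hadoop fs -mkdir $dir"
  cmd_chmod="hadoop fs -chmod g+w $dir"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$cmd_mkdir"
    echo "$cmd_chmod"
  else
    $cmd_mkdir
    $cmd_chmod
  fi
done
```

Set DRY_RUN=0 to actually run the commands against a live cluster.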
- By default, the Hive metastore runs in an embedded Derby database, which allows only one active Hive session at a time, so multiple Hive users can't access the hive server simultaneously. To configure the metastore to allow multiple concurrent users, read through the post Configuring Metastore for Hive.
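For reference, the embedded-Derby behaviour corresponds to the stock metastore connection properties below (a sketch of the defaults; a multi-user metastore replaces these with a networked JDBC database such as MySQL):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>
```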
Verify Hive Installation
With the above changes, the basic setup and configuration of the Hive server is done, and we are now ready to verify our installation and the Hive server.
We can verify the Hive installation with the $ hive --help command, or by starting the default Hive CLI service with the $ hive or $ hive --service cli commands as shown below.
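The check can be scripted as below; the guard and messages are our own, so the sketch is a no-op on machines where hive is not yet on the PATH:

```shell
# Verification sketch: run `hive --help` only when the binary is present.
HIVE_FOUND=no
if command -v hive >/dev/null 2>&1; then
  HIVE_FOUND=yes
  hive --help            # prints usage and the available service names
else
  echo "hive not on PATH yet; open a new shell or re-source ~/.bashrc" >&2
fi
echo "hive found: $HIVE_FOUND"
```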
If we receive messages as shown above, our installation is successful; otherwise, we need to review the instructions once again.
Note: HiveServer2, introduced in Hive 0.11, has a new CLI called Beeline. To use Beeline, execute the $ bin/beeline command from the Hive home directory.