- In Hive, metastore is the central repository to store metadata for hive tables/partitions. Any datastore that has a JDBC driver can be used as a metastore.
- By default, the metastore service runs in the same JVM as the Hive service and contains an embedded Derby database instance backed by the local disk.
There are 3 different ways to setup the metastore server using different Hive configurations:
- Embedded Metastore
- Local Metastore
- Remote Metastore
- Embedded metastore: Derby is the default database for the embedded metastore. It is a simple way to get started with Hive. But we can only have one Hive session open at a time that shares the same metastore. Trying to start a second session gives the error:
Failed to start database ‘metastore_db’
So, embedded metastore is not practical solution but can be used for unit tests.
- Local Metastore: In local metastore, a standalone database (MySQL or PostgreSQL or any other db with JDBC driver) is used as a metastore.
In this mode, multiple users can open connection to metastore and make SQL queries against it.
- Remote Metastore: In remote metastore, all Hive Clients will make a connection to a metastore server (MySQL or PostgreSQL or Oracle) which in turn queries the datastore. In this, metastore server run in a separate process to the Hive service.
Below is the high level architecture of three different Hive metastores:
Below are the configuration properties and their default values which will control the above metastore setups.
|CONFIGURATION PROPERTY||DEFAULT VALUE DESCRIPTION|
|javax.jdo.option.ConnectionURL||jdbc:derby:;databaseName=metastore_db;create=true Derby database located at current working directory|
|javax.jdo.option.ConnectionDriverName||org.apache.derby.jdbc.EmbeddedDriver Derby embeded JDBC driver class.|
|hive.metastore.warehouse.dir||/user/hive/warehouse location of database for the warehouse|
|hive.metastore.uris||Blank Thrift URI for the remote metastore. Blank for local mode.|
- Here in the above default settings, in javax.jdo.option.ConnectionURL property, databaseName=metastore_db;create=true, denotes that metastore_db will be created when it is not present in the current working directory.
- That’s why, whenever we try to run hive service in a different working directory, it will create a metastore_db in that folder as well and no tables from previous session will be available.
- if we specify the complete path in databaseName field like file:///usr/lib/hive/db then metastore_db folder will be created in that location and the same will be accessed every time when we start hive service. So, previous tables and partitions can be accessed even from other working directories.
To Maintain a single copy of Embedded Metastore:
Add the below property to hive-site.xml file.
Configuring Local Metastore:
Below are the configuration details for setting Local Metastore with separate MySQL database. If MySQL is not installed already, please follow the instructions from the post to install MySQL on Ubuntu.
Note: The JDBC driver JAR file for MySQL (Connector/J) must be on Hive’s classpath, which is simply achieved by placing it in Hive’s lib directory. MySQL Connector/J jar file can be downloaded from MySQL site.
Add the below properties to hive-site.xml file.
In the above hive-site.xml file, specify your root user password in place of xxxxxx in javax.jdo.option.ConnectionPassword property.
Start Hive CLI service:
Start hive CLI (command line interface) service with $ hive command on terminal after starting start-dfs.sh daemons we should get hive shell open without any error messages as shown below.
Start Hive Thrift server:
Start hive thrift server with below command and running service process can be verified with $ jps -lm command.
I.e. the above local metastore configuration is successful through standalone MySQL database. Successful start of hive service will create metastore database specified in hive-site.xml in MySQL with root privileges and we can verify the same.
With this we can say that Hive service with Local Metastore setup is successful.
Start Hive Metastore service:
We can start the metastore service with the below command.
Hive metastore listens on port 9083 by default and the same can be verified below to test whether metastore started successfully or not.
Configure Remote Metastore:
We have successfully configured local metastore in the above section. Suppose if we want to add another node (node2) to the existing cluster and new node should use the same metastore on node1, then we have to setup the hive-site.xml file in node2 in such a way that it should reference metastore present in the remote machine (node1).
To access the remote metastore from node2 we need to set the below property in hive-site.xml file in node2.
To access the tables created in node1 hive shell, from node2, we need to keep hive metastore service running (with command: hive –service metastore) on node1 and start using hive -cli from node2 after setting hive metastore URI successfully as shown above.