In this post, we will discuss about hive table commands with examples. This post can be treated as sequel to the previous post Hive Database Commands.
Hive Table Creation Commands
Introduction to Hive Tables
In Hive, Tables are nothing but collection of homogeneous data records which have same schema for all the records in the collection.
Hive Table = Data Stored in HDFS + Metadata (Schema of the table) stored in RDBMS
Hive metadata is stored in hive metastore configured via any RDBMS (default is Derby but can be configured to any of these: MySQL, PostGreSQL, Oracle, MS SQL Server, etc…). So, Hive metadata is not stored on HDFS.
Hive table data can be stored local filesystem as well, when running in local mode.
Hive Table Types
Hive supports two types of tables.
Managed Tables – Default table type in Hive
- Tables data is manged by Hive by moving data into its warehouse directory configured by hive.metastore.warehouse.dir (by default /user/hive/warehouse).
- If this table is dropped both data and metadata (schema) are deleted. I.e. these tables are owned by Hive.
- Less convenient to share with other tools like Pig, HBase etc, as these are maintained by Hive and data can be deleted without informing these tools.
- These tables are not managed or owned by Hive. And tables data will not be copied into hive warehouse directory but maintained at external location
- If these tables are dropped only the schema from metastore will be deleted but not the data files from external location.
- Provides convenience to share the tables data with other tools like Pig, HBase, etc…
- “Location” Clause is mandatory to create an external table otherwise table will be managed by Hive only even if we create it with “External” keyword.
- By the name itself, these are temporary and available till end of current session only.
- Useful in case of creating intermediate tables to copy data records from one table to another but can be deleted after our copy operation.
- Table’s Data will be stored in the user’s scratch directory configured by hive.exec.scratchdir, and deleted at the end of the session.
- Temporary tables doesn’t support Partitioning & Indexing.
- Be careful in naming temporary tables, As Hive doesn’t warn or error out if we use a name that already exist in the database.
- If a temporary table is created with same name of a permanent table which already exists in the database, then, original table can’t be accessed in that session until we drop the temporary table.
- These are introduced for first time in Hive-0.14.0 to improve performance of tables with one or more columns having skewed (repeated) values.
- Hive will split the skewed (very often) values records into separate files, and the same will be considered into account at the time of querying this table, so that it can skip (or include) the whole file based on the input criteria.
- These are not separate table types, but can be managed or external.
Comparison With Partitioned Tables and Skewed Tables
- Skewing technique similar to Partitioning but it is recommended when only few values are occurring very often in input.
- Like, If we partition a table by country and there 200 countries in input file, but 80% records are from only US, UK, IN, JPN, then it is better to go by Skewing by country for four values. In skewing, it will create only 5 separate files/directories (4 for US, UK, IN, JPN and 1 for remaining all) where as partitioning will create 200 directories making the structure very complex.
- One of the main disadvantage of Partitioning is that HDFS Scalability will be an issue more partitioning is done. For example, if there are 1000 mappers and 1000 partitions, and each mapper gets at least 1 row for each key, we will end up in creating 1 million intermediate files, So Namenode’s memory will be in trouble to store metadata about all these files.
Hive Tables Creation
Hive Tables can be created with CREATE TABLE command which is very similar to MySQL syntax with some hive extensions. Make a note that below HiveQL syntax is case insensitive but just for better readability, we have used keywords in uppercase.
Simple Hive Table Creation Syntax is follows.
But there is another scaring version which is very complex to remember but provides lots of ways to create tables. We will try to provide examples for as many keywords as possible, rest can be tried on your own.
Below are the detailed descriptions of each option/argument to CREATE TABLE command.
- TEMPORARY – Specified for creation of temporary tables
- EXTERNAL – Specified only for external tables
- IF NOT EXISTS – Suppresses error messages when a table already exists with same name and ignores creation of table again even if there is a schema difference between existing table and new table.
- db_name – This is also optional but can be used to specify the table under a particular target database, if we are not already working under it.
- COMMENT – Similar to CREATE DATABASE statement comments, we can add comments to table as well as to columns (strings within single quotes) to provide descriptive information to users.
- PARTITIONED BY – This clause is useful to partition the tables based on particular columns. Detailed discussion on Partitioning is deferred to another individual post Partitioning and Clustering tables in Hive.
- CLUSTERED BY – This clause is used to provide more structure to tables and partitions. Detailed discussion is deferred to another individual post Partitioning and Clustering tables in Hive.
- SKEWED BY – This clause is useful to create skewed tables. Further details can be discussed during example for this clause.
Below is an example of syntax for a table with two skewed columns:
- ROW FORMAT – This clause is used to specify the format of each row in the input data. If data fields are delimited by certain characters we can use DELIMITED sub-clause or we need to provide a SERDE that can serialize or deserialize the input data records.
Below are the default values/Syntax for DELIMITED clause.
I.e. default field delimiter is Ctrl+A (octal representation – ‘\001’) (also represented as ^A), collection delimiter is Ctrl+B (‘\002’ or ^B) and map keys are terminated by Ctrl+C (‘\003’ or ^C) and lines terminator is new line feed ‘\n’.
Syntax for SERDE clause:
- STORED AS – Storage file format can be specified in this clause. Below are the available file formats for hive table creation.
Detailed information on storage formats supported in Hive is deferred to post Hive Storage formats.
- STORED BY class_name [WITH SERDEPROPERTIES (…)]
It is an alternative to above two clauses (ROW FORMAT & STORED AS) to provide custom row format handler class_name and custom serde properties.
- LOCATION – Directory location for table data will be specified under this clause.
- TBLPROPERTIES – Metadata key/value pairs can be tagged to the table. last_modified_user and last_modified_time properties are automatically added under table properties and managed by Hive. Some example predefined table properties are,