Hive Table Creation Commands 1


In this post, we will discuss about hive table commands with examples. This post can be treated as sequel to the previous post Hive Database Commands.

Hive Table Creation Commands

Introduction to Hive Tables

In Hive, Tables are nothing but collection of homogeneous data records which have same schema for all the records in the collection.

Hive Table = Data Stored in HDFS + Metadata (Schema of the table) stored in RDBMS 

Hive metadata is stored in hive metastore configured via any RDBMS (default is Derby but can be configured to any of these: MySQL, PostGreSQL, Oracle, MS SQL Server, etc…). So, Hive metadata is not stored on HDFS.

Hive table data can be stored local filesystem as well, when running in local mode.

Hive Table Types 

Hive supports two types of tables.

  • Managed Tables – Default table type in Hive
    • Tables data is manged by Hive by moving data into its warehouse directory configured by hive.metastore.warehouse.dir (by default /user/hive/warehouse).
    • If this table is dropped both data and metadata (schema) are deleted. I.e. these tables are owned by Hive.
    • Less convenient to share with other tools like Pig, HBase etc, as these are maintained by Hive and data can be deleted without informing these tools.
  • External Tables
    • These tables are not managed or owned by Hive. And tables data will not be copied into hive warehouse directory but maintained at external location
    • If these tables are dropped only the schema from metastore will be deleted but not the data files from external location.
    • Provides convenience to share the tables data with other tools like Pig, HBase, etc…
    • “Location” Clause is mandatory to create an external table otherwise table will be managed by Hive only even if we create it with “External” keyword.
  • Temporary Tables
    • By the name itself, these are temporary and available till end of current session only.
    • Useful in case of creating intermediate tables to copy data records from one table to another but can be deleted after our copy operation.
    • Table’s Data will be stored in the user’s scratch directory configured by hive.exec.scratchdir, and deleted at the end of the session.
    • Temporary tables doesn’t support Partitioning & Indexing.

Warning:

  • Be careful in naming temporary tables, As Hive doesn’t warn or error out if we use a name that already exist in the database.
  • If a temporary table is created with same name of a permanent table which already exists in the database, then, original table can’t be accessed in that session until we drop the temporary table.
  • Skewed Tables
    • These are introduced for first time in Hive-0.14.0 to improve performance of tables with one or more columns having skewed (repeated) values.
    • Hive will split the skewed (very often) values records into separate files, and the same will be considered into account at the time of querying this table,  so that it can skip (or include) the whole file based on the input criteria.
    • These are not separate table types, but can be managed or external.
  • Comparison With  Partitioned Tables and Skewed Tables
    • Skewing technique similar to Partitioning but it is recommended when only few values are occurring very often in input.
    • Like, If we partition a table by country and there 200 countries in input file, but 80% records are from only US, UK, IN, JPN, then it is better to go by Skewing by country for four values. In skewing, it will create only 5 separate files/directories (4 for US, UK, IN, JPN and 1 for remaining all) where as partitioning will create 200 directories making the structure very complex.
    • One of the main disadvantage of Partitioning is that HDFS Scalability will be an issue more partitioning is done. For example, if there are 1000 mappers and 1000 partitions, and each mapper gets at least 1 row for each key, we will end up in creating 1 million intermediate files, So Namenode’s memory will be in trouble to store metadata about all these files.

Hive Tables Creation

Hive Tables can be created with CREATE TABLE command which is very similar to MySQL syntax with some hive extensions. Make a note that below HiveQL syntax is case insensitive but just for better readability, we have used keywords in uppercase.

Simple Hive Table Creation Syntax is follows.

But there is another scaring version which is very complex to remember but provides lots of ways to create tables. We will try to provide examples for as many keywords as possible, rest can be tried on your own.

Complex Syntax

Below  are the detailed descriptions of each option/argument to CREATE TABLE command.

  • TEMPORARY – Specified for creation of temporary tables
  • EXTERNAL – Specified only for external tables
  • IF NOT EXISTS – Suppresses error messages when a table already exists with same name and ignores creation of table again even if there is a schema difference between existing table and new table.
  • db_name – This is also optional but can be used to specify the table under a particular target database, if we are not already working under it.
  • COMMENT – Similar to CREATE DATABASE statement comments, we can add comments to table as well as to columns (strings within single quotes) to provide descriptive information to users.
  • PARTITIONED BY – This clause is useful to partition the tables based on particular columns. Detailed discussion on Partitioning is deferred to another individual post Partitioning and Clustering tables in Hive.
  • CLUSTERED BY – This clause is used to provide more structure to tables and partitions. Detailed discussion is deferred to another individual post Partitioning and Clustering tables in Hive.
  • SKEWED BY – This clause is useful to create skewed tables. Further details can be discussed during example for this clause.

Below is an example of syntax for a table with two skewed columns:

  • ROW FORMAT – This clause is used to specify the format of each row in the input data. If data fields are delimited by certain characters we can use DELIMITED sub-clause or we need to provide a SERDE that can serialize or deserialize the input data records.

Below are the default values/Syntax for DELIMITED clause.

I.e. default field delimiter is Ctrl+A (octal representation – ‘\001’) (also represented as ^A), collection delimiter is Ctrl+B (‘\002’ or ^B) and map keys are terminated by Ctrl+C (‘\003’ or ^C) and lines terminator is new line feed ‘\n’. 

Syntax for SERDE clause:

  • STORED AS –  Storage file format can be specified in this clause. Below are the available file formats for hive table creation.

Detailed information on storage formats supported in Hive is deferred to post Hive Storage formats.

  • STORED BY class_name [WITH SERDEPROPERTIES (…)]

It is an alternative to above two clauses (ROW FORMAT & STORED AS) to provide custom row format handler class_name and custom serde properties.

  • LOCATION – Directory location for table data will be specified under this clause.
  • TBLPROPERTIES   Metadata key/value pairs can be tagged to the table. last_modified_user and last_modified_time properties are automatically added under table properties and managed by Hive. Some example predefined table properties are,

  •  AS select_statement – AS clause is used to create table similar to the schema of the select_statement (another query statement) and populated with the output records of select_statement. It is also know as CTAS (Create Table AS) clause.
Create Table Like

Combination of LIKE clause with CREATE TABLE can be used to create tables similar to another existing tables or views. Its syntax is as follows.

Unlike CTAS clause, Only the schema is copied from existing table or view but not the records.

Hive Table Creation Examples

For detailed description on datatypes of columns used in table refer the post Hive Datatypes.

Example 1 – Managed Table with Different Data types

Based on the above knowledge on table creation syntax, Lets create a hive table suitable for user data records (most common use case) attached below.

Sample User records for table creation –> User_Records

(For testing purpose, we have created only 5 records to cover all the commonly used data types)

Header record for this data is as follows.

Brief notes on data, all names are string literals, address is a combination of two fields zip code and street name by tab delimited, and phone_nos is an array of numbers that should fit any number of values by tab separated.

Below are the HiveQL statements that can be used to create a table that is suitable for the above data and in this HiveQL we are also loading data and querying it for verification. Lets save this HiveQL in User_Create.hql file. Save the sample user records into User_Records.txt file under home directory and update the same path in below HiveQL.

Execute the above script and below is the screen shot of the output from the query.

Hive Table OutputFor better readability we have printed the column names in the header and we can observe all the data records are parsed correctly and each field is having appropriate values.

Example 2 – External Table with Create Like Command

Lets uncover the few more options in this example by creating an external table in test_db database with same schema and  records as in user table created above.

Below HiveQL can be used to create the above table. Lets save this into Like_User.hql and execute it from terminal.

We may feel like, as we need both the schema as well as the records from default.user table into test_db.user table, we can use CTAS command but CTAS doesn’t support external tables.

Below is the output of the above HiveQL execution.

Example 3 – External Table with ORC FileFomat & Snappy Compressed

Below is an example External Table creation with File Format as ORC and Compressed with Sanppy Codec.

Below is the output of the execution of above script.

Formatted Description of the USER_ORC table is given below.

Example 4  – Skewed Tables Stored in SequenceFile

Below is the sample HiveQL for creation of Skewed tables in Hive.

Output screen:

Hive Skewed Tables Example

Examine the properties of table in the detailed table information obtained from DESCRIBE FORMATTED user_skewed command.

Example 5 – Temporary Tables

Below is an example of creation for temporary tables in Hive. In this we are creating temp table and inserting records into it and querying the records. At the bottom, we are view detailed description of the table.

Notes:
  • Table names and column names are case insensitive
  • Comments are single quoted string literals
  • CREATE-TABLE-AS-SELECT cannot create external tables, partitioned tables and bucketed tables.
  • Default Storage format is TEXTFILE.
  • Default field delimiter is not tab (‘\t’) but it is Ctrl+A (‘\001’)
  • Table can be made freeze so that no other changes are allowed on this with table property TBLPROPERTIES (“immutable”=”true”).
References

Apache Hivehttps://cwiki.apache.org/confluence/display/Hive/Home


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.


Leave a comment

Your email address will not be published. Required fields are marked *

One thought on “Hive Table Creation Commands


Review Comments
default image

I have attended Siva’s Spark and Scala training. He is good in presentation skills and explaining technical concepts easily to everyone in the group. He is having excellent real time experience and provided enough use cases to understand each concepts. Duration of the course and time management is awesome. Happy that I found a right person on time to learn Spark. Thanks Siva!!!

Dharmeswaran ETL / Hadoop Developer Spark Nov 2016 September 21, 2017

.