Hive


Hadoop and Hive Interview Cheat Sheet

Hive: a SQL-based data warehouse application built on top of Hadoop (SELECT, JOIN, GROUP BY, ...). It is a platform for writing SQL-like scripts that run as MapReduce operations. PARTITIONING: Partitioning a table changes how Hive structures the data storage. It is used to distribute load horizontally, e.g. PARTITIONED BY (country STRING, state STRING); A partition is a subset of a table’s data set in which one column has the same value for all records in the subset. In Hive, as in most databases […]


String Functions in Hive

This post is about basic String Functions in Hive with syntax and examples. Creating Table in HIVE:

String Functions and Normal Queries:

ASCII: The ASCII function converts the first character of the string into its numeric ASCII value.

CONCAT: The CONCAT function concatenates all the strings/columns.

CONCAT_WS: Syntax: “CONCAT_WS(string delimiter, string str1, str2, ...)”. The CONCAT_WS function concatenates the strings using the given delimiter; it accepts only strings and columns with datatype string.

[…]
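The three functions described above can be mimicked in a few lines of Python to make their behaviour concrete. This is only a sketch of the semantics as the excerpt describes them, not Hive's implementation: the names `hive_ascii`, `hive_concat` and `hive_concat_ws` are hypothetical, the empty-string result of `hive_ascii` is an assumption, and NULL handling is deliberately left out.

```python
def hive_ascii(s: str) -> int:
    """Numeric code of the first character, like ASCII(str).

    Returning 0 for an empty string is an assumption of this sketch.
    """
    return ord(s[0]) if s else 0


def hive_concat(*parts: str) -> str:
    """Join all arguments with no separator, like CONCAT(str1, str2, ...)."""
    return "".join(parts)


def hive_concat_ws(delimiter: str, *parts: str) -> str:
    """Join all arguments with a delimiter, like CONCAT_WS(delimiter, str1, ...)."""
    return delimiter.join(parts)


first_code = hive_ascii("hadoop")                          # code of 'h'
joined = hive_concat("hive", "ql")                         # no separator
csv_row = hive_concat_ws(",", "hadoop", "hive", "sqoop")   # comma-separated
print(first_code, joined, csv_row)
```

Only the first character matters to `hive_ascii`, which is why the rest of the word has no effect on the result.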


Hive Aggregate Functions

Creating Table in HIVE :

Aggregated Functions and Normal Queries:

SUM: Returns the sum of the elements in the group, or the sum of the distinct values of the column in the group.

COUNT: count(*) – Returns the total number of retrieved rows, including rows containing NULL values; count(expr) – Returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]) – Returns the […]
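The difference between the COUNT forms, and between SUM and SUM(DISTINCT), is easiest to see on a column that contains duplicates and a NULL. The sketch below uses Python's built-in `sqlite3` module as a stand-in, on the assumption that SQLite handles NULLs and duplicates in these aggregates the way the descriptions above state; it is not run against Hive, and the table `t` and column `v` are made up for illustration.

```python
import sqlite3

# In-memory database used purely as a stand-in for a Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (1,), (2,), (None,)])

row = conn.execute(
    """
    SELECT COUNT(*),           -- all rows, including the NULL one
           COUNT(v),           -- rows where v is non-NULL
           COUNT(DISTINCT v),  -- distinct non-NULL values of v
           SUM(v),             -- sum over all non-NULL values
           SUM(DISTINCT v)     -- sum over distinct values only
    FROM t
    """
).fetchone()
conn.close()

total_rows, non_null_rows, distinct_values, total_sum, distinct_sum = row
print(total_rows, non_null_rows, distinct_values, total_sum, distinct_sum)
```

With the four rows inserted here, the NULL row is counted only by `COUNT(*)`, and the duplicate value 1 is counted once by the DISTINCT variants.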


Hive Date Functions

Hive Date Functions. from_unixtime: This function converts the number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a STRING that represents the TIMESTAMP of that moment in the current system time zone, in the format “1970-01-01 00:00:00”. The following example returns the current date including the time.
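As a rough illustration of what from_unixtime computes, the Python sketch below converts a number of seconds since the Unix epoch into a string in the same “yyyy-MM-dd HH:mm:ss” layout. It mirrors the description above rather than Hive's code; the optional `tz` parameter is an addition of this sketch so that the result can be shown independently of the machine's time zone.

```python
import time
from datetime import datetime, timezone
from typing import Optional


def from_unixtime(seconds: int, tz: Optional[timezone] = None) -> str:
    """Format seconds since the Unix epoch as 'YYYY-MM-DD HH:MM:SS'.

    With tz=None the local system time zone is used, as the description
    of from_unixtime states; passing a tz is an extra of this sketch.
    """
    return datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


epoch_utc = from_unixtime(0, timezone.utc)        # the epoch itself, in UTC
next_day_utc = from_unixtime(86400, timezone.utc)  # exactly one day later
now_local = from_unixtime(int(time.time()))        # current date and time, local zone
print(epoch_utc, next_day_utc, now_local)
```

Calling the function without `tz` gives a result that depends on where it runs, which is the same caveat the description attaches to the system time zone.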

from_utc_timestamp: This function assumes that the string in the first expression is in UTC and then converts that string to the […]


Hive Functions Examples

Hive Functions Examples: SET, SHOW, USE, CREATE DATABASE, CREATE MANAGED TABLE, CREATE EXTERNAL TABLE, CREATING TABLE FROM EXISTING TABLE, CREATING EXTERNAL TABLES FROM MANAGED TABLES, LOAD, COPY DATA FROM ONE TABLE TO ANOTHER, DROP, QUIT, SELECT, DESCRIBE, DESCRIBE SPECIFIC FIELD, DESCRIBE EXTENDED, ALTER, CLONE SCHEMA (DATA IS NOT COPIED), CLONE SCHEMA TO ANOTHER DB, USING REGULAR EXPRESSIONS, MATHEMATICAL FUNCTIONS, AGGREGATE FUNCTIONS, LIMIT, NESTED SELECT STATEMENT, CASE..WHEN..THEN, LIKE & RLIKE, JOINS […]


Hive Performance Tuning

In our previous post we discussed Hadoop job optimization, or Hadoop performance tuning for MapReduce jobs. In this post we will briefly discuss a few points on how to optimize Hive queries, i.e. Hive performance tuning. If Hive is not tuned properly, even select queries on smaller tables can sometimes take minutes to return results. So, because of this reason Hive […]


Enable Compression in Hive

For data-intensive workloads, I/O operations and network data transfer take considerable time to complete. By enabling compression in Hive we can improve the performance of Hive queries as well as save storage space on the HDFS cluster. Find Available Compression Codecs in Hive: To enable compression in Hive, we first need to find out the available compression codecs on the Hadoop cluster, and we can use the below set command […]


Hive Built In Functions

Functions in Hive are categorized as below. Mathematical Functions: These functions are mainly used to perform mathematical calculations. Date Functions: These functions are used to perform operations on date data types, such as adding a number of days to a date. String Functions: These functions are used to perform operations on strings, such as finding the length of a string. Conditional Functions: These functions are used to […]


Hive Authorization Models and Hive Security

In this post, we will discuss Hive authorization models and Hive security. Before discussing Hive authorization models, let's note the difference between authentication and authorization. Authentication – Verifying the identity of the user, i.e. whether the logged-in user is the real user or not. Authorization – Verifying whether a user has permission to perform a certain action. Hive Authorization Models: In Hive, authorization is not enabled by default. But […]


Hive JDBC Client Example

In this post, we will discuss one of the common Hive clients, the JDBC client, for both HiveServer1 (Thrift Server) and HiveServer2. Use of HiveServer2 is recommended, as HiveServer1 has several concurrency issues and lacks some features available in HiveServer2. JDBC Data Types: The following table lists the data types implemented for HiveServer/HiveServer2 JDBC (columns: Hive Type, Java Type, Specification). TINYINT, byte, signed or unsigned 1-byte integer; SMALLINT, short, signed 2-byte integer; INT, int […]


HiveServer2 Beeline Introduction

In this post we will introduce HiveServer2 and Beeline. As of hive-0.11.0, Apache Hive started decoupling HiveServer2 from Hive, in order to overcome the limitations of the existing Hive Thrift Server. Limitations of Hive Thrift Server 1: No Sessions/Concurrency (essentially need 1 server per client); Security; Client Interface; Stability. Sessions/Concurrency: The old Thrift API and server implementation didn’t support concurrency. Authentication/Authorization: Incomplete implementations of Authentication (verifying the identity of […]


Sqoop Hive Use Case Example

This is another use case on Sqoop and Hive concepts. Hive Use Case Example. Problem Statement: There are about 35,000 crime incidents that happened in the city of San Francisco in the last 3 months. Our task is to store this relational data in an RDBMS and use Sqoop to import it into Hadoop. Can we answer the following queries on this data: Relative frequencies of different types of crime incidents […]


Hive Use case example for JSON Data

Hive use case example with US government web sites data. The example data to analyze is the UsaGovData file. The data in this file is in JSON format, and its JSON schema is as shown below,

Note: If you copy the text file into LFS, make sure that you do not have any empty lines at the end of the file; otherwise you will encounter the below exception

[…]


Bucketing In Hive

In our previous post we discussed partitioning in Hive; now we will focus on bucketing in Hive, which is another way of giving a more fine-grained structure to Hive tables. Bucketing in Hive: Partitioning in Hive offers a way of segregating Hive table data into multiple files/directories. But partitioning gives effective results only when there is a limited number of partitions and the partitions are of comparatively equal size. But this may not […]


Partitioning in Hive

In this post, we will discuss one of the most critical and important concepts in Hive: partitioning in Hive tables. Partitioning in Hive: Table partitioning means dividing table data into parts based on the values of particular columns, such as date or country, so that the input records are segregated into different files/directories based on date or country. Partitioning can be done on more than one column, which will impose a multi-dimensional structure on the directory […]
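To picture how partition columns turn into a directory structure, here is a small Python sketch that groups records by their partition values and builds one directory path per combination. It assumes the `column=value` directory naming that Hive uses for partitions; the table location `/warehouse/sales`, the column names and the sample records are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def partition_path(table_dir: str, partition_cols: List[str], record: Dict) -> str:
    """Build the directory a record would land in: one col=value level per column."""
    levels = [f"{col}={record[col]}" for col in partition_cols]
    return "/".join([table_dir, *levels])


# Hypothetical sample rows; country and state play the role of partition columns.
records = [
    {"country": "US", "state": "CA", "amount": 10},
    {"country": "US", "state": "NY", "amount": 20},
    {"country": "US", "state": "CA", "amount": 30},
    {"country": "IN", "state": "KA", "amount": 40},
]

# Group the non-partition data by target directory.
layout = defaultdict(list)
for rec in records:
    layout[partition_path("/warehouse/sales", ["country", "state"], rec)].append(rec["amount"])

for path in sorted(layout):
    print(path, layout[path])
```

Each additional partition column adds one more directory level, which is the multi-dimensional structure the excerpt refers to.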


Review Comments

I am a PL/SQL developer, interested in moving into big data.

Neetika Singh ITA Hadoop in Dec/2016 December 22, 2016
