

Hive Functions Examples

SET

SHOW

USE

CREATE DATABASE

CREATE MANAGED TABLE

CREATE EXTERNAL TABLE

CREATING TABLE FROM EXISTING TABLE

CREATING EXTERNAL TABLES FROM MANAGED TABLES

LOAD

COPY DATA FROM ONE TABLE TO ANOTHER

DROP

QUIT

SELECT

DESCRIBE

DESCRIBE SPECIFIC FIELD

DESCRIBE EXTENDED

ALTER

CLONE SCHEMA (DATA IS NOT COPIED)

CLONE SCHEMA TO ANOTHER DB

USING REGULAR EXPRESSIONS

MATHEMATICAL FUNCTIONS

AGGREGATE FUNCTIONS

LIMIT

NESTED SELECT STATEMENT

CASE..WHEN..THEN

LIKE & RLIKE

JOINS

ORDER BY, SORT BY, DISTRIBUTE BY WITH SORT BY, CLUSTER BY

SET

SHOW

USE

CREATE DATABASE

CREATE MANAGED TABLE

CREATE EXTERNAL TABLE:

CREATING TABLE FROM EXISTING TABLE

CREATING EXTERNAL TABLES FROM MANAGED TABLES:

LOAD

COPY DATA FROM ONE TABLE TO ANOTHER

DROP

QUIT

SELECT

DESCRIBE

DESCRIBE SPECIFIC FIELD

DESCRIBE EXTENDED

ALTER

CLONE SCHEMA (DATA IS NOT COPIED)

CLONE SCHEMA TO ANOTHER DB

--EMPLOYEES TABLE:

USING REGULAR EXPRESSIONS:

COMPUTING WITH COLUMN VALUES (using functions & arithmetic expressions)

ARITHMETIC OPERATORS:

+, -, *, /, %, & (bitwise AND), | (bitwise OR), ^ (bitwise XOR), ~ (bitwise NOT)
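
A minimal sketch of arithmetic on column values, using the employees table defined in the SCRIPTS section below:

hive> SELECT name, salary + 500, salary * 1.1, cast(salary AS INT) % 1000 FROM employees;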

 

MATHEMATICAL FUNCTIONS:

round, rand, abs, etc
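
A short sketch (salary comes from the employees table defined in the SCRIPTS section below):

hive> SELECT name, round(salary), abs(salary - 75000), rand() FROM employees;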

 

 

AGGREGATE FUNCTIONS:

count, sum, avg, min, max, etc.

NOTE: setting hive.map.aggr=true makes Hive attempt top-level aggregation in the map phase

NOTE: An aggregation that is NOT top-level is an aggregation after GROUP BY

NOTE: count(DISTINCT symbol) returns ZERO if symbol is a partition column (a known quirk in older Hive versions)
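
A short sketch; hive.map.aggr is the setting referred to above:

hive> SET hive.map.aggr=true;
hive> SELECT count(*), avg(salary), min(salary), max(salary) FROM employees;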

TABLE GENERATING FUNCTIONS (inverse of aggregate functions):

explode, json_tuple, parse_url_tuple, stack

NOTE:    aggregate functions: many rows --> one result

table generating functions: one column --> many rows/columns
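
For example, explode turns the subordinates array (from the employees table below) into one row per element; pairing the result with other columns requires LATERAL VIEW:

hive> SELECT explode(subordinates) AS sub FROM employees;
hive> SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subView AS sub;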

OTHER BUILT-IN FUNCTIONS

LIMIT

NESTED SELECT STATEMENT
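
A minimal sketch using Hive's FROM-first form with a subquery (employees as defined below):

hive> FROM (SELECT name, salary FROM employees WHERE salary > 50000) e
    > SELECT e.name, e.salary * 0.1 AS bonus;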

CASE..WHEN..THEN
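
A minimal sketch (the salary brackets are made up for illustration):

hive> SELECT name, salary,
    >   CASE WHEN salary < 50000.0 THEN 'low'
    >        WHEN salary < 100000.0 THEN 'middle'
    >        ELSE 'high'
    >   END AS bracket
    > FROM employees;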

HIVE AVOIDS MAP REDUCE:

NOTE: Hive avoids MapReduce if the WHERE clause filters only on partition keys
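
For example, assuming stocks is partitioned by exchange and symbol, this filter touches only partition keys and can be answered without a MapReduce job:

hive> SELECT * FROM stocks WHERE exchange = 'NASDAQ' AND symbol = 'AAPL' LIMIT 5;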

FORCING HIVE TO AVOID MAP REDUCE:

NOTE: There is no guarantee that Hive won't invoke MapReduce
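
The usual knob is local-mode auto-selection:

hive> SET hive.exec.mode.local.auto=true;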

PREDICATE OPERATORS:

NOTE: Use extreme caution when comparing floating point numbers;

Avoid all implicit casts from smaller to wider types;

Avoid floating point numbers for anything involving money

LIKE & RLIKE

NOTE:    period (.) --> any single character

* --> the character appearing immediately to the left of * can repeat 0 or more times

| --> or
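
A sketch against the address struct of the employees table below (the street values are hypothetical):

hive> SELECT name, address.street FROM employees WHERE address.street LIKE '%Ave.';
hive> SELECT name, address.street FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';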

GROUP BY
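
A minimal sketch (price_close is an assumed column of the stocks table used elsewhere in this post):

hive> SELECT symbol, avg(price_close) FROM stocks GROUP BY symbol;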

JOINS

Hive supports:

  • INNER JOIN
  • LEFT OUTER JOIN
  • RIGHT OUTER JOIN
  • FULL OUTER JOIN
  • LEFT SEMI JOIN (returns records from the left table if matching records are found in the right table that satisfy the ON predicates)
  • CARTESIAN PRODUCT JOIN (select * from T1 join T2)
  • MAP SIDE JOIN (small table cached in memory; small table on the right side of the join)

--> Hive doesn't support OR between predicates in the ON clause (AND is supported)

--> You can join more than 2 tables
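
A hedged sketch: an inner join, then the same join with a map-join hint (the dividends table and the ymd/price_close/dividend columns are assumptions):

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;
hive> SELECT /*+ MAPJOIN(d) */ s.ymd, s.symbol, s.price_close, d.dividend
    > FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol;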

ORDER BY, SORT BY, DISTRIBUTE BY WITH SORT BY, CLUSTER BY

  • ORDER BY --> total ordering
  • SORT BY --> orders data within each reducer (local ordering)
  • if hive.mapred.mode=strict --> a LIMIT clause is required with ORDER BY
  • DISTRIBUTE BY controls how map output is divided among reducers
  • CLUSTER BY = DISTRIBUTE BY + SORT BY

--NOTE: Comparison:

ITEM            ORDERING            # REDUCERS                                      OUTPUT

ORDER BY        Global              1 reducer (unacceptable for large data sets)    1 sorted file

SORT BY         Local to reducer    N reducers (overlapping data ranges)            >= N sorted files

DISTRIBUTE BY   No sorting          N reducers (non-overlapping data)               >= N files (not sorted)

CLUSTER BY      Global (result)     N reducers (non-overlapping data, sorted)       >= N sorted files

Example:
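
A sketch (ymd and price_close are assumed stocks columns): DISTRIBUTE BY routes all rows for a symbol to the same reducer and SORT BY orders them within it; when the two column lists match, CLUSTER BY is the shorthand:

hive> SELECT s.symbol, s.ymd, s.price_close FROM stocks s
    > DISTRIBUTE BY s.symbol SORT BY s.symbol ASC, s.ymd ASC;
hive> SELECT s.symbol, s.ymd, s.price_close FROM stocks s CLUSTER BY s.symbol;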

CAST
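
A minimal sketch:

hive> SELECT name, cast(salary AS INT) FROM employees;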

TABLESAMPLE-BUCKET

  • tablesample (bucket x out of y [on column_name]);
  • tablesample (bucket x out of y on rand()) table_alias;
  • Rows of the table are bucketed on the given column (or randomly, with rand()) into y buckets numbered 1 to y
  • Rows belonging to bucket x are returned
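
A sketch of both forms:

hive> SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand()) t;
hive> SELECT * FROM employees TABLESAMPLE(BUCKET 1 OUT OF 4 ON name) t;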

BLOCK SAMPLING

--> tablesample(n percent) --> samples n% of the data size (not n% of rows)

--NOTE: The smallest unit of sampling is a single HDFS block; if the table is smaller than one block, all rows are returned;

--> In percentage sampling, 'set hive.sample.seednumber=<integer>' lets you control which subsets of data are sampled
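
A sketch of percentage sampling with a fixed seed:

hive> SET hive.sample.seednumber=7;
hive> SELECT * FROM stocks TABLESAMPLE(0.1 PERCENT) s;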

--INPUT PRUNING FOR BUCKETED TABLES:

--> Typically, tablesample scans the entire table, which is not efficient; but if the columns specified in the TABLESAMPLE clause match the columns in the CLUSTERED BY clause, tablesample queries scan only the required hash partitions of the table.

 

-UNION ALL (combines 2 or more tables)

--> number and type of columns must match
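
The query itself was lost in formatting; a hypothetical reconstruction of its shape (a union of two selects over the same stocks table, written to the output directory listed below; price_close is an assumed column):

INSERT OVERWRITE DIRECTORY 'output/union.out'
SELECT u.symbol, u.price_close FROM (
  SELECT s1.symbol, s1.price_close FROM stocks s1 WHERE s1.price_close < 10
  UNION ALL
  SELECT s2.symbol, s2.price_close FROM stocks s2 WHERE s2.price_close >= 30
) u;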

--NOTE: The query above unions selects over the same table; the same results can be achieved using a single SELECT with a WHERE clause

 [cloudera@localhost ~]$ hadoop fs -ls output/union.out;

Found 1 items

-rw-r--r--   3 cloudera supergroup     278413 2015-02-23 14:58 output/union.out/000000_0

[cloudera@localhost ~]$ hadoop fs -cat output/union.out/000000_0

sample_output:

CRT30.1

CRT30.7

CRT30.7

CRT31.56

 

VIEWS

--> Materialized views are not supported in Hive

--> Views can be used to reduce query complexity
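
A sketch matching the view text shown below (the view name is an assumption):

hive> CREATE VIEW IF NOT EXISTS employees_with_subs AS
    > SELECT * FROM employees_no_partition WHERE size(subordinates) > 0;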

# View Information (from DESCRIBE FORMATTED output):

View Original Text:            select * from employees_no_partition where size(subordinates) > 0

 

--NOTE: Table Type:   MANAGED_TABLE --> for managed tables

Table Type:   EXTERNAL_TABLE --> for external tables

 

--CREATE TABLES FROM VIEWS:
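
A minimal sketch, reusing the hypothetical view above:

hive> CREATE TABLE employees_with_subs_copy AS SELECT * FROM employees_with_subs;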

 

--ALTER VIEW:
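
A minimal sketch:

hive> ALTER VIEW employees_with_subs SET TBLPROPERTIES ('comment' = 'employees having subordinates');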

 

--DROP VIEW:
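
A minimal sketch:

hive> DROP VIEW IF EXISTS employees_with_subs;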

 

--INDEX:

 

[cloudera@localhost ~]$ cat scripts/create_index.hql;

 


[cloudera@localhost ~]$ hive -f 'scripts/create_index.hql';

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.3.0.jar!/hive-log4j.properties

Hive history file=/tmp/cloudera/hive_job_log_df686526-e57d-49c3-a7e4-7325a6ce2c17_1322001493.txt

FAILED: ParseException line 7:0 missing EOF at 'PARTITIONED' near 'employees_index_table'


--NOTE: Bitmap indexes are commonly used for columns with few distinct values

--> WITH DEFERRED REBUILD, the new index starts empty; at any time, the index can be built for the first time or rebuilt using the ALTER INDEX statement:
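
A hedged sketch of the deferred-rebuild flow (the index name and indexed column are assumptions):

hive> CREATE INDEX employees_index
    > ON TABLE employees (name)
    > AS 'BITMAP'
    > WITH DEFERRED REBUILD;
hive> ALTER INDEX employees_index ON employees REBUILD;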

 

--> If the PARTITION clause is omitted, the index is rebuilt on all partitions

--> show formatted index on employees;

--> drop index if exists employee_index on table employees;

--> AVOID MULTIPLE PASSES OVER THE SAME DATA:

--> Inefficient: one full scan of the source table per INSERT

--> Efficient: a single scan feeding a multi-table INSERT (see the sketch below)
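
A hedged sketch of the pattern (the history, sales, and credits tables and the action column are hypothetical):

INSERT OVERWRITE TABLE sales SELECT * FROM history WHERE action = 'purchased';
INSERT OVERWRITE TABLE credits SELECT * FROM history WHERE action = 'returned';

FROM history
INSERT OVERWRITE TABLE sales SELECT * WHERE action = 'purchased'
INSERT OVERWRITE TABLE credits SELECT * WHERE action = 'returned';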

 

--INDEXES:

--> Hive has limited indexing capabilities

--> There are no keys, but you can build indexes on columns to speed up some operations

--> Index data for a table is stored in another table.

 

--EXPLAIN:

 ABSTRACT SYNTAX TREE:

 

STAGE DEPENDENCIES:

Stage-1 is a root stage

Stage-0 is a root stage

 

STAGE PLANS:

 

--NOTE: You can also use:

explain EXTENDED select symbol, count(symbol) from stocks group by symbol;

-->TUNING

  • Use EXPLAIN [EXTENDED]
  • Tune the LIMIT clause (use set; see the sketch after this list)
  • Tune JOIN clauses (use a map-side join when one dataset is small)
  • Use Local Mode on small datasets (use set)
  • Parallel Execution (execute independent job stages in parallel) (use set)
  • Strict Mode (use set hive.mapred.mode=strict). Enforces the following:
    • enforces a WHERE clause on partition columns when querying partitioned tables
    • enforces a LIMIT clause with ORDER BY
    • enforces an ON clause for joins (blocks unconstrained cartesian products)
  • Tuning the number of reducers:
    • set hive.exec.reducers.bytes.per.reducer=750000000
    • set hive.exec.reducers.max (max reducer count)
    • set mapred.reduce.tasks (reducer count)
  • JVM Reuse (use set mapred.job.reuse.jvm.num.tasks=10)
  • Use Indexes
  • Tune dynamic partitioning:
    • set hive.exec.dynamic.partition.mode=strict (at least one static partition is needed)
    • set hive.exec.max.dynamic.partitions
    • set hive.exec.max.dynamic.partitions.pernode
  • Speculative Execution (Hadoop launches multiple instances of the same map/reduce task):
    • set mapred.map.tasks.speculative.execution=true;
    • set mapred.reduce.tasks.speculative.execution=true;
  • Use a single MapReduce job to combine multiple GROUP BYs in a query (a common group-by key is required) (use set)
  • Use Virtual Columns (see example below)
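
Where the list says "use set", these are the usual knobs (a hedged sketch; property names are worth verifying against your Hive version):

hive> SET hive.limit.optimize.enable=true;   -- LIMIT tuning
hive> SET hive.exec.mode.local.auto=true;    -- local mode on small datasets
hive> SET hive.exec.parallel=true;           -- run independent stages in parallel
hive> SET hive.multigroupby.singlemr=true;   -- combine multiple GROUP BYs into one MR job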

 

 

 

NOTE: INPUT__FILE__NAME & BLOCK__OFFSET__INSIDE__FILE need to be in upper case; the separator is two underscores (__)
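
A sketch (the hive_text table and its line column are hypothetical):

hive> SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, line
    > FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;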

 SCRIPTS:

[cloudera@localhost ~]$ cat scripts/create_table.hql

CREATE TABLE IF NOT EXISTS mydb.employees (

name STRING COMMENT 'Employee name',

salary FLOAT COMMENT 'Employee salary',

subordinates ARRAY<STRING> COMMENT 'Names of subordinates',

deductions MAP<STRING, FLOAT> COMMENT 'Keys are deductions names, values are percentages',

address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT> COMMENT 'Home address')

COMMENT 'Description of the table'

LOCATION '/user/hive/warehouse/'

TBLPROPERTIES ('creator'='me', 'created_at'='2015-02-19 9:55:00');

 

[cloudera@localhost ~]$ hive -f 'scripts/create_table.hql'

OTHERS:

--CHANGING HIVE PROMPT:
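
One way, via the hive.cli.print.current.db property (shows the current database in the prompt):

hive> set hive.cli.print.current.db=true;
hive (default)> use mydb;
hive (mydb)>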

 

-PRINT COLUMN HEADERS:
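
The relevant property:

hive> set hive.cli.print.header=true;
hive> SELECT * FROM employees LIMIT 2;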

 

--HIVE ONE-SHOT COMMANDS:

[cloudera@localhost ~]$ hive -e 'select * from records LIMIT 5';

[cloudera@localhost ~]$ hive -e 'select * from mydb.stocks limit 5;'

 

--EXECUTE UNIX COMMANDS:
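
From inside the CLI, prefix the shell command with ! and terminate it with a semicolon:

hive> ! pwd;
hive> ! ls /home/cloudera;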

 

--INSTALLATION LOCATION:

[cloudera@localhost ~]$ ls /usr/lib

 

--HIVE INSTALLATION LOCATION

[cloudera@localhost ~]$ ls /usr/lib/hive/lib

[cloudera@localhost ~]$ ls -l /usr/lib/hive/bin

total 28

-rwxr-xr-x. 1 root root  881 May 27  2013 beeline

drwxr-xr-x. 3 root root 4096 Jun 18  2013 ext

-rwxr-xr-x. 1 root root 5781 May 27  2013 hive

-rwxr-xr-x. 1 root root 1900 May 27  2013 hive-config.sh

-rwxr-xr-x. 1 root root  885 May 27  2013 hiveserver2

-rwxr-xr-x. 1 root root  832 May 27  2013 metatool

[cloudera@localhost ~]$

 

--HIVE CONFIGURATION:

[cloudera@localhost ~]$ ls /usr/lib/hive/conf

 

--HIVE LOG:

[NOTE: cat /tmp/cloudera/hive.log displays the log contents]

 

--HIVE LOGGING:

To change the logging level, edit the hive-log4j.properties file bundled in the hive-common jar:

/usr/lib/hive/lib/hive-common-0.10.0-cdh4.3.0.jar!/hive-log4j.properties

[cloudera@localhost ~]$ hive -help

usage: hive

-d,--define <key=value>             Variable substitution to apply to hive

commands. e.g. -d A=B or --define A=B

--database <databasename>        Specify the database to use

-e <quoted-query-string>            SQL from command line

-f <filename>                              SQL from files

-H,--help                                     Print help information

-h <hostname>                            connecting to Hive Server on remote host

--hiveconf <property=value>      Use value for given property

--hivevar <key=value>               Variable substitution to apply to hive

commands. e.g. --hivevar A=B

-i <filename>                            Initialization SQL file

-p <port>                                 connecting to Hive Server on port number

-S,--silent                                Silent mode in interactive shell

-v,--verbose                            Verbose mode (echo executed SQL to the

console)

 

–HIVE SERVICES:

[cloudera@localhost ~]$ hive --service help

 

[NOTE:   cli : command line interface (default)

hiveserver: runs hive as a server (exposes a thrift service)

hwi: hive web interface

jar: similar to hadoop jar

metastore: runs metastore as a standalone process]

 

[NOTE: Thrift client: lets you run hive commands from different programming languages]

 

[cloudera@localhost ~]$ hive --help

Usage ./hive <parameters> --service serviceName <service parameters>

Service List: beeline cli help hiveserver2 hiveserver hwi jar lineage metastore metatool rcfilecat

Parameters parsed:

--auxpath : Auxiliary jars

--config : Hive configuration directory

--service : Starts specific service/component. cli is default

Parameters used:

HADOOP_HOME or HADOOP_PREFIX : Hadoop install directory

HIVE_OPT : Hive options

For help on a particular service:

./hive --service serviceName --help

Debug help:  ./hive --debug --help

 

--CLI SERVICE

[cloudera@localhost ~]$ hive --help --service cli

usage: hive

-d,--define <key=value>                   Variable substitution to apply to hive commands. e.g. -d A=B or --define A=B

--database <databasename>              Specify the database to use

-e <quoted-query-string>                  SQL from command line

-f <filename>                                     SQL from files

-H,--help                                            Print help information

-h <hostname>                                   connecting to Hive Server on remote host

--hiveconf <property=value>             Use value for given property

--hivevar <key=value>                        Variable substitution to apply to hive commands. e.g. --hivevar A=B

-i <filename>                                       Initialization SQL file

-p <port>                                             connecting to Hive Server on port number

-S,--silent                                             Silent mode in interactive shell

-v,--verbose                                         Verbose mode (echo executed SQL to the console)

 

--HIVE NAMESPACES:

hivevar

hiveconf

system

env
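
A sketch of variable substitution across namespaces (the tbl variable is made up):

[cloudera@localhost ~]$ hive --hivevar tbl=records
hive> set hivevar:tbl;
hivevar:tbl=records
hive> SELECT * FROM ${hivevar:tbl} LIMIT 5;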

 

--SELECT in NON-INTERACTIVE MODE:

[cloudera@localhost ~]$ hive -e 'select * from records';

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.3.0.jar!/hive-log4j.properties

[cloudera@localhost ~]$ hive -S -e "set" | grep warehouse

hive.metastore.warehouse.dir=/user/hive/warehouse

hive.warehouse.subdir.inherit.perms=true

 

--HIVE FILES:

[cloudera@localhost ~]$ cat scripts/select.q

select * from records;

 

--CHANGE TO ROOT USER

[cloudera@localhost ~]$ su root

Password:                                                            --> cloudera

[root@localhost cloudera]#

 

--EXECUTE HADOOP COMMANDS:
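
From inside the CLI, use the dfs command (this reuses the running JVM, unlike shelling out to hadoop fs):

hive> dfs -ls /user/hive/warehouse;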

 

--LIST ALL HADOOP COMMANDS:
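
Running hadoop (or hadoop fs) with no arguments prints the available commands:

[cloudera@localhost ~]$ hadoop
[cloudera@localhost ~]$ hadoop fs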

 

--COMMENTS IN HIVE:

-- this is a comment

 

WORK:

hadoop1@ubuntu:~/scripts$ cat nyse_daily.hql

 

hadoop1@ubuntu:~/scripts$ hive -f 'nyse_daily.hql'

 

FAILED: SemanticException Line 1:17 Invalid path ''/home/hadoop1/data/NYSE_daily'': No files matching path hdfs://localhost:9000/home/hadoop1/data/NYSE_daily

hadoop1@ubuntu:~/data$ hadoop fs -put NYSE* data

 

 

 

 

