Hive Performance Tuning

In our previous post we have discussed about hadoop job optimization or Hadoop Performance Tuning for Mapreduce jobs. In this post we will briefly discuss a few points on how to optimize hive queries/ Hive Performance tuning.

If we do not fine tune Hive properly, then even for select queries on smaller tables in Hive, some times it may take minutes to emit results. So, because of this reason Hive is mainly limited to OLAP features only. When instant results expected then Hive is not suitable. But by following below practices we can improve the Hive query performances at least by 50 %.

Hive Performance Tuning:

Below are the list of practices that we can follow to optimize Hive Queries.

1. Enable Compression in Hive

By enabling compression at various phases (i.e. on final output, intermediate data), we achieve the performance improvement in Hive Queries. For further details on how to enable compression Hive refer the post Compression in Hive.

2. Optimize Joins

We can improve the performance of joins by enabling Auto Convert Map Joins and enabling optimization of skew joins.

Auto Map Joins

Auto Map-Join is a very useful feature when joining a big table with a small table. if we enable this feature, the small table will be saved in the local cache on each node, and then joined with the big table in the Map phase. Enabling Auto Map Join provides two advantages. First, loading a small table into cache will save read time on each data node. Second, it avoids skew joins in the Hive query, since the join operation has been already done in the Map phase for each block of data.

To enable the Auto Map-Join feature, we need to set below properties.

Skew Joins

We can enable optimization of skew joins, i.e. imbalanced joins by setting hive.optimize.skewjoin property to true either via SET command in hive shell or hive-site.xml file. Below are the list of properties that can be fine tuned to better optimize the skew joins.

Enable Bucketed Map Joins

If tables are bucketed by a particular column and these tables are being used in joins then we can enable bucketed map join to improve the performance. To do this, we can set below properties in hive-site.xml or hive shell.

3. Avoid Global Sorting in Hive

Global Sorting in Hive can be achieved in Hive with ORDER BY clause but this comes with a drawback. ORDER BY produces a result by setting the number of reducers to one, making it very inefficient for large datasets. 

When a globally sorted result is not required, then we can use SORT BY clause. SORT BY produces a sorted file per reducer. 

If we need to control which reducer a particular row goes to, we can use DISTRIBUTE BY clause, for example, 

Each dept will be processed separately by a reducer and records will be sorted by id and name fields within each dept separately.

4. Enable Tez Execution Engine

Instead of running Hive queries on venerable Map-reduce engine, we can improve the performance of hive queries at least by 100% to 300 % by running on Tez execution engine. We can enable the Tez engine with below property from hive shell.

5. Optimize LIMIT operator

By default LIMIT operator still executes the entire query, then only returns a limited results. Because this behavior is generally wasteful, it can be avoided by setting below properties.

6. Enable Parallel Execution

Hive converts a query into one or more stages. Stages could be a MapReduce stage, sampling stage, a merge stage, a limit stage. By default, Hive executes these stages one at a time. A particular job may consist of some stages that are not dependent on each other and could be executed in

parallel, possibly allowing the overall job to complete more quickly. Parallel execution can be enabled by setting below properties.

7. Enable Mapreduce Strict Mode

we can enable mapreduce strict mode by setting below property to strict.

8. Single Reduce for Multi Group BY

By enabling single reducer task for multi group by operations, we can combine multiple GROUP BY operations in a query into a single MapReduce job.

9. Controls Parallel Reduce Tasks

We can control the number of parallel reduce tasks that can be run for a given hive query with below properties.

we can also set the parallel reduce tasks to a fixed value with below property.

10. Enable Vectorization

Vectorization feature is introduced into hive for the first time in hive-0.13.1 release only. By vectorized query execution, we can improve performance of operations like scans, aggregations, filters and joins, by performing them in batches of 1024 rows at once instead of single row each time.

We can enable vectorized query execution by setting below three properties in either hive shell or hive-site.xml file.

11. Enable Cost Based Optimization

Recent Hive releases provided the feature of cost based optimization, one can achieve further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, degree of parallelism and others.

cost based optimization can be enabled by setting below properties in hive-site.xml file.

And we can gather basic statistics about all columns in an employee table with below command in hive shell.

12. Use ORC File Format

Using ORC (Optimized Record Columnar) file format we can improve the performance of Hive Queries very effectively. Below picture on file format best depicts the power of ORC file file over other formats.

ORCFile

Create Tables with ORC File Format

We can create new hive table with ORC file format with just by adding STORED AS ORC clause to CREATE TABLE command in hive. Optionally we can provide compression techniques in TBLPROPERTIES clause.

Convert Existing Tables to ORC

Create a table with the same schema as the source table and STORED AS ORC, then we can submit below command to copy data from regular old table new ORC formatted table.

Key Default Notes
orc.compress ZLIB Compression to use in addition to columnar compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size 262,144 (= 256 KiB) Number of bytes in each compression chunk
orc.stripe.size 268,435,456 (= 256 MiB) Number of bytes in each stripe
orc.row.index.stride 10,000 Number of rows between index entries (must be >= 1,000)
orc.create.index true Whether to create inline indexes

Example ORC table creation:

Thus by using these 12 techniques we can improve the performance of Hive Queries.

Leave a Reply

Your email address will not be published. Required fields are marked *

  1. Hi,

    I have .hql file with a set of simple select queries in it. The .hql file size is 7,185  bytes (no.of queries = 20 ) and its taking nearly 1 hour for completion.

    Sample query: select count distinct <field> from <table>

    In few queries I also use join ,group by , order by etc

    I would like to know if I can reduce the file execution time.

    I tried implementing the below properties,

    set hive.auto.convert.join=true;
    set hive.auto.convert.sortmerge.join=true;
    set hive.auto.convert.sortmerge.join.noconditionaltask=false;
    set mapred.compress.map.output=true;
    set mapred.output.compress=true;
    set hive.exec.parallel=true;
    set hive.vectorized.execution.enabled = true;
    set hive.vectorized.execution.reduce.enabled = true;

    But still I can’t find any big change in execution time. Could you please help?

     

    • Hi Ann,

      For : SELECT countDISTINCT FROM tab;
      Try changing this to
      SELECT count(1) FROM ( SELECT DISTINCT field FROM tab) table;

      This helps increasing underlying reducers instead of 1 and gain some through put.

      Check if cardinal columns are there change their positions in GROUP BY clause if any..

  2. The database is in SQL Server, the client uses ODBC connector (HortonWorks Hive 2.1) sends INSERT statements sequentially to the HiveServer2 server (I note that every time a statement is sent it opens and closes the connection), this generates a Really bad performance about 20 rows in 30 seconds.
    Optimize as much as possible the configuration of the HiveServer, hive-site.xml, map-redsite.xml and increase the memory of hiveserver2, however the performance does not improve.
    It is true that Hive is Batch-oriented, however I find that this can be improved, I am very grateful for the help with this problem

    Additionally any other architecture or tool to improve this process will be highly appreciated