Impala Best Practices


Below are some common Impala performance tuning practices:

Pre-execution Checklist

  •    Data types
  •    Partitioning
  •    File Format

Data Type Choices

  •      Define integer columns as INT/BIGINT rather than STRING
  •      Operations on INT/BIGINT are more efficient than on STRING
  •      Convert “external” data to appropriate “internal” types on load
  •      e.g. CAST date strings to TIMESTAMP
  •      This avoids expensive CASTs in queries later
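The load-time conversion above can be sketched as follows; the table and column names are illustrative, not from the original post:

```sql
-- Convert external string dates to TIMESTAMP once, at load time,
-- so later queries avoid repeated CASTs (names are hypothetical)
INSERT INTO events_internal
SELECT id, CAST(event_date_str AS TIMESTAMP) AS event_ts
FROM events_staging;
```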

Partitioning

  • The fastest I/O is the one that never takes place.
  • Understand your query filter predicates
  • For time-series data, this is usually the date/timestamp column
  • Use this column (or columns) as the partition key(s)
  • Validate that queries leverage partition pruning using EXPLAIN
  • You can have too much of a good thing
  • A few thousand partitions per table is probably OK
  • Tens of thousands of partitions is probably too many
  • Partitions/files should be no smaller than a few hundred MB
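A minimal sketch of date partitioning plus a pruning check, assuming a hypothetical `logs` table:

```sql
-- Partition time-series data by date so filter predicates can prune
-- partitions at scan time (table and column names are illustrative)
CREATE TABLE logs (msg STRING)
PARTITIONED BY (log_date STRING)
STORED AS PARQUET;

-- Verify the date predicate actually prunes: the EXPLAIN output should
-- show only the matching partition being scanned
EXPLAIN SELECT COUNT(*) FROM logs WHERE log_date = '2014-01-01';
```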

Use Parquet Columnar Format for HDFS

  • Well defined open format – http://parquet.io
  • Works in Impala, Pig, Hive & Map/Reduce
  • I/O reduction by only reading necessary columns
  • Columnar layout compresses/encodes better
  • Supports nested data by shredding columns
  • Uses techniques similar to Google’s ColumnIO
  • Impala loads use Snappy compression by default
  • Gzip available: set PARQUET_COMPRESSION_CODEC=gzip;
  • Quick word on Snappy vs. Gzip
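Putting the two options above together; the table names here are hypothetical:

```sql
-- Write a table in Parquet using gzip instead of the default Snappy
-- (table names are illustrative)
SET PARQUET_COMPRESSION_CODEC=gzip;
CREATE TABLE sales_parquet STORED AS PARQUET
AS SELECT * FROM sales_text;
```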


Quick Note on Compression

Snappy

  • Faster compression/decompression speeds
  • Fewer CPU cycles
  • Lower compression ratio

Gzip/Zlib

  • Slower compression/decompression speeds
  • More CPU cycles
  • Higher compression ratio
  • It’s all about trade-offs

Left-Deep Join Tree

  • The largest* table should be listed first in the FROM clause
  • Joins are performed in the order tables are listed in the FROM clause
  • Filter early – place the most selective joins/tables first
  • Impala 1.2.1 will do JOIN ordering automatically
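Before automatic join ordering, the FROM-clause order encoded this rule directly; a sketch with hypothetical table names:

```sql
-- Largest (fact) table first, then the most selective dimension,
-- then the rest (all names are illustrative)
SELECT f.id, d1.name
FROM big_fact f
JOIN selective_dim d1 ON f.d1_id = d1.id
JOIN other_dim d2 ON f.d2_id = d2.id;
```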

Types of Hash Joins

Broadcast

  • Default hash join type is BROADCAST (aka replicated)
  • Each node ends up with a copy of the right-side table(s)*
  • Left side is read locally and streamed through local hash join(s)
  • Best choice for “star join”, single large fact table, multiple small dims

Shuffle

  • Alternate hash join type is SHUFFLE (aka partitioned)
  • Right side hashed and shuffled; each node gets ~1/Nth the data
  • Left side hashed and shuffled, then streamed through join
  • Best choice for “large_table JOIN large_table”
  • Only available if ANALYZE was used to gather table/column stats*

How to use ANALYZE

  • Table stats (from the Hive shell):
    ANALYZE TABLE T1 [PARTITION(partition_key)] COMPUTE STATISTICS;
  • Column stats (from the Hive shell):
    ANALYZE TABLE T1 [PARTITION(partition_key)] COMPUTE STATISTICS FOR COLUMNS c1, c2, …;
  • Impala 1.2.1 will have a built-in ANALYZE command
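For reference, the built-in statistics statement Impala later shipped is COMPUTE STATS, which gathers table and column stats in one step; the table name below is hypothetical:

```sql
-- Gather table and column statistics from within Impala itself
-- (table name is illustrative)
COMPUTE STATS sales;

-- Inspect the collected statistics
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;
```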

Hinting Joins
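Impala accepts join hints in square brackets between the JOIN keyword and the right-side table, overriding the planner's choice; the table and column names below are illustrative:

```sql
-- Force a partitioned (shuffle) join between two large tables
-- (names are hypothetical)
SELECT t1.id, t2.val
FROM big_a t1 JOIN [SHUFFLE] big_b t2
  ON t1.id = t2.id;

-- Force a broadcast (replicated) join for a small dimension table
SELECT f.id, d.name
FROM fact f JOIN [BROADCAST] dim d
  ON f.dim_id = d.id;
```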

Determining Join Type From EXPLAIN

(Figure: sample EXPLAIN output showing the join type chosen for each join)

Memory Requirements for Joins & Aggregates

  • Impala does not “spill” to disk; query pipelines are in-memory
  • Operators’ memory usage needs to fit within the memory limit
  • This is not the same as “all data needs to fit in memory”
  • Buffered data is generally significantly smaller than the total data accessed
  • An aggregation’s memory usage is proportional to the number of groups
  • Limits apply to each in-flight query, and concurrent queries’ usage is summed
  • A minimum of 128 GB of RAM is recommended for Impala nodes
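The per-query memory cap can be controlled with the MEM_LIMIT query option in impala-shell; the value below is illustrative:

```sql
-- Cap this session's per-node memory usage (value is illustrative)
SET MEM_LIMIT=10g;

-- Setting it back to 0 removes the session limit
SET MEM_LIMIT=0;
```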

 


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
