In our previous post we discussed Hadoop job optimization, i.e. Hadoop performance tuning for MapReduce jobs. In this post we will briefly discuss a few points on how to optimize Hive queries, i.e. Hive performance tuning.
If Hive is not tuned properly, even select queries on small tables can sometimes take minutes to return results. For this reason Hive is mainly suited to OLAP-style workloads and is not a good fit when instant results are expected. By following the practices below, however, we can improve Hive query performance considerably, in many cases by 50% or more.
Hive Performance Tuning:
Below is the list of practices that we can follow to optimize Hive queries.
1. Enable Compression in Hive
By enabling compression at various phases (i.e. on the final output and on intermediate data), we can improve the performance of Hive queries. For further details on how to enable compression in Hive, refer to the post Compression in Hive.
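As a minimal sketch, compression for intermediate data and for the final output might be switched on from the Hive shell as shown below. The choice of the Snappy codec and of BLOCK compression is an assumption made for illustration, not something this post prescribes, and the `mapred.*` names are the older MapReduce property names, so they may need adjusting for your Hadoop version.

```sql
-- Compress data passed between the stages of a multi-stage query.
SET hive.exec.compress.intermediate=true;
-- Codec and type below are illustrative choices (assumption), not requirements.
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.intermediate.compression.type=BLOCK;

-- Compress the final output written by the query.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
```

These settings apply only to the current session; to make them permanent they would go into the hive-site.xml file instead.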
2. Optimize Joins
We can improve the performance of joins by enabling Auto Convert Map Joins and enabling optimization of skew joins.
Auto Map Joins
Auto Map-Join is a very useful feature when joining a big table with a small table. If we enable this feature, the small table is loaded into the local cache on each node and then joined with the big table in the Map phase. Enabling Auto Map Join provides two advantages. First, loading the small table into cache saves read time on each data node. Second, it avoids skew in the join, since the join is already completed in the Map phase for each block of data and no Reduce phase is needed for it.
To enable the Auto Map-Join feature, we need to set the properties below.
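A hedged sketch of the settings usually involved is given here. The property names are standard Hive configuration properties, but the size thresholds are illustrative values that should be tuned to the memory available on your nodes, and the `orders` and `customers` tables in the sample query are hypothetical.

```sql
-- Let Hive convert a common join into a map join when one side is small.
SET hive.auto.convert.join=true;
-- Convert directly, without a conditional backup task, when sizes are known.
SET hive.auto.convert.join.noconditionaltask=true;
-- Combined size limit (bytes) of the small tables for the direct conversion.
SET hive.auto.convert.join.noconditionaltask.size=10000000;
-- A table below this size (bytes) is treated as the "small" table.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: 'orders' is large, 'customers' is small enough to cache.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;
```

Running the query with EXPLAIN in front of it should show a map join operator in the plan if the conversion took place.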
We can enable optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either via the SET command in the Hive shell or in the hive-site.xml file. Below is the list of properties that can be fine-tuned to better optimize skew joins.
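As a rough sketch, the skew join settings could be applied in a Hive session like this. The property names are standard Hive configuration properties; the numeric values are only illustrative starting points and should be adjusted to the data being joined.

```sql
-- Turn on skew join optimization.
SET hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as a skewed key.
SET hive.skewjoin.key=100000;
-- Upper bound on map tasks used by the follow-up map join for skewed keys.
SET hive.skewjoin.mapjoin.map.tasks=10000;
-- Minimum split size (bytes) for that follow-up map join, which limits
-- how finely the skewed data is divided among map tasks.
SET hive.skewjoin.mapjoin.min.split=33554432;
```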