In our previous post we have discussed about partitioning in Hive, now we will focus on Bucketing In Hive, which is another way of giving more fine grained structure to Hive tables.
Bucketing in Hive
Usually Partitioning in Hive offers a way of segregating hive table data into multiple files/directories. But partitioning gives effective results when,
- There are limited number of partitions
- Comparatively equal sized partitions
But this may not possible in all scenarios, like when are partitioning our tables based geographic locations like country, some bigger countries will have large partitions (ex: 4-5 countries itself contributing 70-80% of total data) where as small countries data will create small partitions (remaining all countries in the world may contribute to just 20-30 % of total data). So, In these cases Partitioning will not be ideal.
To overcome the problem of over partitioning, Hive provides Bucketing concept, another technique for decomposing table data sets into more manageable parts.
- Bucketing concept is based on (hashing function on the bucketed column) mod (by total number of buckets). The hash_function depends on the type of the bucketing column.
- Records with the same bucketed column will always be stored in the same bucket.
- We use CLUSTERED BY clause to divide the table into buckets.
- Physically, each bucket is just a file in the table directory, and Bucket numbering is 1-based.
- Bucketing can be done along with Partitioning on Hive tables and even without partitioning.
- Bucketed tables will create almost equally distributed data file parts.
- Bucketed tables offer efficient sampling than by non-bucketed tables. With sampling, we can try out queries on a fraction of data for testing and debugging purpose when the original data sets are very huge.
- As the data files are equal sized parts, map-side joins will be faster on bucketed tables than non-bucketed tables. In Map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
- Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.
- Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. This makes map-side joins even more efficient, since the join of each bucket becomes an efficient merge-sort.
- Specifying bucketing doesn’t ensure that the table is properly populated. Data Loading into buckets needs to be handled by our-self.
Example Use Case
Lets explore the remaining features of Bucketing in Hive with an example Use case, by creating buckets for sample user records provided in the previous post on partitioning –> UserRecords
Let us create the table partitioned by country and bucketed by state and sorted in ascending order of cities.
Creation of Bucketed Tables
We can create bucketed tables with the help of CLUSTERED BY clause and optional SORTED BY clause in CREATE TABLE statement. With the help of the below HiveQL we can create bucketed_user table with above given requirement.
Unlike partitioned columns (which are not included in table columns definition) , Bucketed columns are included in table definition as shown in above code for state and city columns.
Inserting data Into Bucketed Tables
Similar to partitioned tables, we can not directly load bucketed tables with LOAD DATA (LOCAL) INPATH command, rather we need to use INSERT OVERWRITE TABLE … SELECT …FROM clause from another table to populate the bucketed tables. For this, we will create one temporary table in hive with all the columns in input file from that table we will copy into our target bucketed table.
Lets assume we have created temp_user temporary table, and below is the HiveQL for populating bucketed table with temp_user table.
To populate the bucketed table, we need to set the property hive.enforce.bucketing = true, so that Hive knows to create the number of buckets declared in the table definition.
- The property hive.enforce.bucketing = true similar to hive.exec.dynamic.partition=true property in partitioning. By Setting this property we will enable dynamic bucketing while loading data into hive table.
- It will automatically sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (for example 32 in our case) and automatically selects the clustered by column from table definition.
- If we do not set this property in Hive Session, we have to manually convey same information to Hive that, number of reduce tasks to be run (for example in our case, by using set mapred.reduce.tasks=32) and CLUSTER BY (state) and SORT BY (city) clause in the above INSERT …SELECT statement at the end.
Lets see the execution of these scripts and their output in the next page.
Solution For Example Use Case
Below is the combined HiveQL along with script required for temporary hive table creation. Lets save this HiveQL into bucketed_user_creation.hql and save the input file provided in example use case section into user_table.txt file in home directory.
Lets execute this script in hive and below is the output of the above script execution.