Bucketing In Hive

In our previous post we have discussed about partitioning in Hive, now we will focus on Bucketing In Hive, which is another way of giving more fine grained structure to Hive tables.

Bucketing in Hive

Usually Partitioning in Hive offers a way of segregating hive table data into multiple files/directories. But partitioning gives effective results when,

  • There are limited number of partitions
  • Comparatively equal sized partitions

But this may not possible in all scenarios, like when are partitioning our tables based geographic locations like country, some bigger countries will have large partitions (ex: 4-5 countries itself contributing 70-80% of total data) where as small countries data will create small partitions (remaining all countries in the world may contribute to just 20-30 % of total data). So, In these cases Partitioning will not be ideal.

To overcome the problem of over partitioning, Hive provides Bucketing concept, another technique for decomposing table data sets into more manageable parts.

Features
  • Bucketing concept is based on (hashing function on the bucketed column) mod (by total number of buckets). The hash_function depends on the type of the bucketing column.
  • Records with the same bucketed column will always be stored in the same bucket.
  • We use CLUSTERED BY clause to divide the table into buckets.
  • Physically, each bucket is just a file in the table directory, and Bucket numbering is 1-based.
  • Bucketing can be done along with Partitioning on Hive tables and even without partitioning.
  • Bucketed tables will create almost equally distributed data file parts.
Advantages
  • Bucketed tables offer efficient sampling than by non-bucketed tables. With sampling, we can try out queries on a fraction of data for testing and debugging purpose when the original data sets are very huge.
  • As the data files are equal sized parts, map-side joins will be faster on bucketed tables than non-bucketed tables. In Map-side join, a mapper processing a bucket of the left table knows that the matching rows in the right table will be in its corresponding bucket, so it only retrieves that bucket (which is a small fraction of all the data stored in the right table).
  • Similar to partitioning, bucketed tables provide faster query responses than non-bucketed tables.
  • Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. This makes map-side joins even more efficient, since the join of each bucket becomes an efficient merge-sort.
Limitations
  • Specifying bucketing doesn’t ensure that the table is properly populated. Data Loading into buckets needs to be handled by our-self.

Example Use Case

Lets explore the remaining features of Bucketing in Hive with an example Use case, by creating buckets for sample user records provided in the previous post on partitioning –> UserRecords

Let us create the table partitioned by country and bucketed by state and sorted in ascending order of cities.

Creation of Bucketed Tables

We can create bucketed tables with the help of CLUSTERED BY clause and optional SORTED BY clause in CREATE TABLE statement. With the help of the below HiveQL we can create bucketed_user table with above given requirement.


Unlike partitioned columns (which are not included in table columns definition) , Bucketed columns are included in table definition as shown in above code for state and city columns.

Inserting data Into Bucketed Tables

Similar to partitioned tables, we can not directly load bucketed tables with LOAD DATA (LOCAL) INPATH command, rather we need to use INSERT OVERWRITE TABLE … SELECT …FROM clause from another table to populate the bucketed tables. For this, we will create one temporary table in hive with all the columns in input file from that table we will copy into our target bucketed table.

Lets assume we have created temp_user temporary table, and below is the HiveQL for populating bucketed table with temp_user table.

To populate the bucketed table, we need to set the property hive.enforce.bucketing = true, so that Hive knows to create the number of buckets declared in the table definition.

Note:

  • The property hive.enforce.bucketing = true similar to hive.exec.dynamic.partition=true property in partitioning. By Setting this property we will enable dynamic bucketing while loading data into hive table.
  • It will automatically sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (for example 32 in our case) and automatically selects the clustered by column from table definition.
  • If we do not set this property in Hive Session, we have to manually convey same information to Hive that, number of reduce tasks to be run (for example in our case, by using set mapred.reduce.tasks=32) and CLUSTER BY (state) and SORT BY (city) clause in the above INSERT …SELECT statement at the end.

Lets see the execution of these scripts and their output in the next page.

Solution For Example Use Case

Below is the combined HiveQL along with script required for temporary hive table creation. Lets save this HiveQL into bucketed_user_creation.hql and save the input file provided in example use case section into user_table.txt file in home directory.

Output

Lets execute this script in hive and below is the output of the above script execution.

From the above box, we can see that mapreduce job initiated 32 reduce tasks for 32 buckets and four partitions are created by country.

Table Sampling in Hive

Table Sampling in hive is nothing but extraction small fraction of data from the original large data sets. It is similar to LIMIT operator in Hive.

But below are the difference between LIMIT and TABLESAMPLE in Hive.

  • In many cases a LIMIT clause executes the entire query, and then only returns a limited results.
  • But Sampling will only select a portion of data to perform query.

Now we will do sampling on these bucketed tables to see the performance difference between bucketed and non-bucketed tables. Lets pull the records present in the last bucket of bucketed_user table created above.

In the above sampling we can see the sample records from various countries and covering many states and cities, But if we use LIMIT operator on non-bucketed tables it will return either all the 129 records from first country CA or last country US but we can’t evenly distributed sample records from all countries and states. This can be seen in the below screen.

We can also perform random sampling with Hive with below syntax.

Percentage Table Sampling

Thus bucketing is most suitable Sampling purpose and also adds some optimization to query performance.

Leave a Reply

Your email address will not be published. Required fields are marked *

  1. Hi Siva,

    Iam very confusing in hive ,

    Bucket Join
    Bucket Map Join
    Sort Merge Bucket Join
    Sort Merge Bucket Map Join   please help me with example.

    thanks advance…..

     

  2. Very nice explanation Shiva.

    But one question here…

    when i gave sorted by for bucketed column ,  the data under that column of bucket is not sorted .
    can you explain?Or am i missing something here.Please let me know.

  3. Hi Siva,
    Thanks for your detailed explanation.

    I created all the tables as shown in your example and loaded the data accordingly. But when i run the below queries i dont get to see any result set? I have applied all the settings ( partition.mode=nonstrict, dynamic.partition=true, no.of reducers = 32). Why i am not getting the result set, am I missing anything?

    hive> SELECT firstname, country, state, city FROM bucketed_user
    > TABLESAMPLE(BUCKET 10 OUT OF 32 ON state);

    Note: When i run the above query using PERCENT, i get the output.

    Thanks in advance.

  4. Hi

    i have gone through your blog for this topic and found a lot of differences in output.

    1. Once you create partition table , column on which partition will come will come in last

    2. once you load the data using the temp table need to put partition column in last , but it is not so in your case
    did you try to do desc tablename after creation of table

    3. using data for dynamic partition , column on which partiton is created must be in last in case of partiton table and same for temp table from where you are loading the data

    4. you have shown output using where cluase , but did you try select * from table and checked if table is coming with all records in proper way?

    5. you might get NULL values which shows that records are not properly feeded for dynamic partition. thats problem in Hive , as it work with schema on read , means data correction and consistency will be checked once you select the records from table and check.

    • Yes You are correct that partitioned columns will come at the end in the descriptions and also insert into statements you should use the select column names and specify the partition column at the end, and select * from temp table will not work if the columns order doesn’t match the expected partitioned tables column order. So that’s why it is always better to specify the columns order in the select statements.

  5. Hi,

    Very nice explanation.. Thank you!

    Can you please enlighten on the following — I have a Hive external table with around 40M rows. It has a STRING column having 5 distinct values. I create a managed table partitioned by that String column and insert all the rows from External table to the Partitioned table. The insertion takes around 4 mins.

    Then I create another managed table with 20 buckets based on another column which is INT type and has all Unique values. Then I again insert all the rows from the External table to the Bucketed table and this time the insertion more than 8 mins.

    Is there an inherent reason that inserts into bucketed tables should take longer than inserts into partitioned table?

    Your expert comments please.

    Thanks

    Gautam

     

     

  6. Hi Siva,

    Nice article.. thank you! Would like to ask you a question here —

    I have a Hive external table with around 40M rows. The table has a STRING column which has 5-6 distinct values. I created a managed table Partitioned by the String column. When I inserted all the rows from the External table into the Partitioned table, it took around 4 mins.

    Next I created another managed table which is clustered by an INT type column and number of buckets set to 20. The Int column has all unique values across the table. When I inserted all the rows from the External table into the Bucketed table, it took more than 8 mins! Even when I create a normal managed table (neither partitioned nor bucketed), it takes 3 mins.

    Is there any inherent reason as to why insertion into bucketed table should take longer compared to insertion into partitioned table?

    Can you please throw some light on this?

    Thanks

    • For Bucketed and Partitioned tables inserts will take relatively longer times but they are intended for faster read times but not for write times

      • Hi Siva,

        Thanks for clarifying. So it is justified why partitioned/bucketed table takes longer time during inserts as compared to normal tables.

        But is there any specific reason why insert to bucketed table should take longer than partitioned table? Apparently the number of buckets is not playing any role here. I tried the same with 5 buckets (in place of 20) but the time did not reduce.

        Just curious to know whether there is some architectural difference in between partitioned & bucketed which is allowing faster insert into partitioned and slower into bucketed.

        Thanks in advance for your help.

        Gautam

  7. hai ,
    your explanation is very good but to practice where is the dataset for patient if u include the data set it would give us more insight when we run the example and see the results.
    thank you,
    swapna.