In this post we will describe about the process of creating custom UDF in Hive. Though there are many generic UDFs (User defined functions) provided by Hive we might need to write our custom UDFs sometime to meet our requirements. In this post, we will discuss about one of the general requirement for the clients, those migrating from any traditional RDBMSs to Hive, they will expect Auto Increment Column in a table to have unique ID in the column which is very easy to write in SQL.
As of Hive-0.13.1 release, there is no support for Auto Increment Column value in Hive. To achieve this feature in Hive, we need to write custom UDF.
Creating Custom UDF in Hive:
As the Hive language is written in Java, The UDFs need to be written in Java. There are three types of UDFs in Hive:
- Regular UDFs (User defined functions)
- UDAFs (User-defined aggregate functions)
- UDTFs (User-defined table-generating functions).
In this post we will concentrate on only Regular UDFs. Any custom UDFs that we are going to write must satisfy the following two properties:
- Must extend class org.apache.hadoop.hive.ql.exec.UDF .
- Must implement at least one evaluate() method.
Hive inspects the UDF to find the evaluate() method that matches the Hive function
that was invoked. Below is the code for our custom UDF for creating Auto Increment Column in Hive.
Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.
Compile this program by adding hive-exec-*.jar file into classpath and build a jar file named AutoIncrementUDF.jar.
To use this UDF in Hive we need to register this Jar file in Hive with ADD JAR command as shown below.
Here we can create either Temporary function or Permanent function depending on our requirement. Temporary function will be available during current session only, once the session is closed temporary functions can’t be accessed. This means that we need to add the JAR file, and define the function at the beginning of each session.
Permanent function can be added for the above with below command.
Note: Changes in hive-site.xml or hive-env.sh file
As an alternative to calling ADD JAR command for each session, we can add the paths of our custom jars to value of hive.aux.jars.path property in hive-site.xml file. Once it is added to hive.aux.jars.path then we do not need to add this jar at each session.
Or we can add the same property into hive-env.sh file with name HIVE_AUX_JARS_PATH under HIVE_CONF_DIR location, as shown below.
We can use these UDF functions only in Select statements but not in Create Statements. But to create table with auto increment column, we will make use of CREATE TABLE AS SELECT command in Hive.
Auto Increment Column In Hive:
Lets create a sample Hive table t1 with below data.
Now register the Jar file and create Temporary function and create similar table with one extra column for Auto Increment value.
So we have successfully added one extra column with automatic increment value with row number by creating Custom UDF in hive.