HCatalog and Pig Integration
In short, HCatalog opens up the Hive metastore to other MapReduce tools. Each tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). With a table-based abstraction, tools that support HCatalog do not need to care about where the data lives, what format it is in, or which storage layer (HBase or HDFS) holds it.
If WebHCat is configured alongside HCatalog, we also get the ability to submit jobs in a RESTful way. In this post we will look at HCatalog and Pig integration: loading and storing Hive tables from Pig via HCatalog.
The HCatLoader and HCatStorer interfaces are used with Pig scripts to read and write data in HCatalog-managed tables. No HCatalog-specific setup is required for these interfaces.
Note: HCatalog is not thread safe.
Load data from Hive to Pig:
Using HCatLoader, we can load data from a Hive table into Pig. The basic syntax is shown below.
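A minimal sketch of loading a Hive table with HCatLoader (the table name `default.employee` is an assumption; the loader class shown is the one shipped with Hive 0.14):

```pig
-- assumes a Hive table default.employee already exists in the metastore
A = LOAD 'default.employee' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- the schema comes from the Hive metastore, so no AS clause is needed
DESCRIBE A;
DUMP A;
```

Note that HCatLoader picks up the table's schema from the metastore, so the relation is already typed without an `AS` clause.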
The data types supported by HCatalog with Pig, as of Hive 0.14, are listed below.
Pig does not automatically pick up the HCatalog jars. To bring in the necessary jars, start the Pig session with the option below.
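The simplest way is Pig's built-in flag, which adds the HCatalog jars and configuration to the session (this assumes `HIVE_HOME`/`HCAT_HOME` are set in your environment so the `pig` wrapper can find them):

```shell
pig -useHCatalog
```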
Alternatively, we can pass the required jars explicitly on the command line, as shown below.
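A sketch of passing the jars manually via the `pig.additional.jars` property. The install paths and jar version numbers below are assumptions; adjust them to match your Hive/HCatalog installation:

```shell
# assumed install locations -- change to match your cluster
export HIVE_HOME=/usr/lib/hive
export HCAT_HOME=/usr/lib/hive-hcatalog

pig -Dpig.additional.jars=\
$HCAT_HOME/share/hcatalog/hive-hcatalog-core.jar:\
$HIVE_HOME/lib/hive-metastore.jar:\
$HIVE_HOME/lib/hive-exec.jar:\
$HIVE_HOME/lib/libthrift-0.9.2.jar:\
$HIVE_HOME/lib/libfb303-0.9.2.jar script.pig
```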
HCatStorer is used with Pig scripts to write data to HCatalog/Hive-managed tables.
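A minimal sketch of storing a Pig relation into a Hive table with HCatStorer (the table names are assumptions; the target table must already exist in the metastore, since HCatStorer does not create tables):

```pig
-- read from one Hive table and write into another (both assumed to exist)
A = LOAD 'default.employee' USING org.apache.hive.hcatalog.pig.HCatLoader();
STORE A INTO 'default.employee_copy' USING org.apache.hive.hcatalog.pig.HCatStorer();
```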
We can also store directly into a partitioned table, as shown below.
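For a static partition, HCatStorer takes the partition key/value pairs as a string argument. A sketch, assuming a table `default.sales` partitioned by a column `ds`:

```pig
A = LOAD 'default.employee' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- write every record into the single partition ds=20150101
STORE A INTO 'default.sales' USING org.apache.hive.hcatalog.pig.HCatStorer('ds=20150101');
```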
To achieve dynamic partitioning, that is, to write into multiple partitions at once, make sure the partition column is present in the data, then call HCatStorer with no arguments, as shown below.
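A sketch of dynamic partitioning, again assuming a table `default.sales` partitioned by `ds`. The partition column `ds` must be a field of the relation being stored; HCatStorer routes each record to the partition matching its `ds` value:

```pig
-- relation A is assumed to contain the partition column ds as a regular field
A = LOAD 'default.staging_sales' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- no argument: partitions are derived from the ds values in the data
STORE A INTO 'default.sales' USING org.apache.hive.hcatalog.pig.HCatStorer();
```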