In this post, we discuss processing various types of log files in Hive. Processing logs in Hive is similar to what we covered in the Processing Logs in Pig post, so we use the same sample log files for the examples here and do not go into the details of the log file formats. For a quick refresher, refer to our previous posts on Log Analysis in Hadoop and Parsing Log Files in Pig.
We will process the following three types of log file formats in Hive in this post.
- Common Log File Format
- Combined Log File Format
- Custom Log File Format
Similar to Pig's custom Load/Store functions for log files, Hive provides its own SerDes (Serializer/Deserializer) for processing custom file formats. In particular, Hive provides RegexSerDe, which suits these log files because every line follows a fixed pattern that a regular expression can describe.
Note that RegexSerDe is available in two packages (jar files) in the Hive distribution: org.apache.hadoop.hive.serde2.RegexSerDe in hive-serde-<version>.jar and org.apache.hadoop.hive.contrib.serde2.RegexSerDe in hive-contrib-<version>.jar. Either class can be used in our examples, but when using the contrib library we need to make sure its jar is added to the Hive session. Please refer to the Caution for more details on this.
Common Log Format Files
Sample common log file for testing: common_access_log
This is Apache's Common Log Format for web server logs. Each entry contains a total of 7 fields, and a regular expression with one capture group per field can parse it.
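As a rough illustration of the 7-field layout, the sketch below applies one possible pattern to a single log line in Python. The pattern, the field names, and the sample line are assumptions made for illustration; they are not necessarily the expression or the data this post originally used.

```python
import re

# Assumed pattern: one capture group per Common Log Format field.
# This is an illustrative sketch, not necessarily the post's original regex.
COMMON_LOG_REGEX = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)$'
)

# Hypothetical column names for the 7 fields.
COMMON_FIELDS = ["host", "identity", "user", "time", "request", "status", "size"]


def parse_common(line):
    """Return a dict of the 7 fields, or None if the line does not match."""
    match = COMMON_LOG_REGEX.match(line)
    return dict(zip(COMMON_FIELDS, match.groups())) if match else None


# Illustrative sample line, not taken from the post's sample file.
common_sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
common_record = parse_common(common_sample)
print(common_record)
```

If a pattern like this is placed in RegexSerDe's input.regex property, each capture group maps to one table column, and the backslashes generally need to be doubled inside the HiveQL string literal.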
Example Use Case of Common Log File Parsing in Hive
In the HiveQL below, we use the RegexSerDe class as the SerDe to process the common log file with the help of the above regular expression.
To execute this script, download the sample common log file above to your home directory, save the HiveQL into a file named common_log_parser.hql, and run that file from the terminal with the command below.
Below is the output of the above script execution.
Combined Log Format Files
Combined log files are an extension of Apache's Common Log Format: they contain referrer and user agent fields in addition to the 7 fields of the common log format, for a total of 9 fields.
Sample combined log file for testing: combined_access_log
The combined log file format can likewise be matched with a single regular expression, this time with 9 capture groups.
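To make the 9-field layout concrete, here is a small Python sketch that extends a Common Log Format pattern with two quoted groups for the referrer and user agent. As before, the pattern, field names, and sample line are assumptions for illustration, not necessarily what this post originally used.

```python
import re

# Assumed pattern: the 7 common fields plus quoted referrer and user agent.
# This is an illustrative sketch, not necessarily the post's original regex.
COMBINED_LOG_REGEX = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) '
    r'"([^"]*)" "([^"]*)"$'
)

# Hypothetical column names for the 9 fields.
COMBINED_FIELDS = [
    "host", "identity", "user", "time", "request",
    "status", "size", "referrer", "agent",
]


def parse_combined(line):
    """Return a dict of the 9 fields, or None if the line does not match."""
    match = COMBINED_LOG_REGEX.match(line)
    return dict(zip(COMBINED_FIELDS, match.groups())) if match else None


# Illustrative sample line, not taken from the post's sample file.
combined_sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
combined_record = parse_combined(combined_sample)
print(combined_record["referrer"], "|", combined_record["agent"])
```

Because this pattern requires both trailing quoted fields, a plain common log line will not match it; making those two groups optional is one way to accept both formats with the same expression.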
Example Use Case of Combined Log File Parsing in Hive
With the help of the above regular expression in the RegexSerDe class, we can parse these log files. Below is an example HiveQL script that parses the combined log file attached above.