In the previous post we discussed the basics of log files and the architecture of log analysis in Hadoop. In this post, we go deeper into the details of processing logs in Pig.
As discussed in the previous post, there are three main types of log files:
- Web Server Access Logs
- Web Server Error Logs
- Application Server Logs
All of these log files will be in one of the following three formats:
- Common Log File format
- Combined Log File format
- Custom Log File Format
In the following sections we look at each of these log file formats and at how to process them in Pig.
Common Log Format Files:
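The Common Log File format is also called Apache’s Common Log Format, because it originated with Apache Web Server logs. Each line records the client host, the identd identity, the authenticated user, the request timestamp, the request line, the HTTP status code, and the response size in bytes.
An example log line is shown below: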
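The line below is the standard example from the Apache HTTP Server documentation, shown together with the LogFormat directive that produces it. It is only illustrative and is not taken from the common_access_log sample file.

```
LogFormat "%h %l %u %t \"%r\" %>s %b" common

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```

Here %h is the client host, %l the identd identity, %u the authenticated user, %t the timestamp, %r the request line, %>s the final status code, and %b the response size in bytes.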
Sample Common Log file for testing: common_access_log
We already covered load functions in the previous post on Pig Load Functions, and this section is a good example of a custom load function. Fortunately Piggybank, a repository of user-submitted UDFs, contains a custom loader, CommonLogLoader, for loading Apache’s Common Log Format files into Pig. This Java class extends the RegExLoader class, which is itself a custom load function UDF.
To learn more about writing UDFs for custom load functions, refer to the post.
CommonLogLoader uses the following regular expression to parse Common Log Format files:
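As a rough illustration of what the loader does, the same parse can be sketched in plain Pig Latin with the built-in TextLoader and REGEX_EXTRACT_ALL. The pattern below is my reading of the one in the Piggybank source for CommonLogLoader and should be checked against your Piggybank version; the input path and the field names are assumptions chosen for this sketch.

```pig
-- Read each log line as a single chararray field.
-- Assumption: the sample file has already been copied to /in/ in HDFS.
raw = LOAD '/in/common_access_log' USING TextLoader() AS (line:chararray);

-- Assumed pattern with nine capture groups.
-- Backslashes are doubled because the pattern sits inside a Pig string literal.
-- The bare dots stand in for the [ ] around the timestamp and the quotes around the request.
parsed = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,
      '^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+.(\\S+)\\s+(\\S+)\\s+(\\S+.\\S+).\\s+(\\S+)\\s+(\\S+)$')
  ) AS (addr:chararray, logname:chararray, user:chararray, time:chararray,
        method:chararray, uri:chararray, proto:chararray,
        status:chararray, bytes:chararray);

DESCRIBE parsed;
```

The nine groups correspond, in order, to the host, identity, user, timestamp, request method, URI, protocol, status code, and response size.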
Example Use case of CommonLogLoader:
Let's put the above Apache common log file into the HDFS location /in/, register the piggybank jar file, define a short alias for CommonLogLoader, and use it to parse the common_access_log file.
Starting with the Pig 0.13.0 release, the piggybank.jar file is included in Pig's lib directory, so we do not need to register it each time we start the Pig grunt shell. On earlier releases, we need to register the jar file with the REGISTER statement before using any UDFs from Piggybank.
Let's put the Pig Latin commands in a script file, common_log_process_script.pig, and run it.
Here we display the number of log entries for each address/host name in the log file:
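The steps above can be sketched in the grunt shell roughly as follows. The jar path is hypothetical and depends on your installation, the alias ApacheCommonLogLoader is just a name chosen here, and the field names follow the usage example in the CommonLogLoader documentation.

```pig
-- Assumption: a local copy of common_access_log is in the current directory,
-- and the cluster runs a Hadoop version whose fs shell supports mkdir -p.
fs -mkdir -p /in
fs -put common_access_log /in/

-- Only needed when piggybank is not picked up automatically.
-- Hypothetical path; point it at the piggybank.jar of your own installation.
REGISTER /usr/lib/pig/piggybank.jar;

-- Short alias so the fully qualified class name is written only once.
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

logs = LOAD '/in/common_access_log' USING ApacheCommonLogLoader
       AS (addr:chararray, logname:chararray, user:chararray, time:chararray,
           method:chararray, uri:chararray, proto:chararray,
           status:int, bytes:int);

-- Preview a few parsed records to confirm the loader works.
preview = LIMIT logs 5;
DUMP preview;
```

A response size logged as "-" cannot be cast to int, so that field should come back as null for such lines.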
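A minimal sketch of what common_log_process_script.pig could contain is given below. It assumes the log file is already at /in/common_access_log in HDFS; the jar path in the comment and the alias names are illustrative, not taken from the original script.

```pig
-- common_log_process_script.pig
-- Uncomment on releases where piggybank must be registered by hand (hypothetical path):
-- REGISTER /usr/lib/pig/piggybank.jar;

DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

logs = LOAD '/in/common_access_log' USING ApacheCommonLogLoader
       AS (addr:chararray, logname:chararray, user:chararray, time:chararray,
           method:chararray, uri:chararray, proto:chararray,
           status:int, bytes:int);

-- Keep only the address column, then count log entries per address.
addrs   = FOREACH logs GENERATE addr;
grouped = GROUP addrs BY addr;
counts  = FOREACH grouped GENERATE group AS addr, COUNT(addrs) AS hits;

-- Show the busiest addresses first.
ordered = ORDER counts BY hits DESC;
DUMP ordered;
```

Running `pig common_log_process_script.pig` from the command line should then print one (address, count) tuple per distinct host in the log.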