Processing Logs in Pig


In the previous post, we introduced log files and the architecture of log analysis in Hadoop. In this post, we will go into much more detail on processing logs in Pig.

As discussed in the previous post, there are three main types of log files.

  • Web Server Access Logs
  • Web Server Error Logs
  • Application Server Logs

Each of these log files will be in one of the three formats below.

  • Common Log File Format
  • Combined Log File Format
  • Custom Log File Format

In the following sections, we will look at each of these log file formats and how to process them in Pig.

Common Log Format Files:

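Each line of a Common Log Format file records, in order, the remote host, the remote logname, the remote user, the timestamp, the request line, the HTTP status code, and the response size in bytes. It is also known as the NCSA Common Log Format and is widely called Apache's Common Log Format, because the Apache web server writes its access logs in this format.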

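For example, the Apache HTTP Server documentation illustrates the format with this line: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Sample Common Log file for testing: common_access_log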

We already covered load functions in the previous post on Pig Load Functions, and this section is a good example of custom load functions in practice. Fortunately, Piggybank, a repository of user-submitted UDFs, contains a custom loader function, CommonLogLoader, for loading Apache's Common Log Format files into Pig. This Java class extends the RegExLoader class, which is itself a custom load function UDF.

To learn more about writing UDFs for custom load functions, refer to the post on that topic.

CommonLogLoader parses Common Log Format files with a regular expression that captures each field of a log line as a separate group.
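To make the idea concrete, here is a minimal sketch of the same kind of parsing written by hand with Pig's built-in TextLoader and REGEX_EXTRACT_ALL. The pattern below is our own illustration and is not the expression used inside CommonLogLoader. The input path and all relation and field names are assumptions.

```pig
-- Illustrative only: this hand-written pattern is NOT the one inside CommonLogLoader.
-- A line in this format (example from the Apache HTTP Server documentation):
--   127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

-- Read every log line as a single chararray field.
-- The path '/in/common_access_log' is an assumed location.
raw = LOAD '/in/common_access_log' USING TextLoader() AS (line:chararray);

-- One capture group per field; FLATTEN turns the tuple of groups into columns.
-- Field names are hypothetical.
parsed = FOREACH raw GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,
        '^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] "(\\S+) (.+?) (\\S+)" (\\S+) (\\S+)$')
) AS (remote_addr:chararray, remote_logname:chararray, remote_user:chararray,
      request_time:chararray, http_method:chararray, uri:chararray,
      protocol:chararray, status:chararray, bytes_sent:chararray);

-- Look at a handful of parsed records.
sample_rows = LIMIT parsed 5;
DUMP sample_rows;
```

A dedicated loader such as CommonLogLoader saves us from maintaining a pattern like this in every script.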

Example Use case of CommonLogLoader:

Let's put the above Apache Common Log file into the HDFS location /in/, register the piggybank jar file, define a temporary function (an alias) for CommonLogLoader, and use it to parse the common_access_log file.

From the Pig 0.13.0 release onward, piggybank.jar is included in Pig's lib directory, so we do not need to register it each time we start the Pig Grunt shell. On earlier releases, we need to register the jar file as shown below before using any UDFs from Piggybank.

REGISTER '/path_to_piggybank/piggybank.jar';

Let's put the Pig Latin commands in a script file, common_log_process_script.pig, and run it.

The script displays the number of requests for each address/host name in the log file:

[Screenshot: Apache's Common Log processing in Pig]
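The script itself is not reproduced in the text, so here is a minimal sketch of what common_log_process_script.pig could look like. It assumes the Piggybank class org.apache.pig.piggybank.storage.apachelog.CommonLogLoader, the input path /in/common_access_log, and a placeholder jar path. The alias ApacheCommonLogLoader and all relation and field names are hypothetical.

```pig
-- Needed only when piggybank.jar is not already on Pig's classpath;
-- the jar path is a placeholder.
REGISTER '/path_to_piggybank/piggybank.jar';

-- Temporary function (alias) for the Piggybank loader; the alias name is hypothetical.
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- The loader emits nine fields per log line; the field names here are our own choice.
logs = LOAD '/in/common_access_log' USING ApacheCommonLogLoader
       AS (remote_addr:chararray, remote_logname:chararray, remote_user:chararray,
           request_time:chararray, http_method:chararray, uri:chararray,
           protocol:chararray, status:chararray, bytes_sent:chararray);

-- Count the requests made by each address/host name.
by_addr     = GROUP logs BY remote_addr;
addr_counts = FOREACH by_addr GENERATE group AS remote_addr, COUNT(logs) AS hits;

DUMP addr_counts;
```

A script like this can be run with `pig common_log_process_script.pig`.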

Below is the output of the above Pig script run:

[Screenshot: Pig DUMP output]

To confirm that the log parsing is correct, we can verify the count (270) of the highlighted IP address (10.0.0.153) against the input log file on the local file system with the command below:

[Screenshot: verify count]

So we can confirm that Apache's Common Log files are processed successfully in Pig. These results can be stored in an HDFS file and fed to a Hive external table, which makes the data available to visualization tools such as Hunk or Tableau.
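Storing the results could look like the sketch below, which replaces DUMP with STORE. The output directory /out/common_log_addr_counts, the tab delimiter, the loader alias, and all relation and field names are assumptions made for illustration.

```pig
-- Assumes piggybank.jar is already on Pig's classpath (otherwise REGISTER it first).
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();

-- Input path and field names are assumptions.
logs = LOAD '/in/common_access_log' USING ApacheCommonLogLoader
       AS (remote_addr:chararray, remote_logname:chararray, remote_user:chararray,
           request_time:chararray, http_method:chararray, uri:chararray,
           protocol:chararray, status:chararray, bytes_sent:chararray);

by_addr     = GROUP logs BY remote_addr;
addr_counts = FOREACH by_addr GENERATE group AS remote_addr, COUNT(logs) AS hits;

-- Write tab-separated (address, count) rows to an HDFS directory.
-- The output path is hypothetical and must not exist before the script runs.
STORE addr_counts INTO '/out/common_log_addr_counts' USING PigStorage('\t');
```

A Hive external table whose LOCATION points at this output directory, declared with the same field delimiter, can then read the files without copying them.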

Combined Log Format Files:

The Combined Log Format extends the Common Log Format with two extra fields at the end of each line: the referrer and the user agent.

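For example, the Apache HTTP Server documentation illustrates the format with this line: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

Sample Combined Log file for testing: combined_access_log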

Similar to CommonLogLoader, Piggybank provides a custom loader function, CombinedLogLoader, for loading Combined Log Format files into Pig. This Java class also extends the RegExLoader class, which is a custom load function UDF.

CombinedLogLoader parses Combined Log Format files with a regular expression that captures each field of a log line, including the referrer and the user agent, as a separate group.

Example Use case of CombinedLogLoader:

As in the CommonLogLoader use case above, let's copy the sample Combined Log Format file into the HDFS location /in/ and run the Pig Latin script below to get the top 10 referrers and their counts in the access log file.
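The script is not reproduced in the text, so the following is a minimal sketch of one way to write it. It assumes the Piggybank class org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader and the input path /in/combined_access_log. The alias ApacheCombinedLogLoader and all relation and field names are hypothetical.

```pig
-- Assumes piggybank.jar is already on Pig's classpath (otherwise REGISTER it first).
DEFINE ApacheCombinedLogLoader org.apache.pig.piggybank.storage.apachelog.CombinedLogLoader();

-- The loader emits eleven fields: the nine Common Log fields plus referrer and user agent.
-- Field names are our own choice.
logs = LOAD '/in/combined_access_log' USING ApacheCombinedLogLoader
       AS (remote_addr:chararray, remote_logname:chararray, remote_user:chararray,
           request_time:chararray, http_method:chararray, uri:chararray,
           protocol:chararray, status:chararray, bytes_sent:chararray,
           referer:chararray, user_agent:chararray);

-- Count the requests per referrer.
by_referer     = GROUP logs BY referer;
referer_counts = FOREACH by_referer GENERATE group AS referer, COUNT(logs) AS hits;

-- Sort by count, highest first, and keep the first 10 rows.
sorted_counts  = ORDER referer_counts BY hits DESC;
top_referers   = LIMIT sorted_counts 10;

DUMP top_referers;
```

Requests with no referrer are usually logged with "-" in that field, so they appear as one group in the result. A FILTER before the GROUP can drop them if they are not of interest.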