Log Analysis in Hadoop


In this post we will discuss various log file types and log analysis in Hadoop.

Log Files:

Logs are computer-generated files that capture network and server operations data. They are useful during various stages of software development, mainly for debugging and profiling, and also for managing network operations.

Need For Log Files:

Log files are commonly used at customer installations for permanent software monitoring and/or fine-tuning. Logs are essential in operating systems, computer networks, distributed systems and storage filers. Uses of log file analysis include:

  • Application/hardware debugging and profiling
  • Error or access statistics are useful for fine-tuning application/hardware functionality – for example, based on the frequency of an error message over the past 6 months, we can forecast its future occurrences; if we can provide a fix before it occurs on the customer’s application/hardware, customer satisfaction improves, which in turn grows the business.
  • Security monitoring of Application/hardware – For example, if we suspect a security breach, we can use server log data to identify and repair the vulnerability.

Log File Types:

These log files can be generated from two types of servers – Web Servers and Application Servers.

Web Server Logs:

Web servers typically provide at least two types of log files: access log and error log.

  • Access log – records all requests made to the server, including the client IP address, URL, response code, response size, etc.
  • Error log – records all requests that failed and the reason for the failure as recorded by the application.

For example, logs generated by any web server such as Apache, including the logs of this site hadooptutorial.info, are provided in the following sections of this post; a minimal parsing sketch follows.
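
To make these access-log fields concrete, here is a minimal parsing sketch in Python. The sample line and the regular expression assume the Apache common log format, and the field names are only illustrative.

    import re

    # Apache common log format (assumed): %h %l %u %t "%r" %>s %b
    LOG_PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<size>\d+|-)'
    )

    # A made-up sample line in that format
    sample = ('203.0.113.7 - - [10/Oct/2015:13:55:36 +0530] '
              '"GET /hadoop/log-analysis HTTP/1.1" 200 2326')

    match = LOG_PATTERN.match(sample)
    if match:
        # -> {'ip': '203.0.113.7', 'time': '10/Oct/2015:13:55:36 +0530',
        #     'method': 'GET', 'url': '/hadoop/log-analysis',
        #     'status': '200', 'size': '2326'}
        print(match.groupdict())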

Application Server Logs:

These are the logs generated by application servers. The custom logs they produce can give application developers and analysts a great level of detail about how the application is used. Since developers can log arbitrary information, application server logs can be even larger than web server logs.

For example, logs generated by Hadoop daemons can be treated as Application logs.

Challenges in Processing Log Files:

  • As log files are continuously produced in various tiers with different types of information, the main challenge is to store and process this much data efficiently to produce rich insights into application and customer behavior. For example, even a moderately busy web server generates gigabytes of logs per month.
  • We cannot realistically store this much data in a relational database system: commercial RDBMS products can be very expensive, and cheaper alternatives like MySQL cannot scale to the volume of data that is continuously being added.
  • A better solution is to store all the log files in HDFS, which keeps data on commodity hardware, so it is cost effective to store huge volumes (TBs or PBs) of log files, and Hadoop provides the MapReduce framework for parallel processing of these files (a small sketch follows this list).
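
As a sketch of the parallel-processing step mentioned above, the two Hadoop Streaming scripts below (a hypothetical mapper.py and reducer.py) count requests per HTTP status code. They assume the whitespace-separated access-log layout shown earlier and are meant as an illustration only, not as the definitive way to process logs on Hadoop.

    # mapper.py - emits "status<TAB>1" for every access-log line on stdin
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) > 8:
            status = parts[8]            # status code field (assumed position)
            print('%s\t%s' % (status, 1))

The reducer simply totals the counts; Hadoop Streaming delivers the mapper output sorted by key, so a running sum per key is enough.

    # reducer.py - sums the per-status counts emitted by the mapper
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, count))

Such a job could be submitted with the standard streaming jar, for example: hadoop jar hadoop-streaming-*.jar -input /logs/access -output /logs/status_counts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the paths here are illustrative).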

Hadoop ecosystem components such as Pig and Hive support various UDFs that can be used to parse these unstructured log files and store them in a structured format; a small illustration of this raw-to-structured step follows.
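
In practice this parsing logic would live inside a Pig or Hive UDF. Purely as an illustration of the raw-to-structured step, the small helper below flattens a parsed record (such as the dictionary produced by the earlier regex sketch) into a tab-delimited row that a table load could consume; the column order is an assumption.

    def to_tsv(record):
        """Flatten a parsed log record into one tab-delimited line.

        Column order (assumed): ip, time, method, url, status, size.
        """
        columns = ('ip', 'time', 'method', 'url', 'status', 'size')
        return '\t'.join(record.get(c, '') for c in columns)

    # For the sample line parsed earlier this yields:
    # 203.0.113.7<TAB>10/Oct/2015:13:55:36 +0530<TAB>GET<TAB>/hadoop/log-analysis<TAB>200<TAB>2326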

Log File Processing Architecture:

Hadoop efficiently processes structured, semi-structured and unstructured data. Log files are a good real-world example of unstructured data, and processing them is an excellent use case for Hadoop in action.

Below is the high-level architecture of log analysis in Hadoop and of producing useful visualizations from it.

(Figure: Log Analysis in Hadoop architecture)

As shown in the architecture above, the major roles in log analysis in Hadoop are the following.

Flume – Collects streaming log data into HDFS from various HTTP sources and application servers.

HDFS – The storage file system for the huge volumes of log data collected by Flume.

Pig – Parses these log files into a structured format through various UDFs.

Hive – Hive or HCatalog defines a schema over this structured data, and the schema is stored in the Hive metastore.

Hunk – A search, processing and visualization tool that connects to the Hive server and metastore and pulls the structured data into it. On top of it we can build various types of visualization charts. For more details on this connectivity to Hive and the visualizations built on top of it, refer to the post Hunk Hive connectivity.

Tableau – A visualization tool that provides connectivity to the Hive server. For more details, refer to the post Tableau connectivity with Hive.
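
To give a flavour of this Hive-server connectivity, the sketch below queries HiveServer2 with the third-party PyHive client. The host name, credentials and the access_logs table are assumptions made only for illustration.

    # Query a structured log table through HiveServer2 (sketch; assumes
    # `pip install pyhive` plus its Thrift/SASL dependencies).
    from pyhive import hive

    conn = hive.Connection(host='hive-server.example.com', port=10000,
                           username='hadoop', database='default')
    cursor = conn.cursor()
    cursor.execute(
        'SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status'
    )
    for status, hits in cursor.fetchall():
        print(status, hits)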

Sample Use Cases for Log File Analysis:
  • Ranked list of the most significant trends and page views over the last day/week/month
  • Identify the characteristics of the most active users to help promote this behavior across the board
  • Correlate user sessions with transactional data to better understand how to improve sales

In the next post, we will discuss loading and parsing web server logs and custom application logs using Pig, and structuring them so that a schema can be defined in Hive/HCatalog.


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



