Azkaban Hadoop – A Workflow Scheduler For Hadoop


In this post, we will discuss the basics of Azkaban Hadoop and its setup on an Ubuntu machine.

What is Azkaban Hadoop?

Azkaban Hadoop is an open-source workflow engine for the Hadoop ecosystem. It is a batch job scheduler that lets developers control job execution inside Java and, especially, Hadoop projects.

Azkaban was developed at LinkedIn and is written in Java, JavaScript and Clojure. Its main purpose is to solve the problem of Hadoop job dependencies.

Azkaban Vs Oozie:

Azkaban can be treated as a competitor to Oozie, the well-known Apache Hadoop ecosystem workflow engine for Hadoop job scheduling.

Common features between Azkaban and Oozie:
  • Both are open-source workflow engines for Hadoop job scheduling: a series of MapReduce, Pig, Hive, HCatalog, Sqoop, Java and Unix shell script actions can be run as a single workflow job.
  • Both are written in Java.
Differences between Azkaban and Oozie:
  • Azkaban workflow schedules are very simple and easy to define, whereas Oozie workflow definitions are more complex.
  • Azkaban supports only time-based scheduling, not input-data-dependent scheduling, whereas Oozie supports both time-based and input-data-based scheduling.
  • Azkaban scheduling is done only through the web UI in a browser, while Oozie scheduling can be done via the command line, the Java API, or a web browser.
  • Azkaban keeps the state of all running workflows in memory, while Oozie uses a SQL database and holds a workflow's state in memory only while performing a state transition.
  • The Azkaban web UI provides a smoother user interface and a richer set of visualizations than Oozie.
  • Azkaban properties files are Java-style property files, whereas Oozie uses XML files for defining properties.
Azkaban Features:

Below are the important features of Azkaban Hadoop.

  • Time-based scheduling of workflows with dependencies for Hadoop jobs; compatible with any version of Hadoop.
  • Provides an easy-to-use web UI and a rich set of visualizations for displaying interactive graphs in the browser.
  • Supports email alerts on workflow failure and success, as well as SLA (service level agreement) alerting via email.
  • Failed jobs can be retried directly from the browser.
  • Modular and pluggable for each part of the Hadoop ecosystem; for example, a separate HDFS plugin allows browsing files on Hadoop via the web UI.
  • Tracks user actions and supports authentication.
  • Provides a separate project workspace for each project for easier future reference.

Azkaban consists of three key components:

  • Relational Database (MySQL)
  • AzkabanWebServer
  • AzkabanExecutorServer

For a production environment all three of the above components are recommended, but for trial purposes Azkaban provides a solo server that can be run on a single machine to play around with some sample examples.

So, in this post, we will install the Azkaban solo server and try a few examples on this scheduler.

Azkaban Installation on Ubuntu Machine:

In this section, we will install the Azkaban solo server on an Ubuntu machine on which Hadoop is installed in pseudo-distributed mode. The solo server is easy to install and does not require a MySQL instance, since it ships with its own embedded H2 database. It is also easy to start up: both the web server and the executor server run in the same process.

Azkaban Installation Procedure:
  • Download the latest stable version of the Azkaban solo server from the Azkaban downloads page at http://azkaban.github.io/downloads.html. At the time of writing, Azkaban 2.5.0 is the latest stable version available, so that is the version installed in this post.
  • Copy the gzipped binary tarball into the preferred installation directory (usually /usr/lib/azkaban/) and extract its contents there.
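As a sketch of these two steps (the tarball filename is an assumption based on version 2.5.0; use whatever filename the downloads page actually serves):

```shell
# Create the installation directory and extract the solo-server tarball there.
# The filename azkaban-solo-server-2.5.0.tar.gz is an assumption; adjust it to
# match the file actually downloaded from the Azkaban downloads page.
sudo mkdir -p /usr/lib/azkaban
sudo cp azkaban-solo-server-2.5.0.tar.gz /usr/lib/azkaban/
cd /usr/lib/azkaban
sudo tar -xzf azkaban-solo-server-2.5.0.tar.gz
```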


Note: Do not add AZKABAN_HOME or its bin directory to the PATH in .bashrc and then run the start/stop .sh scripts from the home directory. The scripts resolve their configuration directories (such as conf and sql) relative to the current working directory, so running them from anywhere other than the Azkaban installation directory produces errors.

Start Azkaban Solo Server:
  • Start the Azkaban solo server from the Azkaban installation directory with the command $ bin/azkaban-solo-start.sh.


  • During startup, Azkaban automatically detects the Hadoop home directory, the Hive home directory, and Hadoop's classpath.
  • The web UI is then available at http://localhost:8081/index in the browser.
  • The server can be shut down with the command $ bin/azkaban-solo-shutdown.sh.
  • Log in to the web UI with azkaban as both the username and the password.
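Putting the start and stop steps together, a minimal session might look like this (the azkaban-solo-2.5.0 directory name follows from the version extracted earlier):

```shell
# The scripts must be run from the installation directory itself.
cd /usr/lib/azkaban/azkaban-solo-2.5.0
bin/azkaban-solo-start.sh        # web server and executor start in one process
# The web UI is now reachable at http://localhost:8081/index (azkaban / azkaban)
bin/azkaban-solo-shutdown.sh     # stop the solo server
```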


Creating Workflows in Azkaban:

A flow is a set of jobs that depend on one another: a job's dependencies always run before the job itself. Multiple job names, separated by commas, can be given in the dependencies parameter.

Creating a job in Azkaban is very easy: we create a properties file with a .job extension. This file defines the type of job to be run, its dependencies, and any parameters needed to set the job up correctly.

For example, a minimal job file defines a job of type command, which in this case runs a command that prints “Hello World”.
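A minimal version of such a job file might look like this (the filename test1.job is an assumption based on the job names test1 and test2 used later in the post):

```
# test1.job – a command-type job that prints "Hello World"
type=command
command=echo "Hello World"
```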

Creating a Flow:

A flow is formed through the dependencies parameter: chain jobs together with dependencies, and the flow takes the name of the final job in the chain. To embed one flow inside another, create a .job file with type=flow and flow.name set to the name of the embedded flow.

To execute these flows in the Azkaban web UI, the .job files must be placed together in a directory and that directory packaged as a zip file for upload.

For example, we place the .job files above in a testflow directory and zip it up as the project archive.
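The directory layout and zip step can be sketched as follows (the job file contents are assumptions consistent with the Hello World example; python3's zipfile CLI is used here so the .job files land at the top level of the archive, where Azkaban expects them, but zip(1) works equally well):

```shell
# Build the example project: two command jobs, test2 depending on test1.
mkdir -p testflow
printf 'type=command\ncommand=echo "Hello World"\n' > testflow/test1.job
printf 'type=command\ncommand=echo "Second job"\ndependencies=test1\n' > testflow/test2.job
# Zip the .job files at the archive root (not inside a subfolder).
(cd testflow && python3 -m zipfile -c ../testflow.zip test1.job test2.job)
```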

Next, in the Azkaban web UI:

  • Create a new project in Azkaban.
  • Upload the project zip file into the project.
  • Execute the flow; the execution is submitted.

Verify the completion of the jobs in the flow from the Executing/History tabs on the page, where the results of the two jobs test1 and test2 can be checked.

Thus we have successfully set up the Azkaban solo server and run a sample flow of jobs with dependencies. We will run a few Hadoop-based jobs with time-based scheduling in the next post in this category.

Note: For in-depth details on the Azkaban scheduler, refer to the Azkaban documentation page.



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



