In this post, we will discuss the basics of Azkaban Hadoop and its setup on an Ubuntu machine.
What is Azkaban Hadoop?
Azkaban Hadoop is an open-source workflow engine for the Hadoop ecosystem. It is a batch job scheduler that lets developers control job execution inside Java and especially Hadoop projects.
Azkaban Vs Oozie:
Azkaban can be treated as a competitor to Oozie, the well-known Apache Hadoop ecosystem tool that is also a workflow engine for Hadoop job scheduling.
Common features between Azkaban and Oozie:
- Both are open source workflow engines for Hadoop job scheduling. We can run a series of MapReduce, Pig, Hive, HCatalog, Sqoop, Java and Unix shell script actions as a single workflow job.
- Both are written in Java.
Differences between Azkaban and Oozie:
- Azkaban is simple and makes workflow schedules easy to define, whereas Oozie workflows are more complex to define.
- Azkaban supports only time-based scheduling, not input-data-dependent scheduling, whereas Oozie supports both time-based and input-data-based scheduling.
- Azkaban scheduling is done only in the GUI via a web browser. In Oozie, scheduling can be done via the command line, the Java API and a web browser.
- Azkaban keeps the state of all running workflows in memory, whereas Oozie stores it in a SQL database and holds a workflow's state in memory only while performing a state transition.
- The Azkaban web UI provides a smoother user interface and a richer set of visualizations than Oozie.
- Azkaban properties are defined in Java-style property files, whereas Oozie uses XML files for defining properties.
Below are the important features of Azkaban Hadoop.
- Time-based dependency scheduling of workflows for Hadoop jobs, compatible with any version of Hadoop.
- Provides an easy-to-use web UI and a rich set of visualizations for displaying interactive graphs in the browser.
- Supports email alerts on failure and success of workflows, as well as SLA (service level agreement) alerting via email.
- Failed jobs can be retried from the browser itself.
- It is modular and pluggable for each Hadoop ecosystem component. For example, it provides a separate HDFS plugin for browsing files on Hadoop via the web UI.
- Tracks user actions and supports authentication.
- Provides a separate project workspace for each project for easier future reference.
Azkaban consists of 3 key components:
- Relational Database (MySQL)
- Azkaban Web Server
- Azkaban Executor Server
For a production environment, all three of the above components are recommended, but for trial purposes Azkaban provides a solo server that can be used on a single machine to play around with some sample examples.
So, in this post, we will install the Azkaban solo server and try a few examples on this scheduler.
Azkaban Installation on Ubuntu Machine:
In this section, we will install the Azkaban solo server on an Ubuntu machine on which Hadoop is installed in pseudo-distributed mode. The Azkaban solo server is easy to install and does not require a MySQL instance, as it has its own embedded H2 database. It is also easy to start up, because both the web server and the executor server run in the same process.
Azkaban Installation Procedure:
- Download the latest stable version of the Azkaban solo server from the Azkaban downloads page at http://azkaban.github.io/downloads.html. At the time of writing this post, Azkaban-2.5.0 is the latest stable version available, so we are installing azkaban-2.5.0 in this post.
- Copy the gzipped binary tarball into our preferred installation directory (usually /usr/lib/azkaban/) and extract its contents in this folder.
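The extraction step above can be sketched as a small shell function. This is only a sketch under assumptions: the helper name extract_azkaban and the tarball file name in the example call are made up for illustration, and writing under /usr/lib normally requires root privileges.

```shell
#!/bin/sh
# Hypothetical helper: unpack a downloaded Azkaban tarball into an install directory.
# Usage: extract_azkaban <tarball> <install-dir>
extract_azkaban() {
    tarball="$1"
    install_dir="$2"
    # Create the target directory if it does not exist yet
    mkdir -p "$install_dir" || return 1
    # -x extract, -z gunzip, -f archive file, -C directory to extract into
    tar -xzf "$tarball" -C "$install_dir"
}

# Example call (assumed file name; run from a root shell if /usr/lib is not writable):
# extract_azkaban "$HOME/Downloads/azkaban-solo-server-2.5.0.tar.gz" /usr/lib/azkaban
```

After extraction, the start and shutdown scripts are expected in the bin directory of the unpacked folder.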
Note: We should not set up an AZKABAN_HOME environment variable or add its bin directory to PATH in order to run the solo server's start/stop .sh files. Because of the internal configuration, we will receive error messages if we add this installation directory to the .bashrc file and execute the commands below from the home directory instead of the Azkaban installation directory.
Start Azkaban Solo Server:
- Start the Azkaban solo server by running $ bin/azkaban-solo-start.sh from the Azkaban installation directory.
- On startup, Azkaban automatically detects the Hadoop home directory, the Hive home directory and Hadoop’s classpath directory.
- The web UI is then available at http://localhost:8081/index in our browser.
- We can shut down the Azkaban server with the command $ bin/azkaban-solo-shutdown.sh.
- Log in to the web UI with azkaban as both the username and the password.
Creating Work Flows in Azkaban:
A flow is a set of jobs that depend on one another. The dependencies of a job always run before the job itself can run. We can give multiple job names, separated by commas, in the dependencies parameter.
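As a rough sketch of how this looks in practice, the commands below write three small .job files (the file format is described in the next paragraph) in which report runs only after both extract and clean have finished. The directory name, job names and echo commands are made up for illustration; dependencies refers to other jobs by file name without the .job extension.

```shell
#!/bin/sh
# Illustrative only: the directory name, job names and commands are made up.
mkdir -p deps-demo

cat > deps-demo/extract.job <<'EOF'
type=command
command=echo "extract"
EOF

cat > deps-demo/clean.job <<'EOF'
type=command
command=echo "clean"
EOF

# report waits for both jobs above; the names are comma separated
cat > deps-demo/report.job <<'EOF'
type=command
dependencies=extract,clean
command=echo "report"
EOF
```

Since extract and clean do not depend on each other, only report is constrained to run last.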
Creating a job in Azkaban is very easy. We create a properties file with a .job extension. This job file defines the type of job to be run, its dependencies and any parameters needed to set up our job correctly.
For example, a sample job file can set the type of the job to command. In this case, Azkaban will run the given command, here one that prints “Hello World”.
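A minimal job file of this kind could be written as follows. The file name hello.job is an assumption; type and command are the two properties a command job needs.

```shell
#!/bin/sh
# Write a minimal command-type job; the file name hello.job is an assumption.
cat > hello.job <<'EOF'
# hello.job
type=command
command=echo "Hello World"
EOF

# The same command can be tried directly in a shell before handing it to Azkaban.
echo "Hello World"
```

Lines starting with # inside the job file are treated as comments, as in any Java properties file.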
Creation of Flow:
To include an existing flow as a step in another flow, simply create a .job file with type set to flow and flow.name set to the name of the flow to include.
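A job file of this kind might look as follows. This sketch assumes the job also sets type=flow, which is how Azkaban 2.5 job files refer to another flow by name; the names baz.job and bar are placeholders.

```shell
#!/bin/sh
# Assumed names: an existing flow called "bar" is referenced from a job file baz.job.
cat > baz.job <<'EOF'
# baz.job
type=flow
flow.name=bar
EOF
```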
In order to execute these flows from the Azkaban web UI, we need to put these .job files in a directory and create a zip file of that directory.
For example, we created the above .job files in such a directory and zipped the folder into a file named testflow.
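The packaging step can be sketched like this. The contents of the job files are placeholders, and the directory and archive names (testflow, testflow.zip) are assumptions based on the names used in this post. The archive here contains the job files at its top level.

```shell
#!/bin/sh
# Sketch: build a two-job flow and package it for upload.
# Directory, file and archive names are assumptions; the commands are placeholders.
mkdir -p testflow

cat > testflow/test1.job <<'EOF'
type=command
command=echo "running test1"
EOF

# test2 runs only after test1 has completed
cat > testflow/test2.job <<'EOF'
type=command
dependencies=test1
command=echo "running test2"
EOF

# Zip the job files; fall back to Python's zipfile module if the zip utility is absent.
if command -v zip >/dev/null 2>&1; then
    (cd testflow && zip -q ../testflow.zip test1.job test2.job)
else
    (cd testflow && python3 -m zipfile -c ../testflow.zip test1.job test2.job)
fi
```

The resulting testflow.zip is the file to upload to a project in the web UI.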
Execute the flow.
Verify the completion of the jobs in the flow from the Executing/History tabs on the page. In this example, these are the two jobs test1 and test2.
Thus we have successfully set up the Azkaban solo server and run a sample flow of jobs with dependencies. We will run a few Hadoop-based jobs with time scheduling in the next post in this category.
Note: For in-depth details on the Azkaban scheduler, refer to the Azkaban documentation page.