Oozie Notes



  • Workflow scheduler for managing Hadoop and related jobs
  • First developed in Bangalore by Yahoo
  • A workflow is a DAG (Directed Acyclic Graph) of actions
  • Acyclic means the graph cannot contain loops. The action nodes of the graph are linked by control dependencies: a second action cannot run until the first action has completed
  • Oozie workflow definitions are written in the Hadoop Process Definition Language (hPDL) and stored as an XML file (workflow.xml)
  • Workflow contains:
    • Control flow nodes (defines start, end and execution path of the workflow)
    • Action nodes (trigger execution of tasks)
  • Actions can be parameterized: variables take their values from a job.properties file and are written in the form ${variable_name}
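
As an illustration of this parameterization, a job.properties file might look like the sketch below. The host names, ports, and directory names are placeholder assumptions, not values from these notes; oozie.wf.application.path is the standard property that points Oozie at the workflow directory in HDFS. The heredoc delimiter is quoted so the shell leaves the ${...} references untouched for Oozie to resolve.

```shell
# Write a sample job.properties into a scratch directory (all values are placeholders).
mkdir -p oozie_demo
cat > oozie_demo/job.properties <<'EOF'
nameNode=hdfs://localhost:8020
jobTracker=localhost:8032
queueName=default
inputDir=${nameNode}/user/${user.name}/input
oozie.wf.application.path=${nameNode}/user/${user.name}/oozieWF
EOF

# Show the result; workflow.xml can then refer to ${nameNode}, ${queueName}, and so on.
cat oozie_demo/job.properties
```

Inside workflow.xml, an element such as <name-node>${nameNode}</name-node> would then pick up the value defined here.
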
  • Control flow nodes include:
    • START
    • FORK (splits an execution path into multiple concurrent execution paths)
    • JOIN (waits until all concurrent execution paths have completed)
    • DECISION (case statement used to select execution path)
    • KILL
    • END
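
To make the control flow nodes listed above concrete, here is a minimal workflow.xml sketch that uses each of them once. The node names, the /tmp/demo paths, and the choice of simple file system actions are assumptions made for illustration; the schema version (uri:oozie:workflow:0.5) may need adjusting for your Oozie release.

```shell
# Write an illustrative workflow.xml that exercises every control flow node.
mkdir -p oozie_demo/control-flow
cat > oozie_demo/control-flow/workflow.xml <<'EOF'
<workflow-app name="control-flow-demo" xmlns="uri:oozie:workflow:0.5">
    <!-- START: entry point of the workflow -->
    <start to="split"/>

    <!-- FORK: run two execution paths concurrently -->
    <fork name="split">
        <path start="make-dir-a"/>
        <path start="make-dir-b"/>
    </fork>

    <action name="make-dir-a">
        <fs><mkdir path="${nameNode}/tmp/demo/a"/></fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>
    <action name="make-dir-b">
        <fs><mkdir path="${nameNode}/tmp/demo/b"/></fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- JOIN: wait for both forked paths before continuing -->
    <join name="merge" to="check"/>

    <!-- DECISION: case-style selection of the next node -->
    <decision name="check">
        <switch>
            <case to="done">${fs:exists(concat(nameNode, '/tmp/demo/a'))}</case>
            <default to="fail"/>
        </switch>
    </decision>

    <!-- KILL: stop the workflow with an error message -->
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>

    <!-- END: successful completion -->
    <end name="done"/>
</workflow-app>
EOF
```

Every path started by a FORK has to arrive at the same JOIN, which is why both actions transition to "merge" on success.
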
  • Action Nodes include:
    • Java MapReduce
    • Streaming MapReduce
    • Pig
    • Hive
    • Sqoop
    • FileSystem tasks
    • Distributed copy
    • Java programs
    • Shell scripts
    • Http
    • Email
    • Oozie sub-workflows
  • Oozie detects completion of a job by:
    • Callback
    • Polling
  • Actions have 2 transitions:
    • OK
    • ERROR
  • Action Node Additional elements:
    • PREPARE
    • JOB-XML
    • CONFIGURATION
    • FILE
    • ARCHIVE
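
The two transitions and the additional elements above can be seen together in a single action definition. The fragment below is a sketch of a MapReduce action; the action name, wordcount-site.xml, lookup.txt, dictionaries.zip, and the ${outputDir} and ${queueName} variables are hypothetical and would have to exist in your own application.

```shell
# Write an illustrative action fragment (it belongs inside a <workflow-app> element).
mkdir -p oozie_demo
cat > oozie_demo/action-fragment.xml <<'EOF'
<action name="wordcount">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- PREPARE: clean up before the job starts -->
        <prepare>
            <delete path="${outputDir}"/>
        </prepare>
        <!-- JOB-XML: extra Hadoop configuration file shipped with the application -->
        <job-xml>wordcount-site.xml</job-xml>
        <!-- CONFIGURATION: inline job properties -->
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <!-- FILE and ARCHIVE: resources made available to the tasks -->
        <file>lookup.txt#lookup</file>
        <archive>dictionaries.zip#dict</archive>
    </map-reduce>
    <!-- The two transitions: OK and ERROR -->
    <ok to="done"/>
    <error to="fail"/>
</action>
EOF
```
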
  • Built-in constants:
    • MB, GB, TB, PB
  • Built-in functions:
    • trim(), concat(), timestamp(), etc.
  • Access to workflow attributes:
    • wf:id()
    • wf:name()
    • wf:user()
    • wf:errorCode()
    • wf:errorMessage()
    • wf:lastErrorNode()
    • etc.
  • Counters:
    • RECORDS
    • MAP_IN
    • MAP_OUT
    • REDUCE_IN
    • REDUCE_OUT
    • GROUPS
  • File system functions:
    • fs:exists()
    • fs:isDir()
    • fs:dirSize()
    • fs:fileSize()
    • fs:blockSize()
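
The constants and functions above are used inside ${...} expressions in workflow.xml. The fragment below sketches how a few of them might be combined in a decision node and a kill message. The node names and the inputFile variable are hypothetical, and the hadoop:counters(...)[RECORDS][MAP_IN] form follows my reading of the Oozie workflow specification, so treat it as an assumption to verify against your version.

```shell
# Write an illustrative fragment that uses EL constants and functions.
mkdir -p oozie_demo
cat > oozie_demo/el-fragment.xml <<'EOF'
<decision name="check-input">
    <switch>
        <!-- File system functions combined with the GB constant -->
        <case to="big-job">${fs:exists(inputFile) and fs:fileSize(inputFile) gt 1 * GB}</case>
        <!-- Counter of an earlier MapReduce action assumed to be named 'wordcount' -->
        <case to="report">${hadoop:counters('wordcount')[RECORDS][MAP_IN] gt 0}</case>
        <default to="fail"/>
    </switch>
</decision>
<kill name="fail">
    <!-- Workflow attribute functions used to build a readable error message -->
    <message>Job ${wf:id()} (${wf:name()}) run by ${wf:user()} failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
</kill>
EOF
```

The comparison operators are written as gt rather than > so that the expression does not need XML escaping.
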
  • HDFS should contain:
    • workflow.xml
    • config-default.xml (optional)
    • /lib (contains JAR files and shared libraries)
  • Local file system should contain:
    • job.properties (passed to the oozie command with -config)
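
The split between HDFS and the local file system can be sketched as follows. The directory name oozieWF, the jar name, and the target HDFS home directory are assumptions; the empty files only stand in for your real workflow.xml and libraries. The upload step is skipped when no hdfs client is installed.

```shell
# Stage the application directory locally, mirroring the layout expected in HDFS.
app_dir=oozie_demo/oozieWF
mkdir -p "$app_dir/lib"
touch "$app_dir/workflow.xml"         # required workflow definition (placeholder)
touch "$app_dir/config-default.xml"   # optional default properties (placeholder)
touch "$app_dir/lib/my-job.jar"       # hypothetical jar used by the actions

# job.properties stays on the local file system next to where oozie is invoked.
touch oozie_demo/job.properties

# Copy the application directory into HDFS if an hdfs client is available.
hdfs_home="/user/$(id -un)"
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -mkdir -p "$hdfs_home"
    hdfs dfs -put -f "$app_dir" "$hdfs_home/"
else
    echo "hdfs client not found; skipping upload of $app_dir"
fi
```
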
  • Oozie jobs can be run via:
    • Command line tools
    • Web server API
    • Java API
  • LAB1: workflow.xml
  • LAB1 : job.properties
  • OOZIE COORDINATOR
    • Supports the automated starting of Oozie workflow processes
    • Can control a job (define triggers that invoke workflows) based on:
      • Data availability
      • Time
      • Other external events
    • Can define dependencies between workflows/jobs with a workflow
    • Defined in XML
    • Uses UTC time
  • Oozie components


  • Synchronous datasets:
    • Are produced at regular intervals
  • coordinator.xml has:
    • Start, end
    • Timezone
    • Datasets to be used
    • Actions (workflows to be invoked)
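
The elements listed above map onto a coordinator.xml roughly as in the sketch below, together with a matching coordinator.properties. The application name, dates, dataset layout, and paths are hypothetical; oozie.coord.application.path is the standard property that locates the coordinator definition in HDFS, and the schema version may need adjusting for your Oozie release.

```shell
# Write an illustrative coordinator.xml: start, end, timezone, datasets, and action.
mkdir -p oozie_demo/oozieCoord
cat > oozie_demo/oozieCoord/coordinator.xml <<'EOF'
<coordinator-app name="daily-demo" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <!-- Datasets to be used: one directory per day (a synchronous dataset) -->
    <datasets>
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <!-- Data availability trigger: wait for the current day's instance -->
    <input-events>
        <data-in name="todaysLogs" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <!-- Action: the workflow to be invoked -->
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>
EOF

# Matching properties file; times are given in UTC.
cat > oozie_demo/coordinator.properties <<'EOF'
nameNode=hdfs://localhost:8020
startTime=2016-01-01T00:00Z
endTime=2016-01-31T00:00Z
workflowAppPath=${nameNode}/user/${user.name}/oozieWF
oozie.coord.application.path=${nameNode}/user/${user.name}/oozieCoord
EOF
```
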
  • Coordinator files:
    • coordinator.xml
    • coord-config-default.xml
    • coordinator.properties
  • Running an Oozie workflow:

oozie job -oozie http://localhost:8280/oozie -config [location of]/job.properties -run

  • You may export the location of Oozie so that you do not need to specify it in the run command:
    • export OOZIE_URL=http://localhost:8280/oozie
    • oozie job -config [location of]/job.properties -run
  • Running an Oozie coordinator:
    • oozie job -run -config [location of]/coordinator.properties
    • This assumes that OOZIE_URL was exported
  • Coordinator jobs: schedule workflows by time
  • Bundle jobs: group and chain multiple coordinator jobs
  • Starting an Oozie job:

$ ./oozie job -run -config home/cloudera/oozieWF/job.properties


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
