Apache Tez – Successor of MapReduce Framework


Apache Tez Overview

What is Apache Tez?

Apache Tez is an execution framework project from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework.

Apache Tez Features:

Tez provides:

  • Performance gains over MapReduce, along with backward compatibility with the MapReduce framework
  • Optimal resource management
  • Plan reconfiguration at run-time
  • Dynamic physical data flow decisions
  • A purely client-side application model, so it is simple and easy to try out; no cluster-side deployments are needed

Data processing tasks that earlier took multiple MapReduce jobs can now be done in a single Tez job. Tez also supports running existing MapReduce jobs on top of the Tez framework, providing an easy upgrade path for existing MapReduce users.

End user advantages for Tez framework:

  • Better application performance and predictability of results
  • Reduced load on HDFS and reduced network usage

Tez layer on Hadoop 2 Architecture:

[Figure: Tez layer in the Hadoop 2 architecture]

Tez Data Processing model:

[Figure: Tez data processing model]

Tez Terminology:

In Tez parlance, a MapReduce job is a simple DAG (Directed Acyclic Graph): the map and reduce tasks are the vertices of the execution graph, and an edge (the shuffle) connects every map task to every reduce task.

Apache Tez Integration With Hadoop:

Prerequisites:
  • Apache Hadoop 2 with YARN
  • Maven 3
  • Protocol Buffers 2.5 or later

The Tez execution engine is already supported in the Hive 0.13.1 release, but Hive 0.13.1 requires at least the Tez 0.4.1 release to run queries on the Tez engine.

Apache Hive 0.13.1 integrates correctly only with Tez 0.4.1. If we try to integrate Hive 0.13.1 with Tez 0.5.0 or 0.5.1, we will receive many error messages due to version incompatibility.

In this post we use the components and versions below to install Tez on Hadoop without any version inconsistencies.

  • Ubuntu machine with Hadoop 2.3.0 and the YARN framework
  • Tez 0.4.1-incubating
  • Hive 0.13.1
  • Maven 3

Building and Installation of Tez on Ubuntu Machine:

As of now there is no binary tarball available for download from the Apache Tez releases page, so we have to build the Tez binary tarball from the appropriate source release available on the Apache Tez Releases page.

So, in order to build the binary tarball, we need to perform the activities below in the given order.

  • Download the Tez source files, tez-0.4.1-incubating-src.tar.gz, from the releases tab on the Apache download mirrors.
  • Extract this source tarball in any directory (say /usr/lib/tez) and navigate to the extracted directory. The terminal commands needed to perform these actions are shown below.
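
A minimal sketch of these commands, assuming the source tarball was downloaded to the current directory and that /usr/lib/tez is writable (adjust paths to your environment):

$ sudo mkdir -p /usr/lib/tez
$ sudo tar -xzf tez-0.4.1-incubating-src.tar.gz -C /usr/lib/tez
$ cd /usr/lib/tez/tez-0.4.1-incubating-src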

  • Now we need to change the pom.xml file in the tez-0.4.1-incubating-src directory to reflect our target Hadoop version. Tez 0.4.1 requires a minimum of Hadoop 2.2.0, and in this post we change it to Hadoop 2.3.0 as shown below.
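
In the top-level pom.xml, the Hadoop version property is updated roughly as follows; the exact property name in the Tez 0.4.1 pom should be verified in your copy:

<properties>
  ...
  <hadoop.version>2.3.0</hadoop.version>
  ...
</properties>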

  • Next we build the binary tarball from the source files with the command below. Here we skip the unit tests of the source programs to complete the build process quickly.
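
A typical Maven invocation for this step, skipping unit tests and javadoc generation, run from the source root:

$ mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true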

[Screenshot: Tez Maven build in progress]

[Screenshot: Maven BUILD SUCCESS output]

  • Once the Maven build succeeds, a tez-dist folder is created in the same directory. Now copy the Tez tarball tez-dist/target/tez-*-full.tar.gz (if this tar.gz file is not there, there may instead be a tez-0.4.1-*-full directory) into the preferred Tez installation directory (usually /usr/lib/tez) and extract it. Export this tez-0.4.1-*-full directory location as TEZ_HOME, as sketched below.
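
A hedged sketch of these steps, assuming the build produced the tar.gz variant under tez-dist/target (the exact file name may differ):

$ cp tez-dist/target/tez-0.4.1-incubating-full.tar.gz /usr/lib/tez/
$ cd /usr/lib/tez
$ tar -xzf tez-0.4.1-incubating-full.tar.gz
$ export TEZ_HOME=/usr/lib/tez/tez-0.4.1-incubating-full
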
Copy the Jar files into HDFS:

We need to copy the relevant jar files onto an HDFS directory. Let's create a directory /apps/tez-0.4.1 in HDFS and copy the jar files from the TEZ_HOME directory as shown below.
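
A minimal sketch, assuming TEZ_HOME is exported as above and the HDFS user may create /apps:

$ hadoop fs -mkdir -p /apps/tez-0.4.1
$ hadoop fs -copyFromLocal ${TEZ_HOME}/*.jar /apps/tez-0.4.1/
$ hadoop fs -mkdir -p /apps/tez-0.4.1/lib
$ hadoop fs -copyFromLocal ${TEZ_HOME}/lib/*.jar /apps/tez-0.4.1/lib/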

[Screenshot: Tez jars listed in the HDFS /apps/tez-0.4.1 directory]

Create tez-site.xml file:

Now we need to create a conf folder under the TEZ_HOME directory and create a new tez-site.xml file in it with the properties below, setting the HDFS path of the Tez jars in the tez.lib.uris property.
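
A sketch of tez-site.xml, mirroring the pattern of the reader-supplied configuration later on this page; ${fs.defaultFS} resolves to the configured namenode URI:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.4.1/,${fs.defaultFS}/apps/tez-0.4.1/lib/</value>
  </property>
</configuration>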

Add the entries below to the .bashrc file:

Open .bashrc with the $ gedit ~/.bashrc command, and add the environment variables below for Tez to the .bashrc file.
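
A sketch of the Tez-related entries, modeled on the reader-supplied .bashrc in the comments below (the TEZ_HOME path is an assumption; point it at your install location):

export TEZ_HOME=/usr/lib/tez/tez-0.4.1-incubating-full
export TEZ_CONF_DIR=$TEZ_HOME/conf
export TEZ_JARS=$TEZ_HOME
export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*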

Optional: Change mapred-site.xml or yarn-site.xml:

If we plan to run existing MapReduce jobs on the Tez framework, modify the mapred-site.xml (or yarn-site.xml) file to change the “mapreduce.framework.name” property from its default value of “yarn” to “yarn-tez”.
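
The relevant entry, matching the reader-supplied configuration in the comments below:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>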

Restart the cluster:

Now reopen the terminal (or source ~/.bashrc) to pick up the .bashrc changes, and also stop all the Hadoop and YARN daemons and start them again to pick up the configuration changes.
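
Assuming a standard Hadoop 2 sbin layout already on the PATH, the restart amounts to:

$ source ~/.bashrc
$ stop-yarn.sh
$ stop-dfs.sh
$ start-dfs.sh
$ start-yarn.sh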

Verify Installation by running an Example Tez job:

Let's verify the above installation by running the orderedwordcount example program from the tez-mapreduce-examples-0.4.1-incubating.jar file with the command below.
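
A sketch of the invocation; /in/wordcount.txt and /out are hypothetical HDFS input/output paths, and the examples jar is assumed to sit directly under TEZ_HOME:

$ hadoop jar $TEZ_HOME/tez-mapreduce-examples-0.4.1-incubating.jar orderedwordcount /in/wordcount.txt /out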

[Screenshot: orderedwordcount Tez job run]

On successful completion of the Tez job, we can see the counters of the Tez DAG job as shown in the screenshot below.

[Screenshot: Tez DAG job counters]

View the output results with the command below.
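
Assuming the hypothetical /out directory from the previous step (output file names may vary):

$ hadoop fs -cat /out/part*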

[Screenshot: Tez DAG job output]

As the results are correct, we have successfully set up Tez on Hadoop 2 and were able to run a sample Tez DAG job and verify its output.

In the next post on Tez, we will discuss Hive integration with Tez and run sample Hive queries on the Tez framework.


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain; has been involved with several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



4 thoughts on “Apache Tez – Successor of MapReduce Framework”

  • Avinash

    Thanks for publishing this in detail with the screenshots as well. I have followed the same steps which you have mentioned, but I am not able to install properly. The error that I am encountering is below.

    I have downloaded the build version of Tez. Apache Tez version: 0.8.4, Hadoop version: 2.6.0.

    My tez-site.xml is

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
    <property>
    <name>tez.lib.uris</name>
    <value>${fs.defaultFS}/apps/tez-0.8.4,${fs.defaultFS}/apps/tez-0.8.4/lib/</value>
    </property>
    </configuration>

    and my bashrc configuration is:

    export HADOOP_HOME=/usr/local/hadoop
    export HADOOP_INSTALL=$HADOOP_HOME
    export HADOOP_MAPRED_HOME=$HADOOP_HOME
    export HADOOP_COMMON_HOME=$HADOOP_HOME
    export HADOOP_HDFS_HOME=$HADOOP_HOME
    export YARN_HOME=$HADOOP_HOME
    export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
    export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    export PATH=$PATH:/usr/local/spark/bin
    export HIVE_HOME=/usr/local/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
    export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
    export DERBY_HOME=/usr/local/derby
    export PATH=$PATH:$DERBY_HOME/bin
    export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
    export HIVE_OPTS="-hiveconf mapreduce.map.memory.mb=4096 -hiveconf mapreduce.reduce.memory.mb=5120"
    export TEZ_HOME=/usr/local/apache-tez-0.8.4-bin
    export TEZ_CONF_DIR=$TEZ_HOME/conf
    export TEZ_JARS=$TEZ_HOME

    if [ -z "$HIVE_AUX_JARS_PATH" ]; then
    export HIVE_AUX_JARS_PATH="$TEZ_JARS"
    else
    export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH:$TEZ_JARS"
    fi

    export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*
    export CLASSPATH=$CLASSPATH:${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*:.

    my mapreduce-site.xml is

    <configuration>
    <!-- <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
    </property> -->
    <property>
    <name>mapreduce.framework.name</name>
    <value>yarn-tez</value>
    <description>The runtime framework for executing MapReduce jobs.
    Can be one of local, classic or yarn.
    </description>
    </property>
    </configuration>

    When I try to run the sample example program, it is returning the trace as

    Failing this attempt. Failing the application.
    16/07/27 12:52:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    16/07/27 12:52:00 INFO client.DAGClientImpl: DAG completed. FinalState=FAILED
    16/07/27 12:52:00 INFO examples.OrderedWordCount: DAG diagnostics: [Application application_1469604082434_0001 failed 2 times due to AM Container for appattempt_1469604082434_0001_000002 exited with exitCode: 1
    For more detailed output, check application tracking page:http://AnalyticsLinux.tcs.com:8088/proxy/application_1469604082434_0001/Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1469604082434_0001_02_000001
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1:
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Container exited with a non-zero exit code 1

    When I looked at http://localhost:8088 under stderr, I found the above.

    Please help me in resolving this. Thanks in advance!!

