Apache Tez Overview
What is Apache Tez?
Apache Tez is an execution framework from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework.
Apache Tez Features:
- Performance gains over MapReduce, with backward compatibility for the MapReduce framework.
- Optimal resource management
- Plan reconfiguration at run-time
- Dynamic physical data flow decisions
- Tez is a client-side application, so it is simple and easy to try out. No deployments are needed.
With Tez, data processing tasks that earlier took multiple MR jobs can now be done in a single Tez job. Tez also supports running existing MR jobs on top of the Tez framework, which gives existing MapReduce users an easy upgrade path.
End user advantages for Tez framework:
- Better performance of applications and predictability of results
- Reduced load on the distributed file system (HDFS) and reduced network usage.
[Figure: Tez layer on the Hadoop 2 architecture]
Tez Data Processing model:
In Tez parlance a map-reduce job is a simple DAG (Directed Acyclic Graph). Map and Reduce tasks are the vertices in the execution graph. An edge connects every map task to every reduce task.
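To make the shape of that graph concrete, here is a small sketch that lists the edges for a given number of map and reduce tasks. The function name and the labels map1, reduce1, and so on are made up for illustration; they are not Tez identifiers.

```shell
#!/usr/bin/env bash
# Sketch only: enumerate the edges of a map-reduce DAG in which every
# map task is connected to every reduce task.
# $1 = number of map tasks, $2 = number of reduce tasks.
mr_dag_edges() {
  local m r
  for m in $(seq 1 "$1"); do
    for r in $(seq 1 "$2"); do
      echo "map$m -> reduce$r"
    done
  done
}

# Two map tasks and three reduce tasks give 2 x 3 = 6 edges.
mr_dag_edges 2 3
```

With M map tasks and R reduce tasks the graph has M x R edges, which is why the shuffle between the two stages dominates network usage in a plain map-reduce job.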
Apache Tez Integration With Hadoop:
Building and running Tez requires the following:
- Apache Hadoop 2 with YARN
- Maven 3
- Protocol Buffers 2.5.0
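Before starting, it can help to confirm that these tools are on the PATH. The sketch below assumes the usual command names (hadoop, mvn, protoc); the helper function is made up for this post.

```shell
#!/usr/bin/env bash
# Sketch only: report whether each prerequisite command is on PATH.
# The command names hadoop, mvn and protoc are assumed defaults.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

for tool in hadoop mvn protoc; do
  check_tool "$tool"
done
```

For any tool reported as found, the installed version can be checked with hadoop version, mvn -version and protoc --version.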
The Tez engine is already available in the Hive-0.13.1 release, which requires the Tez-0.4.1 release to run the Tez engine from Hive.
Hive-0.13.1 integrates correctly only with Tez-0.4.1. If we try to integrate Hive-0.13.1 with Tez-0.5.0 or Tez-0.5.1, we receive many error messages due to version inconsistency.
In this post we use the following components and versions to install Tez on Hadoop without any version inconsistencies.
- Ubuntu machine with Hadoop 2.3.0 and the YARN framework
- Tez-0.4.1-incubating version
- Hive-0.13.1 release
Building and Installation of Tez on Ubuntu Machine:
At the time of writing, no binary tarball is available for download from the Apache Tez releases, so we have to build the Tez binary tarball from the appropriate source release on the Apache Tez Releases page.
To build the binary tarball, perform the following steps in the given order.
- Download the Tez source tarball, tez-0.4.1-incubating-src.tar.gz, from the releases tab on the Apache download mirrors.
- Extract this source tarball in any directory (say /usr/lib/tez) and navigate to the extracted directory.
- Edit the pom.xml file in the tez-0.4.1-incubating-src directory to reflect the target Hadoop version. Tez requires at least hadoop-2.2.0; in this post we change it to hadoop-2.3.0.
- Build the binary tarball from the source files with Maven. We skip the unit tests so that the build completes quickly.
- Once the Maven build succeeds, a tez-dist folder is created in the same directory. Copy the Tez tarball tez-dist/target/tez-*-full.tar.gz (if this tar.gz file is not there, we may find a tez-0.4.1-*-full directory instead) into the preferred Tez installation directory (usually /usr/lib/tez) and extract it. Export the location of this tez-0.4.1-*-full directory as TEZ_HOME.
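The steps above can be sketched as a few shell functions. This is not the exact command listing for this post: the <hadoop.version> property name and the Maven flags are assumptions to verify against your own source tree, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: nothing runs until a function is called.

# Unpack a .tar.gz file ($1) into a directory ($2), creating it if needed.
unpack_tarball() {
  mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Rewrite the <hadoop.version> property in a pom.xml ($1) to version $2.
# Assumes the pom declares the Hadoop version in that property.
set_hadoop_version() {
  sed "s|<hadoop.version>[^<]*</hadoop.version>|<hadoop.version>$2</hadoop.version>|" \
    "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Build from the source directory ($1), skipping unit tests and javadoc.
build_tez() {
  (cd "$1" && mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true)
}

# Example calls, using the directory names from this post:
#   unpack_tarball tez-0.4.1-incubating-src.tar.gz /usr/lib/tez
#   set_hadoop_version /usr/lib/tez/tez-0.4.1-incubating-src/pom.xml 2.3.0
#   build_tez /usr/lib/tez/tez-0.4.1-incubating-src
```

The same unpack_tarball helper can be reused for the last step, to extract the built tez-*-full.tar.gz into the installation directory.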
Copy the Jar files into HDFS:
We need to copy the relevant jar files into an HDFS directory. Create a directory /apps/tez-0.4.1 in HDFS and copy the jar files from the TEZ_HOME directory into it.
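A minimal sketch of this step is shown here, using the standard hadoop fs -mkdir and -copyFromLocal commands. The function name and the HADOOP variable are made up for illustration; the variable allows a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch only. HADOOP defaults to the real CLI; set HADOOP="echo hadoop"
# to print the commands instead of running them.
HADOOP="${HADOOP:-hadoop}"

# Create the HDFS directory ($2) and copy everything under the local
# Tez directory ($1) into it.
upload_tez_jars() {
  local tez_home="$1" hdfs_dir="$2"
  $HADOOP fs -mkdir -p "$hdfs_dir" &&
    $HADOOP fs -copyFromLocal "$tez_home"/* "$hdfs_dir"/
}

# Example call, with TEZ_HOME exported as described earlier:
#   upload_tez_jars "$TEZ_HOME" /apps/tez-0.4.1
```

Afterwards, hadoop fs -ls /apps/tez-0.4.1 should list the copied jars and the lib directory.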
Create tez-site.xml file:
Now we need to create a conf folder under the TEZ_HOME directory and, inside it, a new tez-site.xml file that sets the property tez.lib.uris to the HDFS path of the Tez jars.
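One way to generate this file is a small shell function with a here-document, sketched below. The exact value is an assumption: it lists both the base directory and its lib/ subdirectory, prefixed with Hadoop's ${fs.default.name} variable, so adjust it to match where the jars actually sit in HDFS. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: write a minimal tez-site.xml into a conf directory ($1),
# pointing tez.lib.uris at an HDFS directory ($2) and its lib/ subdirectory.
# \${fs.default.name} is written literally so that Hadoop expands it.
write_tez_site() {
  local conf_dir="$1" hdfs_dir="$2"
  mkdir -p "$conf_dir"
  cat > "$conf_dir/tez-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>tez.lib.uris</name>
    <value>\${fs.default.name}${hdfs_dir},\${fs.default.name}${hdfs_dir}/lib/</value>
  </property>
</configuration>
EOF
}

# Example call, with TEZ_HOME exported as described earlier:
#   write_tez_site "$TEZ_HOME/conf" /apps/tez-0.4.1
```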
Add below entries to .bashrc file:
Open .bashrc with the $ gedit ~/.bashrc command and add the environment variables for Tez, including TEZ_HOME, to the file.
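A sketch of what these entries could look like is shown here. Only TEZ_HOME is named in this post; TEZ_CONF_DIR, TEZ_JARS and the HADOOP_CLASSPATH line are assumptions based on a common Tez setup, and the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: print export lines for a Tez install directory ($1).
# The \$ escapes keep the variable references literal in the output,
# so they are resolved when .bashrc is sourced, not when this runs.
tez_env_lines() {
  cat <<EOF
export TEZ_HOME="$1"
export TEZ_CONF_DIR="\$TEZ_HOME/conf"
export TEZ_JARS="\$TEZ_HOME"
export HADOOP_CLASSPATH="\$TEZ_CONF_DIR:\$TEZ_JARS/*:\$TEZ_JARS/lib/*\${HADOOP_CLASSPATH:+:\$HADOOP_CLASSPATH}"
EOF
}

# Example (the path is a placeholder for your tez-0.4.1-*-full directory):
#   tez_env_lines /path/to/tez-full >> ~/.bashrc
#   source ~/.bashrc
```

The quoted /* entries are left unexpanded on purpose, since Java treats a trailing * in a classpath entry as "all jars in this directory".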