In this post we give a basic introduction to Sqoop and walk through installing Sqoop on an Ubuntu machine. An example run of Sqoop against a MySQL database follows in the next post.
What is Sqoop?
Sqoop is an open source tool that enables users to transfer bulk data between the Hadoop ecosystem and relational databases. Here the Hadoop ecosystem includes HDFS, Hive, HBase, HCatalog, etc. The relational databases supported at this time are MySQL, PostgreSQL, Oracle, SQL Server and DB2.
Apache Sqoop is a top-level open source project of the Apache Software Foundation.
Sqoop can be used both to import data from external structured databases into HDFS or related systems like Hive and HBase, and to export data from Hadoop to external relational databases and enterprise data warehouses.
It is a mechanism for unlocking Hadoop for relational database users.
In many ways Sqoop is similar to Hadoop’s distcp (distributed copy, which moves data efficiently between clusters). Both are built on top of MapReduce and take advantage of its parallelism and fault tolerance, and both run map-only tasks in parallel. The difference is that instead of moving data between clusters, Sqoop is designed to move data between Hadoop and relational databases, in either direction.
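The import and export directions, and the map-only parallelism described above, can be sketched roughly as follows. This is only an illustration: the database URL, user, table name and HDFS path are hypothetical, and the commands are built as strings and printed rather than executed. The options used (--connect, --username, -P, --table, --target-dir, --export-dir, --num-mappers) are standard Sqoop1 options.

```shell
# All values below are hypothetical placeholders, not taken from this post.
DB_URL="jdbc:mysql://localhost:3306/testdb"
DB_USER="sqoop_user"
TABLE="employees"
HDFS_DIR="/user/hadoop/employees"
MAPPERS=4   # number of parallel map tasks Sqoop should launch

# Import: copy a relational table into an HDFS directory.
# -P makes Sqoop prompt for the database password at run time.
IMPORT_CMD="sqoop import --connect $DB_URL --username $DB_USER -P --table $TABLE --target-dir $HDFS_DIR --num-mappers $MAPPERS"

# Export: push the files under an HDFS directory back into a table.
EXPORT_CMD="sqoop export --connect $DB_URL --username $DB_USER -P --table $TABLE --export-dir $HDFS_DIR"

# Print the commands so they can be reviewed before running them by hand.
echo "$IMPORT_CMD"
echo "$EXPORT_CMD"
```

When more than one mapper is used, Sqoop needs a column to split the table on, which is the primary key by default or a column named with --split-by.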
Sqoop was originally developed by Cloudera and later contributed to the Apache open source community, where it is now a top-level project. Sqoop has two versions, Sqoop1 and Sqoop2, but at the time of this writing Sqoop2 is not suitable for production deployment, so we will discuss Sqoop1 only.
The name Sqoop comes from SQL + Hadoop, reflecting its role as a bridge between SQL databases and the Hadoop ecosystem.
Sqoop Installation on Ubuntu:
In this section we will install Sqoop 1.4.5 (the latest stable version at the time of writing) on an Ubuntu machine and configure it to run on a Hadoop cluster.
Installing Sqoop on an Ubuntu machine is simple and straightforward.
- Download the gzipped Sqoop binary tarball from the Apache download mirrors on the Sqoop site. The file name has the format sqoop-<version>.bin__hadoop-<version>-<alpha>.tar.gz
- Copy this tarball into the preferred installation directory (usually /usr/lib/sqoop) and extract its contents.
- Set the SQOOP_HOME environment variable to the Sqoop installation directory in the .bashrc file and add its bin directory to $PATH.
- Sqoop requires JDBC drivers for specific database servers such as MySQL and Oracle. Due to licensing issues, Sqoop doesn’t bundle the JDBC drivers for these external databases. Some of the drivers are available free of charge from the database vendors’ websites. For example, the JDBC driver for MySQL can be downloaded from the MySQL Connectors download page. For Ubuntu, download the platform-independent version of this connector, extract the mysql-connector-java-<version>.tar.gz file, and copy the mysql-connector-java-<version>-bin.jar file into the SQOOP_HOME/lib directory.
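The steps above can be sketched in shell roughly as follows. This is a minimal sketch under assumptions: the helper names install_sqoop and add_jdbc_driver are hypothetical, the tarball and jar are assumed to be downloaded already, and /usr/lib/sqoop is the installation directory suggested above.

```shell
# Extract a downloaded Sqoop tarball into an installation directory.
# $1: path to the tarball, $2: installation directory (e.g. /usr/lib/sqoop)
install_sqoop() {
  mkdir -p "$2"
  # --strip-components=1 drops the top-level sqoop-<version>... folder,
  # so bin/, lib/ and conf/ land directly under the installation directory.
  tar -xzf "$1" -C "$2" --strip-components=1
}

# Copy a JDBC driver jar into Sqoop's lib directory.
# $1: path to the driver jar, $2: Sqoop installation directory
add_jdbc_driver() {
  mkdir -p "$2/lib"
  cp "$1" "$2/lib/"
}

# Lines to append to ~/.bashrc (assumes the /usr/lib/sqoop location).
export SQOOP_HOME=/usr/lib/sqoop
export PATH="$PATH:$SQOOP_HOME/bin"
```

Writing under /usr/lib normally requires root privileges, so the two helpers would typically be run from a root shell or replaced by the equivalent sudo commands. After editing .bashrc, open a new terminal or run `source ~/.bashrc` so the variables take effect.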
Verify Sqoop Installation:
Let's verify the Sqoop installation with the following command.
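A minimal sketch of such a check, assuming the sqoop launcher from SQOOP_HOME/bin is on the PATH. `sqoop version` is a standard Sqoop subcommand that prints the installed version; the check_sqoop wrapper name is hypothetical and only adds a clearer message when the launcher cannot be found.

```shell
# Print the Sqoop version if the launcher is reachable; otherwise explain why not.
check_sqoop() {
  if command -v sqoop >/dev/null 2>&1; then
    sqoop version
  else
    echo "sqoop not found on PATH; re-check SQOOP_HOME and PATH in ~/.bashrc" >&2
    return 1
  fi
}
```

Run `check_sqoop` (or simply `sqoop version`) in a new terminal so that the updated .bashrc settings are picked up.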
If Sqoop responds with its normal output rather than a "command not found" error, then our Sqoop installation on Ubuntu is successful. We will discuss example Sqoop import and export commands in the next post.