Sqoop Interview Cheat Sheet

Install sqoop

sudo yum install sqoop

  1. sudo apt-get install sqoop (on Debian/Ubuntu systems)
  2. Sqoop is run from a normal command prompt
  3. Sqoop config file: sqoop-site.xml

install jdbc drivers

After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re
using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory
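For example, with the MySQL Connector/J driver (the jar file name below is illustrative):

# copy the JDBC driver jar into Sqoop's lib directory (package install layout)
cp mysql-connector-java-5.1.38-bin.jar /usr/lib/sqoop/lib/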

sqoop list-databases
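A minimal sketch of listing the databases on a MySQL server (hostname and credentials are assumptions):

# list all databases visible to the given user; -P prompts for the password
sqoop list-databases \
  --connect jdbc:mysql://localhost/ \
  --username root -P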

import

sqoop import
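A minimal import sketch, assuming a MySQL database named employees_db and a table named employees (connection details and credentials are assumptions):

# import the employees table into HDFS (defaults to a directory named after the table
# under the user's home directory)
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees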

Results are imported into HDFS under the user's home directory (e.g. /user/cloudera/employees).

In the employees directory, 4 files are created (one per mapper):

  • part-m-00000
  • part-m-00001
  • part-m-00002
  • part-m-00003

import all
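A sketch of importing every table from a database with import-all-tables (database name and warehouse directory are assumptions):

# import all tables; each table lands in its own subdirectory of the warehouse dir
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --warehouse-dir /user/cloudera/warehouse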

 

mappers (-m 1)

sqoop import
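A sketch of forcing a single mapper with -m 1 (useful for small tables or tables without a primary key; names are assumptions):

# run the whole import in one map task
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  -m 1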

 

Subset of Data
--where

sqoop import
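A sketch of importing only a subset of rows with --where (table and predicate are assumptions):

# only rows matching the WHERE predicate are transferred
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --where "country = 'USA'"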

 

part-m-00000: 179,973 records.
When using the --where parameter, keep in mind the parallel nature of Sqoop transfers.
Data will be transferred in several concurrent tasks. Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel, which will adversely affect transfer performance. For efficient advanced filtering, run the filtering query on your database prior to import, save its output to a temporary table, and run
Sqoop to import the temporary table into Hadoop without the --where parameter.

boundary queries
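Sqoop uses the boundary query's MIN/MAX results to split the import range across mappers. A minimal sketch with a custom boundary query (table, column, and connection details are assumptions):

# provide explicit boundaries for the split column instead of letting Sqoop compute them
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --split-by id \
  --boundary-query "SELECT MIN(id), MAX(id) FROM employees"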

 

 

 

 

By default, when using the --compress parameter, output files will be compressed using the GZip codec, and all files will end up with a .gz extension.
You can choose any other codec using the --compression-codec parameter.
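A sketch combining --compress with an explicit codec (here BZip2; connection details are assumptions):

# compressed import; output part files will carry the codec's extension (.bz2)
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec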

other compression codecs
BZip2 produces files with a .bz2 extension.

As Sqoop delegates compression to the MapReduce engine, you need to make sure the compressed map output is allowed in your Hadoop configuration. For example, if in the mapred-site.xml file, the property
mapred.output.compress is set to false with the final flag, then Sqoop won't be able to compress the output files even when you call it with the --compress parameter.
The selected compression codec might have a significant impact on subsequent processing.
Some codecs do not support seeking to the middle of the compressed file without reading all previous content, effectively preventing Hadoop from processing the input files in a parallel manner. You should use a splittable codec for data that you’re planning to use in subsequent processing.

Compression codecs:

  • Splittable: BZip2, LZO
  • Not splittable: GZip, Snappy

 

encoding null values

Sqoop encodes database NULL values using the string constant null. Your downstream processing (Hive queries, custom MapReduce jobs, or Pig scripts) may use a different constant for encoding missing values.
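The encoding can be overridden with the --null-string (text columns) and --null-non-string (all other column types) parameters. A sketch using Hive's \N convention (connection details are assumptions):

# emit \N for NULLs so Hive recognizes them as missing values
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table employees \
  --null-string '\\N' \
  --null-non-string '\\N'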

sharing metastore

Sqoop’s metastore can easily be started as a service with the following command:
sqoop metastore
Other clients can connect to this metastore by specifying the parameter --meta-connect on the command line with the URL of this machine. For example, to create a new saved job in a remote metastore running on another host, see the sketch below.
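A sketch of saving a job into a remote metastore (the metastore host and job details are assumptions; 16000 is the default metastore port):

# register the saved job "visits" in the shared metastore instead of the local one
sqoop job \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  --create visits \
  -- import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table visits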

 

The content of each <arg> tag is considered to be one parameter regardless of how many spaces it contains; this is especially useful for entering queries, as <arg>SELECT * FROM cities</arg> is considered to be one single parameter. Having spaces inside a <command> tag might not be obvious, especially when you're using variables to parametrize your workflow. The preferred way to use Sqoop in Oozie is with <arg> tags.

 

install jdbc oozie

Sqoop works correctly when executed from the command line, but in Oozie it cannot
find the JDBC drivers.
Solution
You need to install the JDBC drivers into Oozie separately. You have two options: install
the driver either into your workflow’s lib/ directory or into the shared action library
location usually found at /user/oozie/share/lib/sqoop/.
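A sketch of installing the MySQL driver into the shared action library (jar name and HDFS path are illustrative):

# make the JDBC driver available to Sqoop actions launched by Oozie
hdfs dfs -put mysql-connector-java-5.1.38-bin.jar /user/oozie/share/lib/sqoop/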

 

The property sqoop.export.records.per.statement specifies the number of records that will be used in each INSERT statement. Finally, you can set how many rows will be inserted per transaction with the sqoop.export.statements.per.transaction property:
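A sketch passing both properties on an export command line (values, table, and paths are assumptions):

# -D generic options must come right after the tool name
sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table cities \
  --export-dir /user/cloudera/cities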

 

staging table

You can use a staging table to first load data to a temporary table before making changes
to the real table. The staging table name is specified via the --staging-table parameter, as in the example below.
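A sketch of an export through a staging table (table names and paths are assumptions):

# data is loaded into staging_cities first, then moved into cities in one transaction
sqoop export \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table cities \
  --staging-table staging_cities \
  --export-dir /user/cloudera/cities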

updating dataset

The ability to conditionally insert a new row or update an existing one is an advanced database feature known as upsert. This feature is not available on all database systems nor supported by all Sqoop connectors. Currently it’s available only for Oracle and nondirect MySQL exports.
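A sketch of an upsert export using --update-mode allowinsert (table, key column, and paths are assumptions):

# update rows matching the key, insert the rest
sqoop export \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table cities \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /user/cloudera/cities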

stored procedure
Using Stored Procedures

 

Instead of the --table parameter, use the --call parameter with the name of the stored procedure. For example, when exporting into MySQL,
Sqoop uses the following query:
Sqoop uses the following query:
CALL populate_cities(?, ?, ?)
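A sketch of exporting through the stored procedure named above (connection details and export directory are assumptions):

# every exported row becomes one CALL of the stored procedure
sqoop export \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --call populate_cities \
  --export-dir /user/cloudera/cities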

import into hive

Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level. Unfortunately, you can't use this feature if your table has more than one level of partitioning (e.g., if you would like a partition by day followed by a partition by hour).

delims

You can instruct Sqoop to automatically clean your data using --hive-drop-import-delims, which will remove all \n, \t, and \01 characters from all string-based columns.
If removing the special characters is not an option in your use case, you can take advantage of the parameter --hive-delims-replacement, which will accept a replacement string.
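A sketch of a Hive import that drops the special characters (swap --hive-drop-import-delims for --hive-delims-replacement ' ' to replace them instead; table and connection details are assumptions):

# import straight into Hive, stripping delimiter characters from string columns
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table cities \
  --hive-import \
  --hive-drop-import-delims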

 

Problem

Imports into HBase take significantly more time than importing as text files in HDFS.
Solution
Create your HBase table prior to running Sqoop import, and instruct HBase to create
more regions with the parameter NUMREGIONS. For example, you can create the HBase
table cities with the column family world and 20 regions using the following
command:
hbase> create 'cities', 'world', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}
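A sketch of the matching Sqoop import into the pre-created HBase table (the row key column and connection details are assumptions):

# load the cities table into the HBase table created above
sqoop import \
  --connect jdbc:mysql://localhost/employees_db \
  --username root -P \
  --table cities \
  --hbase-table cities \
  --column-family world \
  --hbase-row-key id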
How Sqoop works

Sqoop will connect to your Hadoop cluster and submit a MapReduce job. Each mapper of the job will then transfer a slice of the table's data. As MapReduce executes multiple mappers at the same time, Sqoop will be transferring data in parallel to achieve the best possible performance by utilizing the potential of your database server. Each mapper transfers the table's data directly between the database and the Hadoop cluster.
To avoid becoming a transfer bottleneck, the Sqoop client acts as the overseer rather than as an active participant in transferring the data. This is a key tenet of Sqoop's design.

Filters

HBase filters are a powerful feature that can greatly enhance your effectiveness when working with data stored in tables.

Most concrete filter classes are direct descendants of FilterBase. CompareFilter-based filters add one more feature to the base FilterBase class, namely the compare() operation, which requires a user-supplied operator type that defines how the result of the comparison is interpreted.

Why Hbase?

HBase is a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions).

HBase is good for OLTP

HBase can be used like a key-value store, where a single key is used for each row to provide very fast reads and writes of the row’s columns or column families. HBase also keeps a configurable number of versions of each column’s values (marked by timestamps), so it’s possible to go “back in time” to previous values, when needed.

Sqoop Hive Use Case Example

This is another use case on Sqoop and Hive concepts.

Hive Use Case Example

Problem Statement

There are about 35,000 crime incidents that happened in the city of San Francisco in the last 3 months.

Our task is to store this relational data in an RDBMS and use Sqoop to import it into Hadoop.

We should then be able to answer the following queries on this data:

  •   Relative frequencies of different types of crime incidents
  •   Crime occurrence frequency as a function of day of the week

https://data.sfgov.org/Public-Safety/SFPD-Incidents-Previous-Three-Months/tmnf-yvry

Step-1:

Download the data into .csv format

Step-2:

Copy the file into HDFS, as shown in the command below:
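A sketch of the copy (local file name and HDFS path are assumptions):

# create a target directory and upload the downloaded CSV
hdfs dfs -mkdir -p /user/cloudera/sfpd
hdfs dfs -put sfpd_incidents.csv /user/cloudera/sfpd/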

Step-3:

If you observe the data, the 'description' column is surrounded by quotes and contains commas. So while exporting that data into the RDBMS, we need to remove the quotes and must not treat commas inside 'description' as delimiters. This can be achieved with the help of the CSVSerde in Hive. Below are high-level notes on it.

To Setup the CSVSerde, do the following:

  1. Download the CSV SerDe jar from https://github.com/ogrodnek/csv-serde.
  2. In the Hive CLI, submit "add jar <path-to-serde-jar>".
  3. The "list jars" command should display the CSV SerDe if it has been added successfully.
  4. To add the CSV SerDe to an existing table, alter the table to use the SerDe.
  5. To add the CSV SerDe to a new table, declare it in the ROW FORMAT SERDE clause.
  6. The default separator, quote, and escape characters from the opencsv library are the comma (,), double quote ("), and backslash (\).
  7. We can use SERDEPROPERTIES to override the default escape, quote, and separator characters. For example, the following overrides the defaults, using the tab character instead of a comma for the separator, a single quote instead of a double quote, and two backslashes for the escape character:
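A sketch of item 7 (table name and columns are assumptions; the SerDe class name comes from the csv-serde project):

hive> CREATE TABLE my_table (a STRING, b STRING)
      ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
      WITH SERDEPROPERTIES (
        "separatorChar" = "\t",
        "quoteChar"     = "'",
        "escapeChar"    = "\\"
      )
      STORED AS TEXTFILE;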

To achieve this, we first create a Hive table and load the data into it, as sketched below.
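A sketch of the table creation and load, using the CSVSerde with its default comma/double-quote handling (table name, column list, jar path, and HDFS path are assumptions, abbreviated from the SFPD dataset):

hive> ADD JAR /home/cloudera/csv-serde-1.1.2-0.11.0-all.jar;
hive> CREATE TABLE sfpd_incidents (
        incident_num  STRING,
        category      STRING,
        descript      STRING,
        day_of_week   STRING,
        incident_date STRING)
      ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
      STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/cloudera/sfpd/sfpd_incidents.csv' INTO TABLE sfpd_incidents;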

Step-4:

Now we copy the data from the Hive table back to HDFS, so that the commas inside the description are no longer treated as field delimiters.
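A sketch of writing the cleaned rows back to HDFS (the output directory is an assumption; Hive writes Ctrl-A (\001) delimited text here):

hive> INSERT OVERWRITE DIRECTORY '/user/cloudera/sfpd_cleaned'
      SELECT * FROM sfpd_incidents;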

Step-5:

Now export the data into the RDBMS using Sqoop:
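A sketch of the export, matching the \001 delimiter produced in the previous step (database, table, and paths are assumptions):

# push the cleaned HDFS data into the relational table
sqoop export \
  --connect jdbc:mysql://localhost/sfpd_db \
  --username root -P \
  --table sfpd_incidents \
  --export-dir /user/cloudera/sfpd_cleaned \
  --input-fields-terminated-by '\001'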

Step-6:

Final output:

  • Relative frequencies of different types of crime incidents

  • Crime occurrence frequency as a function of day of the week
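One way to compute the two results in Hive (table and column names are assumptions carried over from the earlier sketch):

hive> SELECT category, COUNT(*) AS cnt
      FROM sfpd_incidents
      GROUP BY category
      ORDER BY cnt DESC;
hive> SELECT day_of_week, COUNT(*) AS cnt
      FROM sfpd_incidents
      GROUP BY day_of_week
      ORDER BY cnt DESC;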

Sqoop Export Commands

Below are some Sqoop Export Commands and Other Miscellaneous commands

Sqoop-export

It is nothing but exporting data from HDFS to a database. To use the export command, the target table must already exist in the database; only then will Sqoop's export functionality work.

The export command works in two modes:
1. insert
2. update

1. insert: insert mode inserts the new records from HDFS into the RDBMS table.

command:
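A sketch of an insert-mode export (database, table, and export directory are assumptions):

# plain insert-mode export of an HDFS directory into an existing table
sqoop export \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table employee \
  --export-dir /user/cloudera/employee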

Note: if a record with the same primary key is already present in the database table, a MySQLIntegrityConstraintViolationException will be raised.

2. update: update mode updates the records in the RDBMS from the HDFS data.

Update mode only updates already existing records; it will not insert new records into the RDBMS.

command:
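A sketch of an update-mode export, matching rows on the update key (key column and other details are assumptions):

# only rows whose key matches an existing database row are updated
sqoop export \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table employee \
  --update-key id \
  --export-dir /user/cloudera/employee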

Sqoop-JOB

A job is simply a saved Sqoop command that can be executed any number of times when required. The job command allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.

The following are Sqoop Job commands,

Create job:
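A sketch of creating a saved job (job name and import arguments are illustrative):

# save an import definition under the handle "myjob"
sqoop job --create myjob \
  -- import \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table employee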

Note:
There should be a space between '--' and 'import'.

list jobs: shows all saved jobs.

inspect job: shows details about a saved job.

delete job: deletes an existing job.

execute job: executes a saved job. The corresponding commands are sketched below.
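Sketches of the job commands above (the job name myjob is illustrative; inspection is done with --show):

sqoop job --list
sqoop job --show myjob
sqoop job --delete myjob
sqoop job --exec myjob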

Sqoop-Merge

The merge command in Sqoop merges two existing datasets in HDFS into a single new dataset in HDFS.

Let us discuss merge functionality with the example:

The below are tables in RDBMS,

Table 1: emp_1

Table 2: emp_2

now,

Importing emp_1 into HDFS

Importing emp_2 into HDFS
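A sketch of the two imports, each into its own target directory (connection details and paths are assumptions):

# import both source tables into separate HDFS directories
sqoop import --connect jdbc:mysql://localhost/test_db --username root -P \
  --table emp_1 --target-dir /user/cloudera/emp_1 -m 1
sqoop import --connect jdbc:mysql://localhost/test_db --username root -P \
  --table emp_2 --target-dir /user/cloudera/emp_2 -m 1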

After importing emp_1 and emp_2 into HDFS, Sqoop generates a JAR in our local file system. With the help of that JAR we are able to merge the two tables in HDFS.
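A sketch of the merge itself, using the generated JAR and class (jar path, class name, and merge key are assumptions):

# rows from --new-data override rows in --onto when their merge keys match
sqoop merge \
  --new-data /user/cloudera/emp_2 \
  --onto /user/cloudera/emp_1 \
  --target-dir /user/cloudera/emp_merged \
  --jar-file /tmp/sqoop-cloudera/compile/emp_2.jar \
  --class-name emp_2 \
  --merge-key id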

1. Create a database and table as shown below.

2. Inserted 10,000+ records as shown below.

Note: I have used mysql import to load the 10,000+ records, as below.

3. Run the below sqoop command to import from MySQL to the HDFS location.

It created 4 map jobs and stored the output in part files in the HDFS location.

4. Updated some records randomly.

5. Used the below command to pick the updated records and merge them into a single file.
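A sketch of an incremental lastmodified import that folds the updated rows into the existing directory via --merge-key (column names, timestamp, and paths are assumptions):

# pull rows changed since the last value and merge them into the existing target dir
sqoop import \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table emp \
  --target-dir /user/cloudera/emp \
  --incremental lastmodified \
  --check-column last_update \
  --last-value "2016-01-01 00:00:00" \
  --merge-key id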

Sqoop-import-hive

Sqoop provides direct import into Hive. There are two ways to import into Hive:

1. create-hive-table: used when the data has already been imported into HDFS but not into Hive.

With the above command, Sqoop creates an empty Hive table with the DB column properties. To load the data we need to use a Hive command, as sketched below:
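A sketch of the two steps: create the Hive table definition with Sqoop, then load the already-imported HDFS data with Hive (names and paths are assumptions):

# build the Hive table from the RDBMS schema, without moving any data yet
sqoop create-hive-table \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table employee \
  --hive-table employee

hive> LOAD DATA INPATH '/user/cloudera/employee' INTO TABLE employee;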

2. hive-import: a direct import from the RDBMS into Hive.

Sqoop Import Command Arguments

In this post we will discuss one of the important commands in Apache Sqoop: the import command and its arguments, with examples. This documentation is applicable for Sqoop version 1.4.5 or later, because earlier versions don't support some of the arguments mentioned below.

As of Sqoop version 1.4.5, the import command supports a number of arguments to import relational database tables into the below tools or services.

  • HDFS
  • Hive
  • HBase
  • HCatalog
  • Accumulo

There are specific arguments for these services, apart from

  • Common arguments
  • Control arguments
  • Input parsing arguments
  • Output line formatting arguments
  • Incremental import arguments.

Let's discuss each of these arguments with a brief explanation.

Sqoop Import Command Arguments

First, let's discuss the common arguments to the Sqoop import command. Each argument in sqoop commands is prefixed by '--' (a double hyphen). In the listing below, mandatory arguments are noted in the description.

Common arguments

--connect <jdbc-uri>
Specifies the JDBC connect string containing the hostname or IP address (optionally with a port) followed by the database name. It is a mandatory argument.
Example: --connect jdbc:mysql://localhost/test_db (connects to MySQL on localhost, database test_db)

--connection-manager <class>
Specifies the connection manager class name. Optional.
Example: --connection-manager org.apache.sqoop.manager.GenericJdbcManager

--connection-param-file <file>
Optionally specifies extra JDBC parameters via a properties file. This file is parsed as standard Java properties and passed to the driver while creating a connection.

--driver <class>
Manually specifies the JDBC driver class to use.
Example: --driver oracle.jdbc.OracleDriver

--hadoop-home <dir>
By default Sqoop uses $HADOOP_HOME to set $HADOOP_COMMON_HOME. We can override this value with this argument to use a custom Hadoop installation instead of the default from $HADOOP_HOME.

--hadoop-mapred-home <dir>
Same as above for $HADOOP_MAPRED_HOME; specifies the path of the Hadoop MapReduce installation directory.

--username and (--password or -P)
Specify the username and password of the RDBMS to retrieve the tables.
Example: --username u1 --password 12345

Notes

  • We should not use the URL localhost in the connect string if we intend to use Sqoop with a distributed Hadoop cluster. The connect string we supply will be used on data nodes throughout our cluster; if we specify the literal name localhost, each node will connect to a different database on its own localhost (or, more likely, no database at all). Instead, we should use the full hostname or IP address of the database host that can be seen by all our remote nodes.
  • The --password parameter is insecure, as other users may be able to read our password from the command-line arguments via the output of programs such as ps. The -P argument (which prompts for the user password) reads a password from a console prompt and is the preferred method of entering credentials. Credentials may still be transferred between nodes of the MapReduce cluster using insecure means. The most secure way is the --password-file <file containing the password> method: set the authentication password in this file in the user's home directory with 400 permissions.
  • The jar containing the driver class (e.g. com.mysql.jdbc.Driver) should be copied into the $SQOOP_HOME/lib directory, otherwise an exception will be thrown.

Below is an example sqoop import command covering several of the arguments discussed above.
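A sketch combining several of the common arguments (values are illustrative):

# explicit driver class plus interactive password prompt
sqoop import \
  --connect jdbc:mysql://localhost/test_db \
  --driver com.mysql.jdbc.Driver \
  --username u1 -P \
  --table user_data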

Sqoop Importing MySQL Data into HDFS

In this post, we will create a new database in the MySQL server, create a table and insert data into it, and then import the MySQL data into HDFS via the Sqoop tool.

We assume that MySQL is installed and that Sqoop and Hadoop are installed on the local machine to test this example. We need to make sure that the MySQL JDBC connector jar file is downloaded and copied into the $SQOOP_HOME/lib directory.

Prepare MySQL Data

Log in to MySQL, create a new database and a new table, and load some sample data, in the sequence shown below. In this example we will create a sample table with the below columns:

  • first_name
  • last_name
  • company_name
  • address
  • country
  • city
  • state

Login to MySQL as root:

Create new Database test_db:

Grant all privileges on this database to all users at localhost (or at the IP addresses of the client machines; repeat the same command with different IP addresses if more client machines need to interact with this MySQL server database).

MySQL Database Creation

As test_db is permitted to be used by any user at localhost, we can now log in to MySQL directly without root credentials, as sketched below.
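A sketch of the MySQL preparation steps above (the anonymous-user grant mirrors the "any user at localhost" wording and is an assumption):

mysql -u root -p
mysql> CREATE DATABASE test_db;
mysql> GRANT ALL PRIVILEGES ON test_db.* TO ''@'localhost';
mysql> FLUSH PRIVILEGES;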

Sample User data input file used in this example is available –> userdata.

Bulk data load into MySQL table
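A sketch of the bulk load (the file name userdata.csv and the table name user_data are assumptions):

mysql> LOAD DATA LOCAL INFILE 'userdata.csv' INTO TABLE user_data
       FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';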
Importing MySQL Data Into HDFS via Sqoop

 

Sqoop submits map-only jobs to the MapReduce framework for copying the data from relational databases to Hadoop. By default Sqoop submits 4 map tasks, and the input data from the MySQL table is divided equally across all 4 map tasks based on the primary key of the MySQL table.

As we are running on localhost, we have set the number of map tasks to 2 with the -m 2 option.

Submit the below Sqoop import command to copy the user_data table into the HDFS /table/userdata location.
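A sketch of the import command (credentials are assumptions; -m 2 limits the job to two mappers):

sqoop import \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table user_data \
  --target-dir /table/userdata \
  -m 2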

Storing as Text file

Sqoop Import

Let's verify the output data in HDFS.

Verify Sqoop Import output
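A couple of HDFS commands to verify the output (paths match the import sketch above):

# list the part files, then peek at the first few imported rows
hdfs dfs -ls /table/userdata
hdfs dfs -cat /table/userdata/part-m-00000 | head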

Using a File Format Other Than Text file

Storing as Sequence file

We need to use the --as-sequencefile option in the sqoop import command to store it as a sequence file.
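A sketch (only the output directory and the --as-sequencefile flag differ from the text-file import above):

sqoop import \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table user_data \
  --target-dir /table/userdata_seq \
  --as-sequencefile \
  -m 2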

Sqoop Import Sequencefile Output

Storing as Avrodata file

We need to use the --as-avrodatafile option in the sqoop import command to store it as an Avro data file.
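A sketch (again, only the output directory and the --as-avrodatafile flag change):

sqoop import \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --table user_data \
  --target-dir /table/userdata_avro \
  --as-avrodatafile \
  -m 2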

Storing as Avro data file

Importing based on a filter/criteria

For this we can use the option --where "column='value'" to select only a subset of the input based on a criteria or filter condition.

Sqoop Import with Where condition

Compressing Imported Data

We need to use the --compress option along with --compression-codec and a codec class name. The class name can be any of the below.

  • org.apache.hadoop.io.compress.GzipCodec
  • org.apache.hadoop.io.compress.DefaultCodec
  • org.apache.hadoop.io.compress.BZip2Codec
  • org.apache.hadoop.io.compress.SnappyCodec

Gzipped Compress data

Importing all the tables

Excluding a few tables
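A sketch of importing all tables while excluding a few with --exclude-tables (database and table names are assumptions):

# import everything except the listed tables
sqoop import-all-tables \
  --connect jdbc:mysql://localhost/test_db \
  --username root -P \
  --warehouse-dir /user/cloudera/warehouse \
  --exclude-tables table1,table2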