Sqoop Interview Cheat Sheet 1


Install sqoop

sudo yum install sqoop

  1. On Debian/Ubuntu systems: sudo apt-get install sqoop
  2. Sqoop is run from the normal command prompt; there is no daemon to start.
  3. Sqoop configuration file: sqoop-site.xml
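To confirm the installation worked, you can print the installed version (a quick sanity check; the exact output varies by distribution):

sqoop version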

install jdbc drivers

After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory.
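For example, for a package-based install with the MySQL driver (the JAR file name below is illustrative; use whichever version you downloaded):

sudo cp mysql-connector-java-5.1.38-bin.jar /usr/lib/sqoop/lib/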

sqoop list-databases
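A minimal sketch, assuming a local MySQL server and a user named sqoop; swap in your own connection string and credentials:

sqoop list-databases \
  --connect jdbc:mysql://localhost/ \
  --username sqoop \
  --password sqoop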

import

sqoop import

Results are imported to HDFS under the user's home directory (for the cloudera user, /user/cloudera/employees).

In the employees directory, 4 files are created, one per mapper (see the example after the list):

  • part-m-00000
  • part-m-00001
  • part-m-00002
  • part-m-00003
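A minimal sketch of the import above, assuming a MySQL database named sqoopdb and a table named employees (both names are illustrative):

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees

With the default of four mappers, this produces the four part-m-* files listed above.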

import all
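To transfer every table in a database with one command, Sqoop provides the import-all-tables tool. A sketch with the same illustrative connection parameters as above:

sqoop import-all-tables \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop

Tables you do not want can be skipped with the --exclude-tables parameter (a comma-separated list).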

 

mappers (-m 1)

sqoop import with the -m (or --num-mappers) parameter controls the number of parallel map tasks, and therefore the number of output files.
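A sketch forcing a single mapper, so only one part-m-00000 file is produced; connection details are illustrative as before:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees \
  -m 1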

 

Subset of Data (--where)

sqoop import

 

The resulting part-m-00000 file contains 179,973 records.
When using the --where parameter, keep in mind the parallel nature of Sqoop transfers.
Data will be transferred in several concurrent tasks. Any expensive function call will put a significant performance burden on your database server. Advanced functions could lock certain tables, preventing Sqoop from transferring data in parallel, which will adversely affect transfer performance. For efficient advanced filtering, run the filtering query on your database prior to import, save its output to a temporary table, and run
Sqoop to import the temporary table into Hadoop without the --where parameter.
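A sketch of a --where import, assuming the illustrative employees table has a country column (both names are assumptions):

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees \
  --where "country = 'USA'"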

boundary queries
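Sqoop determines the split ranges by running a boundary query, which by default selects the minimum and maximum of the split column. You can supply your own with --boundary-query. A sketch, again with illustrative table and column names:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees \
  --split-by emp_id \
  --boundary-query "SELECT MIN(emp_id), MAX(emp_id) FROM employees"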

 

 

 

 

By default, when using the --compress parameter, output files will be compressed using the GZip codec, and all files will end up with a .gz extension.
You can choose any other codec using the --compression-codec parameter.

other compression codecs
bzip2 produces files with the .bz2 extension
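A sketch of an import compressed with BZip2; the codec class is the standard Hadoop one, and connection details are illustrative:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees \
  --compress \
  --compression-codec org.apache.hadoop.io.compress.BZip2Codec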

As Sqoop delegates compression to the MapReduce engine, you need to make sure the compressed map output is allowed in your Hadoop configuration. For example, if in the mapred-site.xml file the property
mapred.output.compress is set to false with the final flag, then Sqoop won’t be able to compress the output files even when you call it with the --compress parameter.
The selected compression codec might have a significant impact on subsequent processing.
Some codecs do not support seeking to the middle of the compressed file without reading all previous content, effectively preventing Hadoop from processing the input files in a parallel manner. You should use a splittable codec for data that you’re planning to use in subsequent processing.

  • Compression codecs:
  • Splittable: BZip2, LZO
  • Not splittable: GZip, Snappy

 

encoding null values

Sqoop encodes database NULL values using the "null" string constant. If your downstream processing (Hive queries, a custom MapReduce job, or a Pig script) uses a different constant for encoding missing values, you can override the default with the --null-string and --null-non-string parameters (see the sketch below).
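A sketch that makes Sqoop emit Hive's \N encoding for NULLs instead of the default string; connection and table names are illustrative:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table employees \
  --null-string '\\N' \
  --null-non-string '\\N'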

sharing metastore

Sqoop’s metastore can easily be started as a service with the following command:
sqoop metastore
Other clients can connect to this metastore by specifying the --meta-connect parameter on the command line with the URL of this machine; for example, you can create a new saved job in a metastore running on a remote host, as in the sketch below.
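A sketch of creating a saved job against a remote metastore; the host name metastore.example.com, the port 16000 (the metastore default), and the job and table names are assumptions to adjust for your environment:

sqoop job \
  --create visits-import \
  --meta-connect jdbc:hsqldb:hsql://metastore.example.com:16000/sqoop \
  -- \
  import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table visits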

 

The content of each <arg> tag is considered to be one parameter regardless of how many spaces it contains; this is especially useful for entering queries, as <arg>SELECT * FROM cities</arg> is considered to be one single parameter. Having spaces inside a <command> tag might not be obvious, especially when you’re using variables to parametrize your workflow. The preferred way to use Sqoop in Oozie is with <arg> tags.

 

install jdbc oozie

Sqoop works correctly when executed from the command line, but in Oozie it cannot
find the JDBC drivers.
Solution
You need to install the JDBC drivers into Oozie separately. You have two options: install
the driver either into your workflow’s lib/ directory or into the shared action library
location usually found at /user/oozie/share/lib/sqoop/.
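A sketch of installing the MySQL driver into the Oozie shared action library on HDFS; the JAR name and the sharelib path are assumptions, so check your cluster's actual sharelib location:

hdfs dfs -put mysql-connector-java-5.1.38-bin.jar /user/oozie/share/lib/sqoop/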

 

Use the sqoop.export.records.per.statement property to specify the number of records that will be used in each insert statement. You can set how many insert statements will be executed per transaction with the sqoop.export.statements.per.transaction property.
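A sketch combining both properties on an export (table and directory names are illustrative); note that -D properties must come right after the tool name, before the tool-specific arguments:

sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.statements.per.transaction=100 \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table cities \
  --export-dir /user/cloudera/cities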

 

staging table

You can use a staging table to first load data to a temporary table before making changes
to the real table. The staging table name is specified via the --staging-table parameter,
as in the example below.
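A sketch of an export through a staging table; the staging table cities_stage must already exist with the same structure as cities, and all names here are illustrative:

sqoop export \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table cities \
  --staging-table cities_stage \
  --export-dir /user/cloudera/cities

You can add --clear-staging-table to have Sqoop empty the staging table before the export starts.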

updating dataset

The ability to conditionally insert a new row or update an existing one is an advanced database feature known as upsert. This feature is not available on all database systems nor supported by all Sqoop connectors. Currently it’s available only for Oracle and nondirect MySQL exports.
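A sketch of an upsert export using --update-key together with --update-mode allowinsert; the key column id and other names are illustrative:

sqoop export \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table cities \
  --update-key id \
  --update-mode allowinsert \
  --export-dir /user/cloudera/cities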

stored procedure
Using Stored Procedures

 

Instead of inserting rows directly into a table, Sqoop export can call a stored procedure for each record via the --call parameter. For example, when exporting into MySQL,
Sqoop uses the following query:
CALL populate_cities(?, ?, ?)
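A sketch of the corresponding export command; the procedure and directory names follow the example above, and connection details are illustrative:

sqoop export \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --call populate_cities \
  --export-dir /user/cloudera/cities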

import into hive

Sqoop mandates that the partition column be of type STRING. The current implementation is limited to a single partition level. Unfortunately, you can’t use this feature if your table has more than one level of partitioning (e.g., if you would like a partition by day followed by a partition by hour).

delims

You can instruct Sqoop to automatically clean your data using --hive-drop-import-delims, which will remove all \n, \t, and \01 characters from all string-based columns.
If removing the special characters is not an option in your use case, you can take advantage of the parameter --hive-delims-replacement, which will accept a replacement string.
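A sketch of a partitioned Hive import with delimiter cleanup; the database, table, partition key, and partition value are illustrative:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table visits \
  --hive-import \
  --hive-table visits \
  --hive-partition-key day \
  --hive-partition-value "2017-09-21" \
  --hive-drop-import-delims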

 

Problem

Imports into HBase take significantly more time than importing as text files in HDFS.
Solution
Create your HBase table prior to running Sqoop import, and instruct HBase to create
more regions with the parameter NUMREGIONS. For example, you can create the HBase
table cities with the column family world and 20 regions using the following
command:
hbase> create 'cities', 'world', {NUMREGIONS => 20, SPLITALGO => 'HexStringSplit'}
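A sketch of the matching Sqoop import into that pre-created HBase table; connection details and the row-key column id are assumptions, and --hbase-create-table is deliberately omitted since the table already exists:

sqoop import \
  --connect jdbc:mysql://localhost/sqoopdb \
  --username sqoop \
  --password sqoop \
  --table cities \
  --hbase-table cities \
  --column-family world \
  --hbase-row-key id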
Sqoop architecture

Sqoop will connect to your Hadoop cluster and submit a MapReduce job. Each mapper of the job will then transfer a slice of the table’s data. As MapReduce executes multiple mappers at the same time, Sqoop will be transferring data in parallel to achieve the best possible performance by utilizing the potential of your database server. Each mapper transfers the table’s data directly between the database and the Hadoop cluster.
To avoid becoming a transfer bottleneck, the Sqoop client acts as the overseer rather than as an active participant in transferring the data. This is a key tenet of Sqoop’s design.

Filters

HBase filters are a powerful feature that can greatly enhance your effectiveness when working with data stored in tables.

Most concrete filter classes are direct descendants of FilterBase. CompareFilter-based filters add one more feature to the base FilterBase class, namely the compare() operation, which requires a user-supplied operator type that defines how the result of the comparison is interpreted.

Why Hbase?

HBase is a distributed and scalable data store that supports row-level updates, rapid queries, and row-level transactions (but not multirow transactions).

hbase is good for oltp

HBase can be used like a key-value store, where a single key is used for each row to provide very fast reads and writes of the row’s columns or column families. HBase also keeps a configurable number of versions of each column’s values (marked by timestamps), so it’s possible to go “back in time” to previous values, when needed.



