Sqoop Hive Use Case Example 3


This is another Use case on Sqoop, Hive concepts. Hive Use Case Example.

Hive Use Case Example

Problem Statement

There are about 35,000 crime incidents that happened in the city of San Francisco in the last 3 months.

Our task is to store this relational data in an RDBMS. Use Sqoop to import it into Hadoop.

Can we answer the following queries on this data:

  • Relative frequencies of different types of crime incidents
  • Crime occurrence frequency as a function of day of the week

https://data.sfgov.org/Public-Safety/SFPD-Incidents-Previous-Three-Months/tmnf-yvry

Step-1:

Download the data into .csv format

Step-2:

Copy the file into hdfs, as shown as below command,

Step-3:

if you observe the data,

Here,

column is surrounded by quotes and in between we have commas. so while exporting that data into RDBMS we need to remove that quotes and we should not consider comma inside ‘description’ as delimiter. This can be achieved with the help of CSVSerde in hive. Below is the high level notes on it.

To Setup the CSVSerde, do the following:

  1. Download the CSV SerDe jar to from https://github.com/ogrodnek/csv-serde.
  2. In Hive CLI, submit “add jar <path-to-serde-jar>".
  3. The list jars command should display the CSV SerDe if it has been added successfully.
  4. To add the CSV SerDe to an existing table:
  5. To add the CSV SerDe to a new table:
  6. The default separator, quote, and escape characters from the opencsv library are:
  7. We can use SERDEPROPERTIES to override default escape, quite and separator characters. For example, the following overrides the defaults, using the tab character instead of comma for the separator, using a single quote instead of a double quote, and two backslashes for the escape character:

To achieve this we are first creating hive table and loading data into it,

Step-4:

Now we are again copying data from hive table to hdfs to remove commas inside the description specified above.