Apache Phoenix – An SQL Layer on HBase 5


Phoenix HBase Overview

What is Apache Phoenix?

Apache phoenix is another Top Level project from Apache Software Foundation. It provides an SQL interface to HBase. It is like an SQL layer on top of HBase architecture. It maps HBase data model to the relational world.

Phoenix is developed in java with a notion to put SQL back into NoSQL databases.

Need for Apache Phoenix:

Hive is added into Hadoop Eco-system to maintain and manage structured data in Hadoop and it also provide an SQL like dialect HiveQL to query the tables in Hive data warehouse. But hive doesn’t provide record level operations such as insert, update and delete operations.

To overcome this drawback of hive, HBase is brought into Hadoop eco-system to support record level operations. But HBase shell supports only HBase commands such as ‘scan’, ‘get’, ‘put’, ‘list’ to query entire table, to get a row, put a row and for listing tables etc. respectively. Thus HBase introduced its own set of commands and doesn’t support famous SQL interface.

Use of a well-understood language like SQL makes it easier for people to use HBase. Rather than learn another proprietary API, they can just use the language they’re used to to read and write their data.

To provide this advantage to HBase, Phoenix is introduced into Hadoop Eco System to provide an SQL Layer on top of HBase.

Phoenix provides very high performance when compared to Hive and Cloudera Impala or OpendTSDB.

Apache Phoenix Features:

  • It is delivered as embedded JDBC driver for HBase data.
  • Follows ANSI SQL standards whenever possible
  • Allows columns to be modeled as a multi-part row key or key/value cells.
  • Full query support with predicate push down and optimal scan key formation.
  • Versioned schema repository. Table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema.
  • DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns.
  • DML support: UPSERT VALUES for row-by-row insertion, UPSERT SELECT for mass data transfer between the same or different tables, and DELETE for deleting rows.
  • Compiles SQL query into a series of HBase scans, and runs those scans in parallel to produce regular JDBC result sets.
  • It can seamlessly integrate with HBase, Pig, Flume and Sqoop.
Limitations:
  • Limited transaction support through client-side batching.
  • Joins are not completely supported. FULL OUTER JOIN and CROSS JOIN are not supported.

Why is Phoenix so Fast?

  • Apache Phoenix breaks up SQL queries into multiple HBase scans and runs them in parallel.
  • Phoenix is developed with notion of bringing the computation to the data by using:
    • Coprocessors to perform operations on the server-side thus minimizing client/server data transfer
    • Custom filters to prune data as close to the source as possible In addition, to minimize any start up cost.
    • Phoenix uses native HBase APIs rather than going through the mapreduce framework.
  • Phoenix leverages below HBase custom filters to provide higher performance.
    • Essential Column Family filter leads to improved performance when Phoenix query filters on data that is split in multiple column families (cf) by only loading essential cf. In second pass, all cf are loaded as needed.
    • Phoenix’s Skip Scan Filter leverages SEEK_NEXT_USING_HINT of HBase Filter. It significantly improves point queries over key columns.
    • Salting in Phoenix leads to both improved read and write performance by adding an extra hash byte at start of key and pre-splitting data in number of regions. This eliminates hot-spotting of single or few regions servers
  • In single line, we can say that Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

For aggregate queries, co processors complete partial aggregation on their local region servers and only returns relevant data to the client.

Phoenix creates secondary indexes to improve performance on non-row key columns.

Apache Phoenix Installation/Configuration:

Apache Phoenix installation on HBase is very simple and straight but we need to pick right version of Phoenix that matches with HBase version on our Hadoop cluster. Below is version compatibility matrix.

  • Phoenix 3.x – HBase 0.94.x
  • Phoenix 4.x – HBase 0.98.1+

In this post, we need it for HBase version 0.98.2, so we use phoenix 4.1.0. We need to perform below activities.

  • Download Phoenix binary tar ball from Apache download mirrors and extract it into our preferred installation directory (usually into /usr/lib/phoenix).
  • Copy phoenix-4.1.0-server.jar file into the HBase lib directory on each HBase region server machine. Terminal commands for performing these two actions.

  • Add Phoenix installation directory to .bashrc file. Add below environment variables by opening $ gedit ~/.bashrc.

Phoenix Installation

  • Restart all HBase region servers and reopen the terminal to pick .bashrc changes. Below are the commands to restart HBase in QuickStart VM

Now the installation is completed.

Installation of Phoenix on Cloudera QuickStart VM

By default Cloudera Quickstart VM ships with a lot of hadoop eco system tools installed and configured in pseudo mode but Apache Phoenix is not yet officially supported tool by cloudera by the time of writing this post. So we need to install Apache Phoenix manually on Cloudera QuickStart VM.

Cloudera-QuickStart-VM-5.3.0 best integrates with Phoenix-4.3.0 version. (Cloudera Quickstart VM 5.4.0 or 5.4.2 [Hbase version 1.0.0] tried with Apache Vanilla Phoenix 4.4.0 or 4.5.0 but there are lot of integrity issues)

But the successful combination is Cloudera-QuickStart-VM-5.3.0 (HBase-0.98.6) with Phoenix-4.3.0 but here we need to download the phoenix from cloudera labs.

Verify Phoenix Installation:

To connect to phoenix, we need to specify the zookeeper quorum node and in pseudo mode it is localhost.

A terminal interface to execute SQL from the command line is now bundled with Phoenix. To start it, execute the following:

Verify the tables and columns with !tables & !columns commands.

Sqlline

Press Ctrl+D to quit the phoenix session.

In Cloudera QuickStart VM,