In this post we are going to discuss about basic details of Tableau software and Tableau Integration with hadoop.
What is Tableau?
Tableau is a visualization tool based on breakthrough technology that provides drag & drop features to analyze data on large amounts of data very easily and quickly.
The Dashboard of Tableau is very interactive and gives dynamic results. Tableau supports strong interactive capabilities and provides rich set of graphic charts. Tableau can query relational databases, cubes, cloud database, and spreadsheets and then generates a number of graph types that can be combined into dashboards and shared over a computer network or the internet.
Tableau offers 4 different products that can be used for various purposes:
- Tableau desktop – desktop version and can be used by anyone for 15 days without any license.
- Tableau server – Tableau Server is a business intelligence application that provides browser-based analytics.
- Tableau public – Public version of Tableau software.
- Tableau online – Tableau Online is a hosted version of Tableau Server.
Tableau supports Visual Query Language (VizQL). VizQL translates drag-and-drop actions into data queries and then expresses that data visually.
Tableau data engine is based on advanced In-Memory Technology to speed up ad-hoc analysis of massive data in a few seconds. Query results are cached in memory, to provide the results very quickly on even large amounts of data sets.
Tableau is a commercial product but it provides 15 days of trial period on Tableau Desktop, so that we can try this for some sample metrics. As of now Tableau supports only Windows and Mac Os. There is no support for Linux operating systems.
- Ease of Use – No programming or scripts needed. We can connect to data in a few clicks, via its drag and drop features and then visualize and create interactive dashboards with a few more.
- Speed processing – With its advance in-memory technology its processing speed is fast.
- Rich Set of GUI and Chart based dashboards.
- Easy publishing and sharing of dashboards on web or emails.
Its drawback is that, no predictive analytical capabilities are supported by Tableau.
Tableau Desktop Installation & Tableau Integration with Hadoop:
- Tableau Desktop for 15 days of trial period can be downloaded at http://www.tableausoftware.com/products/trial. Based on the windows operating system choose either 32 bit version or 64 bit version of installer files for Tableau Desktop. Download the windows installer file (TableauPublicDesktop-32bit.msi) and install it on the machine.
- For connecting with MySQL relational database, we need to download its ODBC (Open database connectivity) drivers and install on the machine. ODBC drivers for MySQL can be downloaded (mysql-connector-odbc-5.3.4-win32.msi) at http://dev.mysql.com/downloads/connector/odbc/
- Tableau supports visualization on Hadoop data by connecting to Hiveserver or Hiveserver2. There are no ODBC drivers available as of now for connecting to Apache Hive directly from Tableau but Cloudera provides ODBC drivers for Tableau integration with Hadoop via cloudera hiveserver.
- Cloudera hive ODBC drivers can be downloaded at http://www.cloudera.com/content/cloudera/en/downloads/connectors.html . Make sure to download ODBC drivers (ClouderaHiveODBC32.msi) but not JDBC drivers.
- Install the above .msi files in windows machine and restart it to correctly integrate the drivers with each other.
Connect to Tableau and Perform Data Visualization on Hadoop Data:
Start tableau desktop software and connect to Cloudera Hive (Data –> Connect) and provide the details as shown in below screen. Connect to Hiveserver at port 10000 on IP address of the machine on which hiveserver is running.
Once we have successfully connected to Hiveserver machine, we need to select the schema (database name) and need to either search tables or select any listed tables.
For example in the above screen default the schema name and we need to select the default schema here in the above screen. After selecting default as schema and clicking on search button against tables, then all the tables present in the this schema will be listed as shown in the below screen.
Now it is very easy to perform analysis/statistics/reports/joins on tables from the above list. We just need to drag and drop the tables in to worksheet as shown in above screen for table t1. If we need to join two tables then we need to drag two tables from the list and drop into worksheet, if there are any common columns between two tables they will be automatically used to join perform join other wise, we need to give the column names on which, the join should be performed. Also Tableau will ask by highlighting to choose join type (Inner join, left outer join, right outer join or full outer joins).
Once the tables are dropped into worksheet and if we click on “Go to Worksheet” Tableau will fire the query in the background on hiveserver and pulls the records from into its advanced in-memory. By default Tableau retrieves only 10000 rows into its memory but we can configure this by entering our required value into “Rows” field on right side of the above page and clicking on either of the options “Update now” or “Automatically Update”
Once the data from the table is loaded into memory, we are ready to perform any queries on visualizations on top of the data. Here we can drag any column from the tables and drop into “Columns” tab in the below screen and Preferably measures into Rows for better visualizations but this rule is not mandatory. We can explore the best possible ways to visualize the data just by playing around the drag and drop of various columns into different areas.
Here in the below screen we are trying to list out the number of records per date from a sample table. So, we have selected Dt into “Columns” and No of Records into “Rows” and below is the automatic bar chart visualization.
But also we can select any of the highlighted chart formats (Pie charts, bubble chart, etc…) to visualize data.
In the background, tableau will fire queries against hiveserver to pull data from hive table into it’s in-memory. For example, below is the screen shot of the query fired for data pulling the above sample table from hiveserver. We can observe the Tez job running in the below screen.
Tableau connectivity with MySQL:
Below is the procedure for connecting Tableau with MySQL RDBMS. After successful installation of ODBC drivers for MySQL database, (Data –> connect –> Select MySQL) and enter the details as shown in below screen.
Provide the MySQL server IP address and default port number 3306 for MySQL and provide username and password details to connect the database.
And select the database first and drag and drop the tables which we want to include in analysis.
Now as shown in above (cloudera hive) section, we can perform visualizations on the tables.
Data Blending in Tableau:
We can perform data blending in Tableau, by connecting two data sources at a time. As shown in above two sections we can connect to two data sources (Cloudera Hive and MySQL) at a time and select the tables into Worksheet. We can blend (mix/merging of data) these tables by assigning relationships between them. This can be done by Selecting a table from MySQL (say user table) and another table from Cloudera Hive Server (say t1) and we can perform data blending by selecting data –> edit relationships and selecting the common columns in the tables to blend the data among the tables. Below is a sample screen shot for the same. If both the tables have at least one column in common then only we can blend the tables.
Building Dashboards in Tableau:
We can build dashboards in Tableau with the visualization charts prepared on various charts into single dash board by selecting any layout. Below is the 4 sheet layout containing four types of visualization charts in a single dashboard.