Spark: Connect Tableau Desktop to SparkSQL

Last (but not least) post of 2014, and a new Hacking challenge. Based on the work I’ve done on SQLDeveloper (https://hadoopi.wordpress.com/2014/10/25/use-spark-sql-on-sql-developer/), I was wondering how to connect Tableau Desktop to my SparkSQL cluster.

Install Tableau Desktop

I’m quite new to Tableau, but it’s worth giving a try. However, spending $999 for a challenge isn’t worth it, but free edition offers a 14 days trial (this implementation took me a couple of days :)).

I’m using MacOS edition which – I believe – is not as stable as the Windows version (could explain some of the bugs / limitations mentioned hereafter). Don’t hesitate to let me know how it behaves on a Windows environment.

Anyway, installation is quite straightforward, and nothing to set on Tableau to make it work with SparkSQL. The main logic is on Spark itself. Simply make sure Hive is an available connection option on Tableau’s connection page (see below screenshot)

Screenshot 2014-12-31 17.10.28

Create a Hive Table

Spark SQL (although it works in standalone mode) is using Hive metastore. For that purpose, create a dummy Hive table and insert some data (alternatively, use Hive external table). The data set I’m using is the UFO survey available here that contains around 60’000 documented reports of unidentified flying object. My Hive table is as follows

col_name data_type comment 
-------------- --------- ------- 
date_observed string 
date_reported string 
location_city string 
location_state string 
shape string 
duration int

With the following first 10 lines:

1995/10/09 1995/10/09 Iowa City IA NA
1995/10/10 1995/10/11 Milwaukee WI 120
1995/01/01 1995/01/03 Shelton WA NA
1995/05/10 1995/05/10 Columbia MO 120
1995/06/11 1995/06/14 Seattle WA NA
1995/10/25 1995/10/24 Brunswick County ND 1800
1995/04/20 1995/04/19 Fargo ND 120
1995/09/11 1995/09/11 Las Vegas NV NA
1995/01/15 1995/02/14 Morton WA NA
1995/09/15 1995/09/15 Redmond WA 360

Getting this exact table required a pre-processing of the original CSV, but out of scope for this article. Tableau will work with any kind of flat structure such as CSV.

Start Spark Thrift interface

Assuming Hive metastore is up and running, you need to copy Hive config files to Spark configuration directory.

sudo cp /etc/hive/hive-site.xml /usr/lib/spark/conf/

Now make sure Spark can access your Hive tables (not that your spark distribution should have been built with Hive support enabled)

antoine@dataphi:~ $ spark-sql --master local
Spark assembly has been built with Hive, including Datanucleus jars on classpath
spark-sql> show tables;
ufo
ufo_date
ufo_tmp
Time taken: 1.563 seconds
spark-sql>

As Spark can connect Hive metastore, let’s start Spark Thrift service as follows

/usr/lib/spark/sbin/start-thriftserver.sh --master ${MASTER}

I’m using MASTER=spark://localhost:7077, but it should also work on yarn-client. Once done, make sure a Spark job is running (from Spark UI).

Screenshot 2014-12-31 17.26.25

 

Note that Spark is using same port as HiveServer2. Should you have any issue with HiveServer running on the same box, change spark port with option : –hiveconf hive.server2.thrift.port=10001

At this stage, a SQL client with the right hive driver set should be able to connect your SparkSQL through the Spark Thrift Service you’ve just initiated

Install Hive ODBC package

On my MacOS, I get the ODBC packages from Hortonworks website. Demo effect, the download page is currently not available, but trust me, I made it work using package available there. I could delay this post by a few days to give you the exact URL, but I’m quite in a rush if I want to publish this one still in 2014 🙂

Have fun with Tableau

Now that your environment is ready, let’s get into Tableau Desktop. The connection to use is the “Hive Connection for Hortonworks”.  Simply add host / port of your thrift interface, and select HiveServer2. Authentication on Spark is disabled by default. However, I’m adding username only to make sure Tableau will act on behalf of my user in regards to my Hive connection.

Screenshot 2014-12-31 17.38.50

Simply click on connect, and assuming you did it right (mainly that you’ve found the right driver), you should get access to your Hive databases / tables.

Screenshot 2014-12-31 17.39.19

Go to your worksheet, and start playing with visualisations. Note how fast it is compare to a vanilla Hive connection when playing with numbers / String (see below errors with dates). Here is an example of UFO sightings in the US over the past 30 years.

Screenshot 2014-12-31 17.46.44

 

Limitations

The main limitation I’ve found out are about the date format. I’m getting below errors anytime I’m playing with date hierarchy (year, quarter, month, etc…), and I’m note brave enough to investigate it any further today.

Screenshot 2014-12-31 17.58.21

I made it work, great, but it is far not stable enough to be considered as a supported Tableau connection. However, the official connection will be released soon (http://www.tableausoftware.com/about/blog/2014/10/tableau-spark-sql-big-data-just-got-even-more-supercharged-33799), and I’ve requested a Beta program access to Tableau.

So far, getting a awesome tool like Tableau on a great big data stack sounds truly exiting, so once again, 2015 will be SPARKling!

With you all the best for the new year!

Antoine

 

 

Advertisements