Spark: Use Spark-SQL on SQL Developer

Screen Shot 2014-10-25 at 14.29.50

I’m describing here how I set SQL Developer to connect / query my Spark cluster. I made it work on my local environment below:

  • Ubuntu precise 64 bits (1 master, 2 slaves)
  • Hadoop Hortonworks 2.4.0.2.1.5.0-695
  • Hive version 0.13.0.2.1.5.0-695, metastore hosted on MySQL database
  • Spark 1.1.0 prebuilt for Hadoop 2.4
  • SQL Developer 4.0.3.16

Note that I’ve successfully tested same setup on a 20 nodes cluster on AWS (EMR)

Preparing your Spark environment

First of all, you will have to make sure Spark SQL is able to connect Hive metastore. Simply copy hive-site.xml file in Spark conf directory

sudo cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf/

Open a hive shell, and create a dummy table

hive> CREATE TABLE dummy (foo STRING, bar STRING);
OK
Time taken: 2.555 seconds

Now open a spark-sql shell, and make sure you can find the newly created table. Note that –master local can be used here since we’re not running any job.

vagrant@vagrant:~$ /usr/lib/spark/bin/spark-sql --master local
../..
spark-sql> show tables;
../..
dummy
../..
Time taken: 3.452 seconds
../..

Great, looks like Spark-SQL is now connected to Hive metastore…

Install Hive JDBC drivers on SQL Developer

oraSqlD01

SQL Developer is quite a nice tool – at least for Oracle databases. In order to make it work with Hive / Spark, we need to download the Hive connectors first. I found them on Cloudera website. Note that I was not able to find same from Hortonworks. This could be an issue if I was using Hive natively, but it worth giving a try on Spark first.

Unzip the downloaded archive (I’m using JDBC4 package), and open SQLDeveloper.
You will have to add third-part connectors from preferences -> database -> Third party JDBC driver. Add all the jars included in your archive

~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/libthrift-0.9.0.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/HiveJDBC4.jar                                           
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/log4j-1.2.14.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/TCLIServiceClient.jar                                   
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/ql.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/hive_metastore.jar                                      
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/slf4j-api-1.5.8.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/hive_service.jar                                        
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/slf4j-log4j12-1.5.8.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/libfb303-0.9.0.jar

Once added, restart SQL Developer, and ohhh, awesome, a new tab “Hive” magically appeared 🙂

Screen Shot 2014-10-25 at 15.07.30

Open this tab and create a new connection to your HiveServer2 as follows

Screen Shot 2014-10-25 at 14.22.29

And test connection ! Assuming HiveServer2 / Hive Metastore is running, your connection should be established and a new Hive Worksheet should pop up. At this stage, connection From SQL Developer to Hive is up, but SQL are executed on Hive, not Spark !

Screen Shot 2014-10-25 at 15.15.43

Now we need to switch back to Spark !

The goal is now to use Spark thrift interface in lieu of vanilla HiveServer2. Stop HiveServer2 daemon, and start Spark thrift interface as follows..

vagrant@vagrant:~/usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
Spark assembly has been built with Hive, including Datanucleus jars on classpath
../..
14/10/25 15:11:49 INFO service.AbstractService: Service:OperationManager is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: SessionManager is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: CLIService is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service:ThriftBinaryCLIService is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: HiveServer2 is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service:OperationManager is started.
14/10/25 15:11:49 INFO service.AbstractService: Service:SessionManager is started.
14/10/25 15:11:49 INFO service.AbstractService: Service:CLIService is started.
14/10/25 15:11:49 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.56.101:9083
14/10/25 15:11:49 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/10/25 15:11:50 INFO hive.metastore: Connected to metastore.
14/10/25 15:11:50 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
14/10/25 15:11:50 INFO service.AbstractService: Service:HiveServer2 is started.
14/10/25 15:11:50 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
14/10/25 15:11:50 INFO thrift.ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:10000
../..

Looking at application master log file, Spark is actually embedding its own HiveServer2 into a Spark job. Connection will be apparently dropped if Spark job ends up, so make sure Spark job is not killed !

I’m using yarn-client to handle my JDBC connections, but this should support spark standalone clusters as well

Now get back to your SQL Developer, and restore connection to Hive (HiveServer is now embedded into a Spark job), and you’re now connected to Spark !

And voila !

For testing purpose, I’ve downloaded some ngrams data from S3 (link), and created a new Table ngrams on Hive. I’m now able to get lightning fast SQL queries on Spark from a user friendly environment (SQL Developer).

Screen Shot 2014-10-25 at 14.21.35

As initially said, this seems quite hacky (Remember I used Cloudera JDBC on Hortonworks environment), but it seems to work, and it seems to be really fast ! Please let me know the environment / distribution you’re using if you’re facing any issue..

Cheers!
Antoine

Advertisements