Spark: Use Spark-SQL on SQL Developer

Screen Shot 2014-10-25 at 14.29.50

I’m describing here how I set SQL Developer to connect / query my Spark cluster. I made it work on my local environment below:

  • Ubuntu precise 64 bits (1 master, 2 slaves)
  • Hadoop Hortonworks 2.4.0.2.1.5.0-695
  • Hive version 0.13.0.2.1.5.0-695, metastore hosted on MySQL database
  • Spark 1.1.0 prebuilt for Hadoop 2.4
  • SQL Developer 4.0.3.16

Note that I’ve successfully tested same setup on a 20 nodes cluster on AWS (EMR)

Preparing your Spark environment

First of all, you will have to make sure Spark SQL is able to connect Hive metastore. Simply copy hive-site.xml file in Spark conf directory

sudo cp /etc/hive/conf/hive-site.xml /usr/lib/spark/conf/

Open a hive shell, and create a dummy table

hive> CREATE TABLE dummy (foo STRING, bar STRING);
OK
Time taken: 2.555 seconds

Now open a spark-sql shell, and make sure you can find the newly created table. Note that –master local can be used here since we’re not running any job.

vagrant@vagrant:~$ /usr/lib/spark/bin/spark-sql --master local
../..
spark-sql> show tables;
../..
dummy
../..
Time taken: 3.452 seconds
../..

Great, looks like Spark-SQL is now connected to Hive metastore…

Install Hive JDBC drivers on SQL Developer

oraSqlD01

SQL Developer is quite a nice tool – at least for Oracle databases. In order to make it work with Hive / Spark, we need to download the Hive connectors first. I found them on Cloudera website. Note that I was not able to find same from Hortonworks. This could be an issue if I was using Hive natively, but it worth giving a try on Spark first.

Unzip the downloaded archive (I’m using JDBC4 package), and open SQLDeveloper.
You will have to add third-part connectors from preferences -> database -> Third party JDBC driver. Add all the jars included in your archive

~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/libthrift-0.9.0.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/HiveJDBC4.jar                                           
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/log4j-1.2.14.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/TCLIServiceClient.jar                                   
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/ql.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/hive_metastore.jar                                      
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/slf4j-api-1.5.8.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/hive_service.jar                                        
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/slf4j-log4j12-1.5.8.jar
~/Downloads/Cloudera_HiveJDBC_2.5.4.1006/Cloudera_HiveJDBC4_2.5.4.1006/libfb303-0.9.0.jar

Once added, restart SQL Developer, and ohhh, awesome, a new tab “Hive” magically appeared 🙂

Screen Shot 2014-10-25 at 15.07.30

Open this tab and create a new connection to your HiveServer2 as follows

Screen Shot 2014-10-25 at 14.22.29

And test connection ! Assuming HiveServer2 / Hive Metastore is running, your connection should be established and a new Hive Worksheet should pop up. At this stage, connection From SQL Developer to Hive is up, but SQL are executed on Hive, not Spark !

Screen Shot 2014-10-25 at 15.15.43

Now we need to switch back to Spark !

The goal is now to use Spark thrift interface in lieu of vanilla HiveServer2. Stop HiveServer2 daemon, and start Spark thrift interface as follows..

vagrant@vagrant:~/usr/lib/spark/sbin/start-thriftserver.sh --master yarn-client
Spark assembly has been built with Hive, including Datanucleus jars on classpath
../..
14/10/25 15:11:49 INFO service.AbstractService: Service:OperationManager is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: SessionManager is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: CLIService is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service:ThriftBinaryCLIService is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service: HiveServer2 is inited.
14/10/25 15:11:49 INFO service.AbstractService: Service:OperationManager is started.
14/10/25 15:11:49 INFO service.AbstractService: Service:SessionManager is started.
14/10/25 15:11:49 INFO service.AbstractService: Service:CLIService is started.
14/10/25 15:11:49 INFO hive.metastore: Trying to connect to metastore with URI thrift://192.168.56.101:9083
14/10/25 15:11:49 INFO hive.metastore: Waiting 1 seconds before next connection attempt.
14/10/25 15:11:50 INFO hive.metastore: Connected to metastore.
14/10/25 15:11:50 INFO service.AbstractService: Service:ThriftBinaryCLIService is started.
14/10/25 15:11:50 INFO service.AbstractService: Service:HiveServer2 is started.
14/10/25 15:11:50 INFO thriftserver.HiveThriftServer2: HiveThriftServer2 started
14/10/25 15:11:50 INFO thrift.ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:10000
../..

Looking at application master log file, Spark is actually embedding its own HiveServer2 into a Spark job. Connection will be apparently dropped if Spark job ends up, so make sure Spark job is not killed !

I’m using yarn-client to handle my JDBC connections, but this should support spark standalone clusters as well

Now get back to your SQL Developer, and restore connection to Hive (HiveServer is now embedded into a Spark job), and you’re now connected to Spark !

And voila !

For testing purpose, I’ve downloaded some ngrams data from S3 (link), and created a new Table ngrams on Hive. I’m now able to get lightning fast SQL queries on Spark from a user friendly environment (SQL Developer).

Screen Shot 2014-10-25 at 14.21.35

As initially said, this seems quite hacky (Remember I used Cloudera JDBC on Hortonworks environment), but it seems to work, and it seems to be really fast ! Please let me know the environment / distribution you’re using if you’re facing any issue..

Cheers!
Antoine

Advertisements

7 thoughts on “Spark: Use Spark-SQL on SQL Developer

  1. Pingback: Spark: Connect Tableau Desktop to SparkSQL | Hadoopi

  2. Hi,

    I am using Hive JDBC driver provided by Cloudera to access Hive tables.
    [http://www.cloudera.com/content/www/en-us/downloads/connectors/hive/jdbc/2-5-4.html]

    static String JDBC_DRIVER = “com.cloudera.hive.jdbc41.HS2Driver”;
    // Define a string as the connection URL
    private static final String CONNECTION_URL = “jdbc:hive2://localhost:10000/default”;

    public static void main(String[] args) throws Exception {
    Connection con = null;
    Statement stmt = null;
    ResultSet rs = null;
    // Define a plain query
    String query = “SELECT name FROM employees”;
    // Define a parametrized query

    try {
    // Register the driver using the class name
    Class.forName(JDBC_DRIVER);
    // Establish a connection using the connection

    // URL
    con = DriverManager.getConnection(CONNECTION_URL,”hive”,””);
    // Create a Statement object for sending SQL
    // statements to the database
    stmt = con.createStatement();
    // Execute the SQL statement
    rs = stmt.executeQuery(query);

    If I use HiveServer2 , the above program works fine .
    Now I want to run the SQL against Spark ,so I stop/kill HiveServer2 and start Spark thrift server but the program fails with following exception.
    Please note that if I connect using beeline ,it works fine. Any idea what am I missing here.

    /scratch/xyz/view_storage/xyz_bam_spark_Nov/jdk8u40/bin/java -server -classpath /scratch/xyz/spark-examples-master/first-example/SparkApp/.adf:/scratch/xyz/spark-examples-master/first-example/SparkApp/first-example/classes:/home/xyz/clouder_jdbc/jars/HiveJDBC41.jar:/home/xyz/clouder_jdbc/jars/libthrift-0.9.0.jar:/home/xyz/clouder_jdbc/jars/hive_metastore.jar:/home/xyz/clouder_jdbc/jars/hive_service.jar:/home/xyz/clouder_jdbc/jars/libfb303-0.9.0.jar:/home/xyz/clouder_jdbc/jars/log4j-1.2.14.jar:/home/xyz/clouder_jdbc/jars/ql.jar:/home/xyz/clouder_jdbc/jars/slf4j-api-1.5.11.jar:/home/xyz/clouder_jdbc/jars/slf4j-log4j12-1.5.11.jar:/home/xyz/clouder_jdbc/jars/TCLIServiceClient.jar:/home/xyz/clouder_jdbc/jars/zookeeper-3.4.6.jar -Djavax.net.ssl.trustStore=/tmp/trustStore291984617787443166.jks org.sparkexample.ClouderaJDBCHiveExample
    log4j:WARN No appenders could be found for logger (org.apache.thrift.transport.TSaslTransport).
    log4j:WARN Please initialize the log4j system properly.
    java.sql.SQLException: [Cloudera][HiveJDBCDriver](500151) Error setting/closing session: Open Session Error.
    at com.cloudera.hive.hivecommon.api.ExtendedHS2Client.openSession(Unknown Source)
    at com.cloudera.hive.hivecommon.api.HS2Client.(Unknown Source)
    at com.cloudera.hive.hivecommon.api.ExtendedHS2Client.(Unknown Source)
    at com.cloudera.hive.hivecommon.api.ZooKeeperEnabledExtendedHS2Factory.createClient(Unknown Source)
    at com.cloudera.hive.hivecommon.core.HiveJDBCCommonConnection.connect(Unknown Source)
    at com.cloudera.hive.hive.core.HiveJDBCConnection.connect(Unknown Source)
    at com.cloudera.hive.jdbc.common.BaseConnectionFactory.doConnect(Unknown Source)
    at com.cloudera.hive.jdbc.common.AbstractDriver.connect(Unknown Source)
    at java.sql.DriverManager.getConnection(DriverManager.java:664)
    at java.sql.DriverManager.getConnection(DriverManager.java:247)
    Caused by: com.cloudera.hive.support.exceptions.GeneralException: [Cloudera][HiveJDBCDriver](500151) Error setting/closing session: Open Session Error.
    … 10 more
    Caused by: com.cloudera.hive.support.exceptions.GeneralException: CONN_SESSION_ERR
    … 10 more
    Process exited with exit code 0.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s