Use Spark-SQL on SQL Developer

I'm describing here how I set up SQL Developer to connect to and query my Spark cluster. I made it work on my local environment, described below:

- Ubuntu Precise 64-bit (1 master, 2 slaves)
- Hadoop Hortonworks 2.4.0.2.1.5.0-695
- Hive 0.13.0.2.1.5.0-695, metastore hosted on a MySQL database
- Spark 1.1.0 prebuilt for Hadoop 2.4
- SQL Developer 4.0.3.16

Note that I've … Continue reading Use Spark-SQL on SQL Developer
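As a rough sketch of what the SQL Developer setup exercises under the hood: a Hive JDBC client talking to the Spark Thrift Server (which ships with Spark 1.1.0). The hostname, port, and credentials below are placeholders, and the Hive JDBC driver jar must be on the classpath — the same jar SQL Developer needs registered as a third-party JDBC driver.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SparkSqlJdbc {
    public static void main(String[] args) throws Exception {
        // Hive 0.13-era JDBC driver class.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host/port — point this at your Spark Thrift Server.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://master:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

SQL Developer uses the same `jdbc:hive2://…` URL format once the Hive driver is registered.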


Processing GDELT data using Hadoop InputFormat and SparkSQL

GDELT — a quick overview of the GDELT public data set: "The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organisations, counts, themes, sources, and events driving our global society every second of every day, creating a free open platform … Continue reading Processing GDELT data using Hadoop InputFormat and SparkSQL
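GDELT event records are distributed as tab-delimited lines, which is what a custom Hadoop InputFormat ultimately has to split apart. A minimal plain-Java sketch of that parsing step is below; the field positions are illustrative only — the real layout is defined by the GDELT event codebook.

```java
// Minimal sketch of parsing one tab-delimited GDELT-style event record.
public class GdeltRecord {
    public final String eventId;
    public final String day;

    private GdeltRecord(String eventId, String day) {
        this.eventId = eventId;
        this.day = day;
    }

    public static GdeltRecord parse(String line) {
        // Limit -1 keeps trailing empty fields, which tab-delimited
        // dumps frequently contain.
        String[] fields = line.split("\t", -1);
        // Assumed positions: field 0 = event id, field 1 = date.
        return new GdeltRecord(fields[0], fields[1]);
    }
}
```

In a real job this logic would live inside the RecordReader backing the custom InputFormat, emitting one parsed record per line.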

Add third-party libraries to MapReduce job

Anybody working with Hadoop has probably already faced the same common issue: how to add third-party libraries to a MapReduce job. Add libjars option The first solution, probably the most common one, consists of adding libraries using the -libjars parameter on the CLI. To make this work, your class MyClass must use the GenericOptionsParser class. The easiest way is … Continue reading Add third-party libraries to MapReduce job
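A minimal sketch of a driver that honors -libjars via GenericOptionsParser — the class and jar names in the invocation comment are placeholders:

```java
// Invocation sketch (paths and names are placeholders):
//   hadoop jar myjob.jar MyClass -libjars /path/dep1.jar,/path/dep2.jar in out
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyClass {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Consumes generic options (-libjars, -D, -files, …) into conf and
        // returns only the application-specific arguments.
        String[] remaining = new GenericOptionsParser(conf, args).getRemainingArgs();
        // … build and submit the Job using conf and the remaining args …
    }
}
```

Without GenericOptionsParser (or the Tool interface, which wraps it), -libjars is silently ignored because nothing ever parses it.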

Primitive Array Clustering

Hadoop implementation of Canopy clustering using the Levenshtein distance algorithm and other non-mathematical distance measures (such as the Jaccard coefficient). Difference with Mahout Needless to say, clustering algorithms (such as k-means) use a mathematical approach to compute clusters' centers. Each time a new point is added to a cluster, the Mahout framework recomputes the cluster's center as … Continue reading Primitive Array Clustering
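For reference, a self-contained sketch of the Levenshtein distance mentioned above — the standard dynamic-programming formulation with two rolling rows, not the post's actual Hadoop code:

```java
// Levenshtein edit distance in O(|a|·|b|) time, O(|b|) space.
public class LevenshteinDistance {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty string to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                // Min of insertion, deletion, substitution.
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Because this distance operates on raw strings rather than numeric vectors, it fits the "non-mathematical" canopy approach where a true centroid cannot be averaged.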

Implementing the Tool interface for MapReduce driver

Most people create their MapReduce job using driver code that is executed through its static main method. The downside of such an implementation is that most of your specific configuration (if any) is usually hardcoded. Should you need to modify some of your configuration properties on the fly (such as changing the number … Continue reading Implementing the Tool interface for MapReduce driver
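The standard Tool pattern looks roughly like this — MyDriver and the -D property in the comment are placeholders; the Tool/ToolRunner/Configured classes are Hadoop's own:

```java
// With ToolRunner, generic options become overridable at launch time, e.g.:
//   hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=4 in out
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() is already populated with any -D overrides.
        Configuration conf = getConf();
        // … build and submit the Job using conf and args …
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
    }
}
```

Extending Configured rather than storing the Configuration yourself is what lets ToolRunner inject the parsed configuration before run() is called.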