Processing GDELT data using Hadoop InputFormat and SparkSQL

GDELT: a quick overview of the GDELT public data set: "GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organisations, counts, themes, sources, and events driving our global society every second of every day, creating a free open platform …


Create a simple Spark job

MapReduce is dead, long live Spark! Following the new big data trends, the logical next step for me is to start getting my head around Apache Spark. YARN is yet another resource negotiator indeed, but definitely not yet another big data application... Although you can still execute MapReduce on YARN, writing a YARN application is …

Add third-party libraries to MapReduce job

Anybody working with Hadoop has probably faced the same common issue: how to add third-party libraries to a MapReduce job. Add libjars option: the first solution, maybe the most common one, consists of adding libraries using the -libjars parameter on the CLI. To make it work, your class MyClass must use the GenericOptionsParser class. The easiest way is …

Primitive Array Clustering

Hadoop implementation of Canopy Clustering using the Levenshtein distance algorithm and other non-mathematical distance measures (such as the Jaccard coefficient). Difference with Mahout: needless to say, clustering algorithms (such as K-Means) use a mathematical approach to compute cluster centers. Each time a new point is added to a cluster, the Mahout framework recomputes the cluster's center as …
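To give a feel for the two distance measures named in this excerpt, here is a minimal plain-Java sketch. It is not the code from the post: the class name DistanceMeasures and both method names are hypothetical, and the Jaccard measure is assumed here to operate on sets of string tokens.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only (not taken from the post).
final class DistanceMeasures {

    // Levenshtein edit distance computed with two rolling rows
    // of the classic dynamic-programming table.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.
    // Two empty sets are treated as identical (distance 0), an assumption of this sketch.
    static double jaccardDistance(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        Set<String> x = new HashSet<>(Arrays.asList("a", "b", "c"));
        Set<String> y = new HashSet<>(Arrays.asList("b", "c", "d"));
        System.out.println(jaccardDistance(x, y)); // prints 0.5
    }
}
```

Neither measure yields a coordinate that can be averaged, which is why a clustering built on them cannot recompute a center arithmetically the way K-Means does.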

Implementing the Tool interface for MapReduce driver

Most people create their MapReduce job using driver code that is executed through its static main method. The downside of such an implementation is that most of your specific configuration (if any) is usually hardcoded. Should you need to modify some of your configuration properties on the fly (such as changing the number …