Test MapReduce using MRUnit

When you implement a new MapReduce job, it can be quite handy to test it locally before screwing up your production environment with unstable code. Although, I too like to live dangerously… You can still package your code and submit your jar file to a running test / dev environment (or even better, to spin … Continue reading Test MapReduce using MRUnit
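As a taste of what the full post covers, here is a minimal sketch of an MRUnit test, assuming a hypothetical WordCountMapper that emits (word, 1) for each token of its input line; the class names are illustrative, not taken from the post itself.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {

    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // Wire the mapper under test into an in-memory driver; no cluster needed
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void testMapper() throws Exception {
        // Feed one input record and assert the exact key/value pairs the mapper must emit
        mapDriver
            .withInput(new LongWritable(0), new Text("hadoop hadoop"))
            .withOutput(new Text("hadoop"), new IntWritable(1))
            .withOutput(new Text("hadoop"), new IntWritable(1))
            .runTest();
    }
}
```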


Add third-party libraries to MapReduce job

Anybody working with Hadoop has probably already faced the same common issue: how to add third-party libraries to a MapReduce job. Add libjars option The first solution, maybe the most common one, consists of adding libraries using the -libjars parameter on the CLI. To make it work, your class MyClass must use the GenericOptionsParser class. The easiest way is … Continue reading Add third-party libraries to MapReduce job
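A hedged sketch of what such a driver looks like: GenericOptionsParser strips the generic Hadoop options (-libjars, -D, -files, …) out of the argument list before your own arguments are read. The class and job names below are placeholders, not the post's actual code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.GenericOptionsParser;

public class MyClass {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Parses and applies generic options; the remaining args are application-specific
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = Job.getInstance(conf, "my-job");
        job.setJarByClass(MyClass.class);
        // ... set mapper / reducer here, and read input / output paths from otherArgs ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would then be submitted along the lines of: hadoop jar myjob.jar MyClass -libjars /path/to/dep1.jar,/path/to/dep2.jar /input /output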

Primitive Array Clustering

A Hadoop implementation of Canopy Clustering using the Levenshtein distance algorithm and other non-mathematical distance measures (such as the Jaccard coefficient). Difference with Mahout Needless to say, clustering algorithms (such as K-Means) use a mathematical approach in order to compute cluster centers. Each time a new point is added to a cluster, the Mahout framework recomputes the cluster's center as … Continue reading Primitive Array Clustering
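For context, below is an illustrative standalone Levenshtein distance in Java, the kind of edit-based measure the post contrasts with Mahout's purely numerical distances. It is a generic textbook implementation, not the post's actual clustering code.

```java
public class LevenshteinDistance {

    // Classic dynamic-programming edit distance: number of single-character
    // insertions, deletions, or substitutions needed to turn a into b
    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
```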

Implementing the Tool interface for MapReduce driver

Most people create their MapReduce job using driver code executed through its static main method. The downside of such an implementation is that most of your specific configuration (if any) is usually hardcoded. Should you need to modify some of your configuration properties on the fly (such as changing the number … Continue reading Implementing the Tool interface for MapReduce driver
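A minimal sketch of a driver implementing Tool, so configuration properties can be overridden at submit time with -D key=value instead of being hardcoded; class and job names are placeholders, not the post's code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains any -D / -libjars / -files options parsed by ToolRunner
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJobDriver.class);
        // ... set mapper / reducer / input / output here ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic options and hands the remaining args to run()
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}
```

With this in place, something like hadoop jar myjob.jar MyJobDriver -D mapreduce.job.reduces=5 /input /output changes the reducer count without touching the code.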

Custom RecordReader – Processing String / Pattern delimited records

Now that both InputFormat and RecordReader are familiar concepts for you (if not, you can still refer to the article Hadoop RecordReader and FileInputFormat), it is time to get to the heart of the subject. The default implementation of TextInputFormat is based on a line-by-line approach. Each line found in the data set will be supplied to MapReduce … Continue reading Custom RecordReader – Processing String / Pattern delimited records
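This is not the custom RecordReader the post goes on to build, but a related shortcut worth knowing as a baseline: the stock TextInputFormat can already split on an arbitrary string by setting textinputformat.record.delimiter, so each map() call receives one multi-line record. The driver below is an illustrative sketch under that assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class DelimitedRecordsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Records are now delimited by a blank line instead of a single newline
        conf.set("textinputformat.record.delimiter", "\n\n");

        Job job = Job.getInstance(conf, "pattern-delimited-records");
        job.setJarByClass(DelimitedRecordsDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        // ... set mapper / reducer / input and output paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A custom RecordReader, as covered in the full post, becomes necessary when a fixed string delimiter is not enough and records must instead be recognized by a pattern.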