Hadoop: Test MapReduce using MRUnit

When you implement a new MapReduce job, it can be quite handy to test it locally before breaking your production environment with unstable code… Although, I too like to live dangerously…


You can still package your code and submit the jar file to a running test / dev environment (or, even better, spin up a HadoopMiniCluster), but what if you could validate your mapper / reducer separately in a JUnit test? For that purpose, MRUnit is the perfect tool.

MRUnit

According to the Cloudera website, “MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.”
The Maven dependency is as follows:

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.0.0</version>
  <classifier>hadoop2</classifier>
  <scope>test</scope>
</dependency>

Pay attention to the classifier tag: I’m using version 2 of MapReduce. A common source of misunderstanding in Hadoop is the MapReduce version (basically 1 or 2) versus the framework used (MapReduce or YARN). That said, start with classifier = hadoop2, and should you encounter an error like “Expected an interface, got a class”, fall back to the hadoop1 classifier 🙂
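
For reference, the hadoop1 fallback only changes the classifier; everything else in the dependency stays the same:

<dependency>
  <groupId>org.apache.mrunit</groupId>
  <artifactId>mrunit</artifactId>
  <version>1.0.0</version>
  <classifier>hadoop1</classifier>
  <scope>test</scope>
</dependency>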

Implementation

In your JUnit class, get an instance of MapDriver to test the mapper class only, ReduceDriver to test the reducer class only, or MapReduceDriver for full end-to-end testing (a reducer-only sketch follows the main snippet below). Remember to use the classes from the mrunit.mapreduce package for the new API; the old-API drivers live directly in the mrunit package.
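
Picking the wrong driver is easy, because the class names are identical and only the import path differs. Note that the old-API driver does not actually have “mapred” in its package name:

// New API (org.apache.hadoop.mapreduce) driver:
import org.apache.hadoop.mrunit.mapreduce.MapDriver;

// Old API (org.apache.hadoop.mapred) driver:
import org.apache.hadoop.mrunit.MapDriver;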
Let’s jump into the code below:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

@RunWith(JUnit4.class)
public class WordCountTest {

    private Mapper<LongWritable, Text, Text, IntWritable> mapper = new WordCount.WordCountMapper();
    private Reducer<Text, IntWritable, Text, IntWritable> reducer = new WordCount.WordCountReducer();
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver = MapDriver.newMapDriver(mapper);
    private MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);

    @Test
    public void testMapOnly() throws IOException {
        mapDriver.getConfiguration().set("foo","bar"); // configuration values set here are visible to the mapper under test
        mapDriver.addAll(getInput());
        mapDriver.withAllOutput(getOutput("output/wordcount-output-mapper.txt"));
        mapDriver.withCounter("DATA","line", getInputLines());
        mapDriver.runTest();
    }

    @Test
    public void testMapReduce() throws IOException {
        mapReduceDriver.addAll(getInput());
        mapReduceDriver.withAllOutput(getOutput("output/wordcount-output.txt"));
        mapReduceDriver.withCounter("DATA","line", getInputLines());
        mapReduceDriver.withCounter("DATA","word", getOutputRecords());
        mapReduceDriver.runTest();
    }
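
For completeness, a reducer-only test is just as easy. Here is a sketch (a hypothetical test method, not part of the original class; it additionally requires org.apache.hadoop.mrunit.mapreduce.ReduceDriver and java.util.Arrays):

    @Test
    public void testReduceOnly() throws IOException {
        ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver = ReduceDriver.newReduceDriver(reducer);
        // The reducer receives a key together with all of its grouped values, as it would after the shuffle
        reduceDriver.withInput(new Text("hadoop"), Arrays.asList(new IntWritable(1), new IntWritable(1),
                new IntWritable(1), new IntWritable(1), new IntWritable(1)));
        reduceDriver.withOutput(new Text("hadoop"), new IntWritable(5));
        reduceDriver.runTest();
    }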

The “tricky” part is to properly declare the input / output key and value classes for each driver, as follows (using placeholder type names):

MapReduceDriver<MapperInputKey, MapperInputValue, MapperOutputKey, MapperOutputValue, ReduceOutputKey, ReduceOutputValue> driver;

As you can see in this code snippet, you can still play with some Hadoop features (such as configuration and counters).
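
The WordCount class itself isn’t listed in this post, but based on the expected outputs and the counters asserted above, a minimal sketch could look like this (a hypothetical reconstruction: the mapper lowercases and tokenizes each line and bumps the DATA:line counter, and the reducer sums the counts and bumps DATA:word):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // One increment per input line, matching the DATA:line counter assertion
            context.getCounter("DATA", "line").increment(1);
            for (String token : value.toString().toLowerCase().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // One increment per distinct word, matching the DATA:word counter assertion
            context.getCounter("DATA", "word").increment(1);
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}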
For reference, I’m using the external files below for this test.

wordcount-input.txt

Hadoop mapreduce JUnit testing
Hadoop JUnit testing
JUnit
Hadoop hdfs mapreduce
Hadoop
testing
Hadoop

wordcount-output-mapper.txt

hadoop,1
mapreduce,1
junit,1
testing,1
hadoop,1
junit,1
testing,1
junit,1
hadoop,1
hdfs,1
mapreduce,1
hadoop,1
testing,1
hadoop,1

wordcount-output.txt

hadoop,5
hdfs,1
junit,3
mapreduce,2
testing,3

One file is read to feed MapReduce, and another provides the expected output key / value pairs, using the static methods below.


    // Reads the input file and wraps each line as a (LongWritable, Text) pair for the mapper
    private static List<Pair<LongWritable, Text>> getInput(){
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("input/wordcount-input.txt"));
        List<Pair<LongWritable, Text>> input = new ArrayList<Pair<LongWritable, Text>>();
        while(scanner.hasNext()){
            String line = scanner.nextLine();
            input.add(new Pair<LongWritable, Text>(new LongWritable(0), new Text(line)));
        }
        scanner.close();
        return input;
    }

    // Counts the input lines, used to assert the DATA:line counter
    private static int getInputLines(){
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("input/wordcount-input.txt"));
        int lines = 0;
        while(scanner.hasNext()){
            scanner.nextLine();
            lines++;
        }
        scanner.close();
        return lines;
    }

    // Counts the expected output records, used to assert the DATA:word counter
    private static int getOutputRecords(){
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("output/wordcount-output.txt"));
        int records = 0;
        while(scanner.hasNext()){
            scanner.nextLine();
            records++;
        }
        scanner.close();
        return records;
    }

    // Parses an expected-output file of "word,count" lines into (Text, IntWritable) pairs
    private static List<Pair<Text,IntWritable>> getOutput(String fileName){
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream(fileName));
        List<Pair<Text,IntWritable>> output = new ArrayList<Pair<Text, IntWritable>>();
        while(scanner.hasNext()){
            String[] keyValue = scanner.nextLine().split(",");
            output.add(new Pair<Text, IntWritable>(new Text(keyValue[0]), new IntWritable(Integer.parseInt(keyValue[1]))));
        }
        scanner.close();
        return output;
    }

Even though you can supply your MapReduce code with a single key / value pair using the add() method, I strongly recommend supplying several of them using the addAll() method, as this ensures shuffling / partitioning works correctly. Note, however, that you must build these key / value pairs in the exact same order as you expect them in the MapReduce output.
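
For comparison, a single-pair test would use the fluent withInput() / withOutput() style. Here is a minimal sketch (a hypothetical test method, assuming the mapper lowercases its tokens as the expected outputs above suggest):

    @Test
    public void testSinglePair() throws IOException {
        // One input record, followed by its expected mapper outputs in order
        mapDriver.withInput(new LongWritable(0), new Text("Hadoop mapreduce"))
                 .withOutput(new Text("hadoop"), new IntWritable(1))
                 .withOutput(new Text("mapreduce"), new IntWritable(1))
                 .runTest();
    }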
Anyway, that’s enough blabla for such a simple tool. Now that you have a working recipe, there is simply no excuse for testing your code in production 🙂

Cheers!
Antoine


One thought on “Hadoop: Test MapReduce using MRUnit”

  1. This is great. I mistakenly used the MapDriver from the old API and it wasted two hours of my time until I came across this blog. Thanks for calling it out in your article.

     The problem is that the old MapDriver doesn’t have “mapred” in its package name like the other old-API classes, so it’s hard to know which one you are using 😦
