When you implement a new MapReduce job, it can be quite handy to test it locally before screwing up your production environment with unstable code. Although, I too like to live dangerously…
You can still package your code and submit the jar file to a running test/dev environment (or, even better, spin up a Hadoop MiniCluster), but what if you could validate your mapper and reducer separately in a plain JUnit test? For that purpose, MRUnit is a perfect tool.
MRUnit
According to the Cloudera website, “MRUnit helps bridge the gap between MapReduce programs and JUnit by providing a set of interfaces and test harnesses, which allow MapReduce programs to be more easily tested using standard tools and practices.”
The Maven dependency can be declared as follows:
<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.0.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>
Pay attention to the classifier tag: I’m using version 2 of MapReduce. A common source of confusion in Hadoop is the difference between the MapReduce version (basically 1 or 2) and the framework used (classic MapReduce or YARN). That said, start with classifier = hadoop2, and should you encounter an error along the lines of “Expected an interface, got a class”, fall back to the hadoop1 classifier 🙂
Implementation
In your JUnit class, get an instance of MapDriver to test the mapper class only, ReduceDriver for the reducer class only, or MapReduceDriver for full end-to-end testing. Remember to pick the driver matching your API: the drivers in the org.apache.hadoop.mrunit.mapreduce package target the new API (org.apache.hadoop.mapreduce), while the old-API (org.apache.hadoop.mapred) drivers sit directly under org.apache.hadoop.mrunit.
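In concrete terms, and to the best of my knowledge, the two flavours differ only by their package:

// New (org.apache.hadoop.mapreduce) API drivers, used throughout this post:
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;

// Old (org.apache.hadoop.mapred) API drivers live directly under org.apache.hadoop.mrunit:
// import org.apache.hadoop.mrunit.MapDriver;
// import org.apache.hadoop.mrunit.MapReduceDriver;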
Let’s jump into the code below:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.apache.hadoop.mrunit.types.Pair;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

@RunWith(JUnit4.class)
public class WordCountTest {

    private Mapper mapper = new WordCount.WordCountMapper();
    private Reducer reducer = new WordCount.WordCountReducer();
    private MapDriver mapDriver = MapDriver.newMapDriver(mapper);
    private MapReduceDriver mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);

    @Test
    public void testMapOnly() throws IOException {
        mapDriver.getConfiguration().set("foo", "bar");
        mapDriver.addAll(getInput());
        mapDriver.withAllOutput(getOutput("output/wordcount-output-mapper.txt"));
        mapDriver.withCounter("DATA", "line", getInputLines());
        mapDriver.runTest();
    }

    @Test
    public void testMapReduce() throws IOException {
        mapReduceDriver.addAll(getInput());
        mapReduceDriver.withAllOutput(getOutput("output/wordcount-output.txt"));
        mapReduceDriver.withCounter("DATA", "line", getInputLines());
        mapReduceDriver.withCounter("DATA", "word", getOutputRecords());
        mapReduceDriver.runTest();
    }
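Note that this test relies on a WordCount class, with nested WordCountMapper and WordCountReducer classes, which is not shown in this post. As a reference point only, here is a minimal sketch of what it could look like so that the assertions above pass (in particular the DATA/line and DATA/word counters and the lower-cased output); the actual implementation may differ:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCount {

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // One increment per input line, matching withCounter("DATA", "line", getInputLines())
            context.getCounter("DATA", "line").increment(1);
            for (String token : value.toString().toLowerCase().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            // One increment per emitted word, matching withCounter("DATA", "word", getOutputRecords())
            context.getCounter("DATA", "word").increment(1);
            context.write(key, new IntWritable(sum));
        }
    }
}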
The “tricky” part is to properly declare the input/output key and value classes for all drivers, as follows:
MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver;
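The type parameters run, in order, through the mapper input key/value, the mapper output (and reducer input) key/value, and the reducer output key/value. For this word count, the single-stage drivers would presumably be parameterized like this (based on the mapper/reducer signatures assumed in the sketch above):

// Mapper input key/value, then mapper output key/value
MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

// Reducer input key/value, then reducer output key/value
ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;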
As you can see in this code snippet, you can still play with regular Hadoop features (such as the configuration or counters).
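The listing above only covers a mapper-only test and a full end-to-end test. If you want to exercise the reducer in isolation, a ReduceDriver test could look like the sketch below (the grouped values and the expected count of 5 for “hadoop” come from the expected output file further down; the rest is my own illustration, not part of the original listing):

import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import java.util.Arrays;

// Additional field and test inside WordCountTest; assumes the reducer exposes the
// generic signature <Text, IntWritable, Text, IntWritable> as sketched above.
private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver =
        ReduceDriver.newReduceDriver(new WordCount.WordCountReducer());

@Test
public void testReduceOnly() throws IOException {
    // A reducer receives one key together with all of its grouped values
    reduceDriver.withInput(new Text("hadoop"),
            Arrays.asList(new IntWritable(1), new IntWritable(1), new IntWritable(1),
                          new IntWritable(1), new IntWritable(1)));
    reduceDriver.withOutput(new Text("hadoop"), new IntWritable(5));
    reduceDriver.runTest();
}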
For reference, I’m using the external files below for this test.
wordcount-input.txt
Hadoop mapreduce JUnit testing Hadoop JUnit testing JUnit Hadoop hdfs mapreduce Hadoop testing Hadoop
wordcount-output-mapper.txt
hadoop,1
mapreduce,1
junit,1
testing,1
hadoop,1
junit,1
testing,1
junit,1
hadoop,1
hdfs,1
mapreduce,1
hadoop,1
testing,1
hadoop,1
wordcount-output.txt
hadoop,5
hdfs,1
junit,3
mapreduce,2
testing,3
One file is read to feed the MapReduce job, and the others provide the expected output key/value pairs, using the static helper methods below.
    private static List getInput() {
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("input/wordcount-input.txt"));
        List input = new ArrayList();
        while (scanner.hasNext()) {
            String line = scanner.nextLine();
            input.add(new Pair(new LongWritable(0), new Text(line)));
        }
        return input;
    }

    private static int getInputLines() {
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("input/wordcount-input.txt"));
        int line = 0;
        while (scanner.hasNext()) {
            scanner.nextLine();
            line++;
        }
        return line;
    }

    private static int getOutputRecords() {
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream("output/wordcount-output.txt"));
        int record = 0;
        while (scanner.hasNext()) {
            scanner.nextLine();
            record++;
        }
        return record;
    }

    private static List getOutput(String fileName) {
        Scanner scanner = new Scanner(WordCountTest.class.getResourceAsStream(fileName));
        List output = new ArrayList();
        while (scanner.hasNext()) {
            String[] keyValue = scanner.nextLine().split(",");
            String word = keyValue[0];
            String count = keyValue[1];
            output.add(new Pair(new Text(word), new IntWritable(Integer.parseInt(count))));
        }
        return output;
    }
Even though you can feed your MapReduce code a single key/value pair with the add() method, I strongly recommend supplying several of them with the addAll() method: this ensures shuffling/partitioning is working properly. Note, however, that you then need to build these key/value pairs in the exact order in which you expect the MapReduce output.
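For comparison, the single-pair style can also be written fluently with withInput()/withOutput(); a quick sketch (the input sentence is arbitrary, and it assumes the mapper lowercases its input, as in the expected output files above):

// Single-pair style: fine for a quick smoke test, but it will not stress
// shuffling/partitioning the way a multi-record input does.
mapDriver.withInput(new LongWritable(0), new Text("Hadoop mapreduce"))
         .withOutput(new Text("hadoop"), new IntWritable(1))
         .withOutput(new Text("mapreduce"), new IntWritable(1))
         .runTest();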
Anyway, that’s enough talk for such a simple tool. Now that you have a working recipe, simply do not test your code in production 🙂
Cheers!
Antoine
This is great! I mistakenly used the MapDriver from the old API and it cost me two hours until I came across this blog. Thanks for calling it out in your article.
The problem is that the old MapDriver doesn’t have “mapred” in its package name the way the Hadoop APIs do, so it’s hard to know which one you are using 😦