Most people create their MapReduce job using a driver class that is executed through its static main method. The downside of such an implementation is that most of your specific configuration (if any) is usually hardcoded. Should you need to modify some of your configuration properties on the fly (such as changing the number of reducers), you would have to modify your code, rebuild your jar file and redeploy your application. This can be avoided by implementing the Tool interface in your MapReduce driver code.
Hadoop Configuration
By implementing the Tool interface and extending the Configured class, you can easily set your Hadoop Configuration object via the GenericOptionsParser, and thus through the command-line interface. This makes your code significantly more portable (and slightly cleaner too), as you no longer need to hardcode any specific configuration.
Let’s take a couple of examples, with and without the use of the Tool interface.
Without Tool interface
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ToolMapReduce {

    public static void main(String[] args) throws Exception {

        // Create configuration
        Configuration conf = new Configuration();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Set only 1 reduce task
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
Your MapReduce job will be executed as follows. Only 2 arguments are expected here, inputPath and outputPath, located at index [0] and [1] respectively of your main method's String array.
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce /input/path /output/path
In that case, the number of reducers (1) is hardcoded in the driver (the setNumReduceTasks(1) call) and therefore cannot be modified on demand.
With Tool interface
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {

        // When implementing Tool, use the Configuration prepared by ToolRunner
        Configuration conf = this.getConf();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        // Do not specify the number of reducers
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job and return status
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
ToolRunner executes your MapReduce job through its static run method.
In this example we no longer need to hardcode the number of reducers, as it can be specified directly from the CLI (using the “-D” option).
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce -D mapred.reduce.tasks=1 /input/path /output/path
Note that you still have to supply the inputPath and outputPath arguments. GenericOptionsParser separates the generic Tool options from the actual job arguments, so whatever the number of generic options you supply, the inputPath and outputPath values will still be located at index [0] and [1], but of your run method's String array, not your main method's.
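To see this separation in action, here is a minimal sketch (ArgsDump is a hypothetical class name, not part of the original example) that simply prints whatever arguments are left once the generic options have been consumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ArgsDump extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // main still receives ALL arguments, generic options included
        System.exit(ToolRunner.run(new Configuration(), new ArgsDump(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        // GenericOptionsParser has already consumed -D, -fs, -jt, etc.
        // Only the job-specific arguments remain here
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
        return 0;
    }
}

Invoked with something like “hadoop jar myjar.jar ArgsDump -D foo=bar /input/path /output/path”, this would print only /input/path and /output/path.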
This -D option can be used for any “official” or custom property value.
conf.set("my.dummy.configuration","foobar");
now becomes…
-D my.dummy.configuration=foobar
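On the map side, a property set via -D can be read back from the job's Configuration. Here is a minimal sketch (MyMapper and the default value are illustrative, not from the original post); note that this only works if the job was launched through ToolRunner, so that GenericOptionsParser actually injected the -D options:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private String dummy;

    @Override
    protected void setup(Context context) {
        // Value supplied via "-D my.dummy.configuration=foobar",
        // falling back to a default when the property is not set
        dummy = context.getConfiguration().get("my.dummy.configuration", "default");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass-through map; use the configured value as needed
        context.write(key, value);
    }
}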
HDFS and JobTracker properties
When I need to submit a jar file to a remote Hadoop server, I have to specify the properties below in my driver code.
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "myserver.com:8021");
conf.set("fs.default.name", "hdfs://myserver.com:8020");
Using the Tool interface, this now works out of the box, as you can supply both the -fs and -jt options from the CLI.
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021
Thanks to this Tool implementation, my jar file is now 100% portable and can be executed either locally or remotely without having to hardcode any specific value.
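For instance (a sketch, assuming a classic MRv1 setup), the very same jar can target the local machine or the remote cluster just by switching the generic options:

# Local run: local filesystem + local job runner
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs file:/// -jt local /input/path /output/path

# Remote run: distant HDFS namenode + jobtracker
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021 /input/path /output/path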
Generic options supported
Some additional useful options can be supplied from the CLI.
-conf      specify an application configuration file
-D         use value for given property
-fs        specify a namenode
-jt        specify a job tracker
-files     specify comma-separated files to be copied to the MapReduce cluster
-libjars   specify comma-separated jar files to include in the classpath
-archives  specify comma-separated archives to be unarchived on the compute machines
Should you need to add a specific library, archive file, etc., this Tool interface might be quite useful, as shown in the sketch below.
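For example, files shipped with the -files option are placed in the distributed cache and made available in each task's working directory under their base name, so a mapper can open them like local files. A minimal sketch (lookup.txt and LookupMapper are made-up names for illustration):

hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -files /local/path/to/lookup.txt /input/path /output/path

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        // "lookup.txt" was shipped with the -files generic option and is
        // available in the task's working directory under its base name
        BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // load lookup data into memory here
            }
        } finally {
            reader.close();
        }
    }
}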
As you can see, it is well worth implementing this Tool interface in your driver code, as it brings added value without any additional complexity.
Cheers!
Nice post, but I am using the ToolRunner and I still get the warning
WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
As a result, when I pass the parameter -D case-sensitive=false, it doesn't work and takes the default value set in the mapper.
Can you please help?
Are you following all the above steps? Check the code on my GitHub account.
https://github.com/aamend/hadoop-mapreduce/blob/master/sandbox/src/main/java/com/aamend/hadoop/mapreduce/sandbox/ToolImplementation.java
It works on my side
hadoop jar sandbox-1.0-SNAPSHOT.jar com.aamend.hadoop.mapreduce.sandbox.ToolImplementation tmp/input tmp/output
===> “foo equals DEFAULT”
hadoop jar sandbox-1.0-SNAPSHOT.jar com.aamend.hadoop.mapreduce.sandbox.ToolImplementation -D foo=bar tmp/input tmp/output
===> “foo equals bar”
Good explanation, thanks :)
Excellent explanation. Please keep on posting about Hadoop.
I am really impressed with your way of blogging about technology.
Thanks once again.
It helped me easily understand the Tool interface.
Thanks 🙂
Superb, clear-cut explanation that my Udemy instructor couldn't give… thanks!
Thanks a lot for the wonderful explanation.
Simple and neat.
Simple and easily understandable.
How do I use the -files option? I am trying to use it to implement the distributed cache. Please provide the steps.