Most people create their MapReduce job using a driver class that is executed through its static main method. The downside of such an implementation is that most of your specific configuration (if any) is usually hardcoded. Should you need to modify some of your configuration properties on the fly (such as changing the number of reducers), you would have to modify your code, rebuild your jar file and redeploy your application. This can be avoided by implementing the Tool interface in your MapReduce driver code.
Hadoop Configuration
By implementing the Tool interface and extending the Configured class, you can easily set your Hadoop Configuration object via the GenericOptionsParser, and thus through the command-line interface. This makes your code significantly more portable (and slightly cleaner too), as you no longer need to hardcode any specific configuration.
Let’s take a couple of examples, with and without the use of the Tool interface.
Without Tool interface
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class ToolMapReduce {

    public static void main(String[] args) throws Exception {

        // Create configuration
        Configuration conf = new Configuration();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Set only 1 reduce task
        job.setNumReduceTasks(1);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job
        int code = job.waitForCompletion(true) ? 0 : 1;
        System.exit(code);
    }
}
Your MapReduce job will be executed as follows. Only 2 arguments are expected here, inputPath and outputPath, located at index [0] and [1] respectively of your main method's String array.
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce /input/path /output/path
In that case, the number of reducers (1) is hardcoded in the driver (the setNumReduceTasks(1) call) and therefore cannot be modified on demand.
With Tool interface
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolMapReduce extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new ToolMapReduce(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {

        // When implementing Tool, use the Configuration prepared by ToolRunner
        Configuration conf = this.getConf();

        // Create job
        Job job = new Job(conf, "Tool Job");
        job.setJarByClass(ToolMapReduce.class);

        // Setup MapReduce job
        // Do not specify the number of reducers
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // Specify key / value
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Input
        FileInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // Output
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);

        // Execute job and return status
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
ToolRunner executes your MapReduce job through its static run method.
In this example we no longer need to hardcode the number of reducers, as it can be specified directly from the CLI (using the “-D” option).
hadoop jar /path/to/My/jar.jar com.wordpress.hadoopi.ToolMapReduce -D mapred.reduce.tasks=1 /input/path /output/path
Note that you still have to supply the inputPath and outputPath arguments. GenericOptionsParser separates the generic Tool options from the actual job arguments, so whatever the number of generic options you supply, the inputPath and outputPath values will still be located at index [0] and [1], but of your run method's String array, not your main method's.
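To see this separation in action, here is a minimal sketch (ArgsDump is a hypothetical class name, not part of the original example) that simply prints whatever arguments are left once the generic options have been consumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ArgsDump extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        // main still receives ALL arguments, generic options included
        System.exit(ToolRunner.run(new Configuration(), new ArgsDump(), args));
    }

    @Override
    public int run(String[] args) throws Exception {
        // GenericOptionsParser has already consumed -D, -fs, -jt, etc.
        // Only the job-specific arguments remain here
        for (int i = 0; i < args.length; i++) {
            System.out.println("args[" + i + "] = " + args[i]);
        }
        return 0;
    }
}

Invoked with something like “hadoop jar myjar.jar ArgsDump -D foo=bar /input/path /output/path”, this would print only /input/path and /output/path.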
This -D option can be used for any “official” or custom property value.
conf.set("my.dummy.configuration","foobar");
now becomes…
-D my.dummy.configuration=foobar
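On the map side, a property set via -D can be read back from the job's Configuration. Here is a minimal sketch (MyMapper and the default value are illustrative, not from the original post); note that this only works if the job was launched through ToolRunner, so that GenericOptionsParser actually injected the -D options:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private String dummy;

    @Override
    protected void setup(Context context) {
        // Value supplied via "-D my.dummy.configuration=foobar",
        // falling back to a default when the property is not set
        dummy = context.getConfiguration().get("my.dummy.configuration", "default");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Pass-through map; use the configured value as needed
        context.write(key, value);
    }
}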
HDFS and JobTracker properties
When I need to submit a jar file to a remote Hadoop server, I have to specify the properties below in my driver code.
Configuration conf = new Configuration();
conf.set("mapred.job.tracker", "myserver.com:8021");
conf.set("fs.default.name", "hdfs://myserver.com:8020");
Using the Tool interface, this now works out of the box, as you can supply both the -fs and -jt options from the CLI.
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021
Thanks to this Tool implementation, my jar file is now 100% portable and can be executed either locally or remotely without having to hardcode any specific value.
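For instance (a sketch, assuming a classic MRv1 setup), the very same jar can target the local machine or the remote cluster just by switching the generic options:

# Local run: local filesystem + local job runner
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs file:/// -jt local /input/path /output/path

# Remote run: distant HDFS namenode + jobtracker
hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -fs hdfs://myserver.com:8020 -jt myserver.com:8021 /input/path /output/path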
Generic options supported
Some additional useful options can be supplied from the CLI.
-conf      specify an application configuration file
-D         use value for given property
-fs        specify a namenode
-jt        specify a job tracker
-files     specify comma-separated files to be copied to the MapReduce cluster
-libjars   specify comma-separated jar files to include in the classpath
-archives  specify comma-separated archives to be unarchived on the compute machines
Should you need to add a specific library, archive file, etc., this Tool interface might be quite useful, as shown in the sketch below.
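For example, files shipped with the -files option are placed in the distributed cache and made available in each task's working directory under their base name, so a mapper can open them like local files. A minimal sketch (lookup.txt and LookupMapper are made-up names for illustration):

hadoop jar myjar.jar com.wordpress.hadoopi.ToolMapReduce -files /local/path/to/lookup.txt /input/path /output/path

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void setup(Context context) throws IOException {
        // "lookup.txt" was shipped with the -files generic option and is
        // available in the task's working directory under its base name
        BufferedReader reader = new BufferedReader(new FileReader("lookup.txt"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // load lookup data into memory here
            }
        } finally {
            reader.close();
        }
    }
}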
As you can see, it is well worth implementing this Tool interface in your driver code, as it brings added value without any additional complexity.
Cheers!
Nice post, but I am using the ToolRunner and I still get the warning
WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
As a result, when I pass the parameter -D case-sensitive=false, it doesn't work and takes the default value set in the mapper.
Can you please help?
Are you following all the above steps? Check the code on my GitHub account.
https://github.com/aamend/hadoop-mapreduce/blob/master/sandbox/src/main/java/com/aamend/hadoop/mapreduce/sandbox/ToolImplementation.java
It works on my side
hadoop jar sandbox-1.0-SNAPSHOT.jar com.aamend.hadoop.mapreduce.sandbox.ToolImplementation tmp/input tmp/output
===> “foo equals DEFAULT”
hadoop jar sandbox-1.0-SNAPSHOT.jar com.aamend.hadoop.mapreduce.sandbox.ToolImplementation -D foo=bar tmp/input tmp/output
===> “foo equals bar”
Good explanation, thanks :)
Excellent explanation. Please keep on posting about Hadoop.
I am really impressed with your way of blogging about technology.
Thanks once again.
It helped me easily understand the Tool interface.
Thanks 🙂
Superb, clear-cut explanation that my Udemy instructor couldn't give… thanks!
Thanks a lot for the wonderful explanation.
Simple and neat.
Simple and easily understandable.
How do I use the -files option? I am trying to use it to implement the distributed cache. Please provide the steps.