Add third-party libraries to MapReduce job

Anybody working with Hadoop has probably run into the same common issue: how to add third-party libraries to a MapReduce job.

Add libjars option

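The first solution, and perhaps the most common one, consists of adding libraries with the -libjars parameter on the CLI. For this to work, your driver class (MyClass below) must parse its arguments with the GenericOptionsParser class. The easiest way is to implement the Hadoop Tool interface, as described in the post Hadoop: Implementing the Tool interface for MapReduce driver. Note that -libjars takes a comma-separated list of jars and must come before your own application arguments.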

$ export LIBJARS=/path/jar1,/path/jar2
$ hadoop jar /path/to/my.jar com.wordpress.hadoopi.MyClass -libjars ${LIBJARS} value
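
When the dependencies sit together in one local directory, the comma-separated value can be generated rather than typed by hand. The following is a minimal sketch using only the JDK; the class name LibJars and its two helper methods are hypothetical names chosen for illustration, and the directory passed on the command line is an assumption about how your jars are laid out.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: builds the value to pass to -libjars.
final class LibJars {

    // Joins jar paths with commas and no spaces, the form -libjars expects.
    static String join(List<Path> jars) {
        return jars.stream()
                .map(Path::toString)
                .collect(Collectors.joining(","));
    }

    // Lists the *.jar files directly under dir, sorted for a stable result.
    static List<Path> jarsIn(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(p -> p.getFileName().toString().endsWith(".jar"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: local directory holding the third-party jars (assumption)
        System.out.println(join(jarsIn(Paths.get(args[0]))));
    }
}
```

The printed string can then be exported as LIBJARS exactly as in the commands above.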

This obviously works only when the job is launched from the CLI, so how can we add such external jar files when not using the CLI?

Add jar files to Hadoop classpath

You could certainly upload the external jar files to each tasktracker and update HADOOP_CLASSPATH accordingly, but are you really willing to bother the Ops team each time you need to add a new jar? This works well on a single-node setup, but are you going to upload each jar to all 10, 100 or even more Hadoop nodes? This approach does not scale at all!
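
If you do go down this route, it helps to verify on a given node that a class from the new jar is actually visible. The probe below is only a sketch using plain JDK calls; ClasspathCheck and isVisible are hypothetical names, and it assumes you run it with the same classpath that your Hadoop processes use.

```java
// Hypothetical probe: reports whether a class can be found on the current classpath.
final class ClasspathCheck {

    // Returns true if className can be loaded (without initializing it).
    static boolean isVisible(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not on the classpath, or present but unusable (e.g. missing dependency)
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is a fully qualified class name to look for
        for (String name : args) {
            System.out.println(name + " visible: " + isVisible(name));
        }
    }
}
```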

Create a fat jar

Another approach is to create a fat jar, which is a JAR that contains your classes as well as your third-party classes (see this Cloudera blog post for more details). Be aware that this JAR will contain not only your classes but also all your project dependencies (including the Hadoop libraries) unless you explicitly exclude them (by giving them the provided scope).
Here is an example of the Maven plugin you will need to set up:


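    <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
            <descriptorRefs>
                <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
        </configuration>
        <executions>
            <execution>
                <id>make-assembly</id>
                <phase>package</phase>
                <goals>
                    <goal>single</goal>
                </goals>
            </execution>
        </executions>
    </plugin>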

After running “mvn clean package”, your fat JAR will be located in the Maven project’s target directory, as follows:

drwxr-xr-x  2 antoine  staff        68 Jun 10 09:30 archive-tmp
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 classes
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-sources
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 generated-test-sources
drwxr-xr-x  3 antoine  staff       102 Jun 10 09:29 maven-archiver
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 myproject-1.0-SNAPSHOT
-rw-r--r--  1 antoine  staff  63880020 Jun 10 09:30 myproject-1.0-SNAPSHOT-jar-with-dependencies.jar
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 surefire-reports
drwxr-xr-x  4 antoine  staff       136 Jun 10 09:29 test-classes

In the example above, note the size of the JAR file (about 61MB). Quite fat, isn’t it?
You can check that all dependencies have been included by running the command below:

$ jar -tf myproject-1.0-SNAPSHOT-jar-with-dependencies.jar

META-INF/
META-INF/MANIFEST.MF
com/aamend/hadoop/allMyClasses.class
...
com/others/allMyDependencies.class
...
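
The same check can be scripted instead of scanning the listing by eye. Below is a minimal sketch built on the JDK's java.util.jar.JarFile; FatJarCheck and its methods are hypothetical names, and the jar path and class name are whatever you pass on the command line.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks whether a fat jar contains a given class.
final class FatJarCheck {

    // Converts a class name to its jar entry name: com.foo.Bar -> com/foo/Bar.class
    static String entryNameFor(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar at jarPath holds an entry for className.
    static boolean containsClass(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryNameFor(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the fat jar; args[1]: fully qualified class name
        System.out.println(containsClass(args[0], args[1]));
    }
}
```

Running it against one class from each third-party library is usually enough to confirm that the assembly picked the dependency up.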

Use Distributed cache

I always follow this approach when using third-party libraries in my MapReduce jobs. One could say it is not elegant, but it lets me work without bothering anyone from the Ops team :). I first create a “lib” directory in my HDFS home directory (“/user/hadoopi/”). You could even use “/tmp”; it does not matter. I then create a static method that:

Locates the jar file that includes the class I need
Uploads this jar to HDFS
Adds the uploaded jar file to the Hadoop distributed cache

Simply add the following method to some Utils class.

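    // Imports required at the top of the Utils class file
    import java.io.File;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    private static void addJarToDistributedCache(
            Class<?> classToAdd, Configuration conf)
        throws IOException {

        // Retrieve the local jar file that contains classToAdd
        String jar = classToAdd.getProtectionDomain().
                getCodeSource().getLocation().
                getPath();
        File jarFile = new File(jar);

        // Declare the target location on HDFS
        Path hdfsJar = new Path("/user/hadoopi/lib/"
                + jarFile.getName());

        // Get a handle on the file system defined in conf (HDFS)
        FileSystem hdfs = FileSystem.get(conf);

        // Copy the jar to HDFS, keeping the local file (delSrc = false)
        // and overwriting any previous copy (overwrite = true)
        hdfs.copyFromLocalFile(false, true,
            new Path(jar), hdfsJar);

        // Add the uploaded jar to the distributed cache classpath
        DistributedCache.addFileToClassPath(hdfsJar, conf);
    }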
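
The jar lookup is the most fragile step: classes loaded by the JVM itself report no code source at all, and URL.getPath() leaves escapes such as %20 in place when the path contains spaces. The sketch below shows one defensive way to write that step with plain JDK calls; JarLocator and locate are hypothetical names, not part of Hadoop or of the method above.

```java
import java.io.File;
import java.net.URISyntaxException;
import java.security.CodeSource;
import java.util.Optional;

// Hypothetical helper: finds the jar (or class directory) a class was loaded from.
final class JarLocator {

    // Returns the location of cls, or empty when the JVM reports no usable
    // code source (for example for JDK core classes such as String).
    static Optional<File> locate(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return Optional.empty();
        }
        try {
            // toURI() decodes escapes such as %20 that URL.getPath() keeps
            return Optional.of(new File(source.getLocation().toURI()));
        } catch (URISyntaxException | IllegalArgumentException e) {
            // Location is not a well-formed local file: URI
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("JarLocator: " + locate(JarLocator.class));
        System.out.println("String:     " + locate(String.class));
    }
}
```

Returning an Optional lets the caller decide whether a missing location should fail the job or simply be skipped.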

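The only thing you need to remember is to call this method before creating the Job: the Job takes its own copy of the Configuration, so jars added to conf afterwards are not picked up.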


    public static void main(String[] args) throws Exception {

        // Create Hadoop configuration
        Configuration conf = new Configuration();

        // Add 3rd-party libraries
        addJarToDistributedCache(MyFirstClass.class, conf);
        addJarToDistributedCache(MySecondClass.class, conf);

        // Create my job
        Job job = new Job(conf, "Hadoop-classpath");
        .../...
    }

That's it: your MapReduce job is now able to use any external JAR file.

Cheers!
Antoine

5 thoughts on “Add third-party libraries to MapReduce job”

chrisrian weaves

1 August 2015 at 0 h 42 min

If you are using ToolRunner, you can add your jars via the command line using the -files flag, for example: hadoop jar mydriver.jar MyDriver -files file1, file2 in out

humoyunUz

15 September 2015 at 11 h 08 min

In Hadoop 2, DistributedCache is deprecated

- Tomas
  
  11 November 2015 at 14 h 20 min
  
  No, the old API of DistributedCache is deprecated. See here: http://stackoverflow.com/questions/21239722/hadoop-distributedcache-is-deprecated-what-is-the-preferred-api
  
bigdata

6 November 2015 at 20 h 16 min

I add my third party jars in /usr/lib/hadoop-mapreduce in datanode
