Hadoop: Set up a Maven project for MapReduce in 5 minutes

I am sure I am not the only one who has struggled with the Hadoop Eclipse plugin installation. This plugin depends strongly on your environment (Eclipse, Ant, JDK) and on your Hadoop distribution and version. Moreover, it only provides the old API for MapReduce.
Creating a Maven project for Hadoop is so simple that wasting time trying to build this plugin is pointless. In this article I describe how to set up a first Maven Hadoop project for Cloudera CDH4 in Eclipse.

Prerequisite

Maven 3
JDK 1.6
Eclipse with the m2eclipse plugin installed

Add Cloudera repository

Cloudera jar files are not available in the default Maven Central repository. You need to explicitly add the Cloudera repository to your settings.xml (under ${HOME}/.m2/settings.xml).

<?xml version="1.0" encoding="UTF-8"?>
<settings>
    <profiles>
        <profile>
            <id>standard-extra-repos</id>
            <activation>
                <activeByDefault>true</activeByDefault>
            </activation>
            <repositories>
                <repository>
                    <!-- Central Repository -->
                    <id>central</id>
                    <url>http://repo1.maven.org/maven2/</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
                <repository>
                    <!-- Cloudera Repository -->
                    <id>cloudera</id>
                    <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
                    <releases>
                        <enabled>true</enabled>
                    </releases>
                    <snapshots>
                        <enabled>true</enabled>
                    </snapshots>
                </repository>
            </repositories>
        </profile>
    </profiles>
</settings>

Create Maven project

In Eclipse, create a new Maven project as follows.

[Screenshots: New Maven Project wizard steps]

Add Hadoop Nature

For the Cloudera distribution CDH4, open the pom.xml file and add the following dependencies:


	<dependencyManagement>
		<dependencies>
			<dependency>
				<groupId>jdk.tools</groupId>
				<artifactId>jdk.tools</artifactId>
				<version>1.6</version>
			</dependency>
			<dependency>
				<groupId>org.apache.hadoop</groupId>
				<artifactId>hadoop-hdfs</artifactId>
				<version>2.0.0-cdh4.0.0</version>
			</dependency>
			<dependency>
				<groupId>org.apache.hadoop</groupId>
				<artifactId>hadoop-auth</artifactId>
				<version>2.0.0-cdh4.0.0</version>
			</dependency>
			<dependency>
				<groupId>org.apache.hadoop</groupId>
				<artifactId>hadoop-common</artifactId>
				<version>2.0.0-cdh4.0.0</version>
			</dependency>
			<dependency>
				<groupId>org.apache.hadoop</groupId>
				<artifactId>hadoop-core</artifactId>
				<version>2.0.0-mr1-cdh4.0.1</version>
			</dependency>
			<dependency>
				<groupId>junit</groupId>
				<artifactId>junit-dep</artifactId>
				<version>4.8.2</version>
			</dependency>
		</dependencies>
	</dependencyManagement>
	<dependencies>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-hdfs</artifactId>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-auth</artifactId>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-common</artifactId>
		</dependency>
		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-core</artifactId>
		</dependency>
		<dependency>
			<groupId>junit</groupId>
			<artifactId>junit</artifactId>
			<version>4.10</version>
			<scope>test</scope>
		</dependency>
	</dependencies>
	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>2.1</version>
				<configuration>
					<source>1.6</source>
					<target>1.6</target>
				</configuration>
			</plugin>
		</plugins>
	</build>
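
For reference, the fragments above sit inside a regular pom.xml; a minimal skeleton could look like the sketch below (the groupId, artifactId and version shown are placeholders matching the example project, adjust to yours):

```xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>com.aamend.hadoop</groupId>
	<artifactId>MapReduce</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<packaging>jar</packaging>

	<!-- dependencyManagement, dependencies and build sections from above go here -->

</project>
```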

Download dependencies

Now that you have added the Cloudera repository and created your project, download the dependencies. This is easily done by right-clicking on your Eclipse project and selecting “Update Maven dependencies”.
All these dependencies should now be present in your local .m2 repository.

[developer@localhost ~]$ find .m2/repository/org/apache/hadoop -name "*.jar" 
.m2/repository/org/apache/hadoop/hadoop-tools/1.0.4/hadoop-tools-1.0.4.jar
.m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0-sources.jar
.m2/repository/org/apache/hadoop/hadoop-common/2.0.0-cdh4.0.0/hadoop-common-2.0.0-cdh4.0.0.jar
.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1-sources.jar
.m2/repository/org/apache/hadoop/hadoop-core/2.0.0-mr1-cdh4.0.1/hadoop-core-2.0.0-mr1-cdh4.0.1.jar
.m2/repository/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.0.0/hadoop-hdfs-2.0.0-cdh4.0.0.jar
.m2/repository/org/apache/hadoop/hadoop-streaming/1.0.4/hadoop-streaming-1.0.4.jar
.m2/repository/org/apache/hadoop/hadoop-auth/2.0.0-cdh4.0.0/hadoop-auth-2.0.0-cdh4.0.0.jar
[developer@localhost ~]$ 

Create WordCount example

Create your driver code

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

	public static void main(String[] args) throws IOException,
			InterruptedException, ClassNotFoundException {

		Path inputPath = new Path(args[0]);
		Path outputDir = new Path(args[1]);

		// Create configuration
		Configuration conf = new Configuration(true);

		// Create job
		Job job = new Job(conf, "WordCount");
		job.setJarByClass(WordCountMapper.class);

		// Setup MapReduce
		job.setMapperClass(WordCountMapper.class);
		job.setReducerClass(WordCountReducer.class);
		job.setNumReduceTasks(1);

		// Specify key / value
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);

		// Input
		FileInputFormat.addInputPath(job, inputPath);
		job.setInputFormatClass(TextInputFormat.class);

		// Output
		FileOutputFormat.setOutputPath(job, outputDir);
		job.setOutputFormatClass(TextOutputFormat.class);

		// Delete output if exists
		FileSystem hdfs = FileSystem.get(conf);
		if (hdfs.exists(outputDir))
			hdfs.delete(outputDir, true);

		// Execute job
		int code = job.waitForCompletion(true) ? 0 : 1;
		System.exit(code);

	}

}

Create Mapper class

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends
		Mapper<Object, Text, Text, IntWritable> {

	private final IntWritable ONE = new IntWritable(1);
	private Text word = new Text();

	public void map(Object key, Text value, Context context)
			throws IOException, InterruptedException {

		String[] csv = value.toString().split(",");
		for (String str : csv) {
			word.set(str);
			context.write(word, ONE);
		}
	}
}
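
Note that this mapper splits each input line on commas, not whitespace. Its tokenization can be sanity-checked with plain Java, without any cluster; a small sketch (the SplitCheck class name is just for illustration):

```java
public class SplitCheck {

    // Mirrors the mapper's tokenization: one token per comma-separated field
    public static String[] tokens(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        // The mapper would emit each of these tokens with a count of 1
        for (String token : tokens("hadoop,maven,hadoop")) {
            System.out.println(token + "\t1");
        }
    }
}
```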

Create your Reducer class

package com.aamend.hadoop.MapReduce;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends
		Reducer<Text, IntWritable, Text, IntWritable> {

	public void reduce(Text text, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable value : values) {
			sum += value.get();
		}
		context.write(text, new IntWritable(sum));
	}
}
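
Put together, the job simply counts comma-separated tokens: the mapper emits (token, 1) pairs, and the reducer sums them per key. The end-to-end logic can be simulated locally in plain Java (the LocalWordCount helper below is hypothetical, JDK 1.6 compatible, and meant for sanity checks only):

```java
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {

    // Simulates map (split each line on commas) then reduce (sum per key)
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String line : lines) {
            for (String token : line.split(",")) {
                Integer previous = counts.get(token);
                counts.put(token, previous == null ? 1 : previous + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Each key ends up with the same sum the reducer would write
        System.out.println(count(new String[] { "hadoop,maven", "hadoop" }));
    }
}
```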

Build project

Building the jar file works out of the box with Maven. Execute the following command:

mvn clean install

You should see output similar to the following:

.../...

[INFO] 
[INFO] --- maven-jar-plugin:2.3.2:jar (default-jar) @ MapReduce ---
[INFO] Building jar: /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar
[INFO] 
[INFO] --- maven-install-plugin:2.3.1:install (default-install) @ MapReduce ---
[INFO] Installing /home/developer/Workspace/hadoop/MapReduce/target/MapReduce-0.0.1-SNAPSHOT.jar to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.jar
[INFO] Installing /home/developer/Workspace/hadoop/MapReduce/pom.xml to /home/developer/.m2/repository/com/aamend/hadoop/MapReduce/0.0.1-SNAPSHOT/MapReduce-0.0.1-SNAPSHOT.pom
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.159s
[INFO] Finished at: Sat May 25 00:35:56 GMT+02:00 2013
[INFO] Final Memory: 16M/212M
[INFO] ------------------------------------------------------------------------

And your jar file should be available in the project’s target directory (and additionally in your ${HOME}/.m2 local repository).

[Screenshot: generated jar in the target directory]

This jar is ready to be executed in your Hadoop environment:

hadoop jar MapReduce-0.0.1-SNAPSHOT.jar com.aamend.hadoop.MapReduce.WordCount input output
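
As an illustration, assuming a comma-separated input file has already been uploaded to HDFS (the file names below are hypothetical), the job writes tab-separated counts, one token per line, with keys in sorted order:

```
$ hadoop fs -cat input/sample.txt
hadoop,maven,eclipse
hadoop,cloudera

$ hadoop fs -cat output/part-r-00000
cloudera	1
eclipse	1
hadoop	2
maven	1
```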

Each time I need to create a new Hadoop project, I simply copy the pom.xml template described above, and that’s it.


30 thoughts on “Hadoop: Setup Maven project for MapReduce in 5mn”

  1. Pingback: sidcode's Blog | running the hadoop maven wordcount example

  2. Pingback: sidcode's Blog | hive thrift java example

  3. Hi Antoine,

    I am following the below article for ‘hadoop-1.0.3.tar.gz’ installation:
    http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

    So, will the Cloudera CDH4 jar (from your article) clash with hadoop-1.0.3 while executing Hadoop?

    I mean, can I directly run ‘MapReduce-0.0.1-SNAPSHOT.jar’ against my installation without any hiccup, given that the two distributions are different (CDH4 jar vs hadoop-1.0.3 jar)?

    Thanks
    Amp

    • The article you were following is a really good example for getting started with Hadoop. Unfortunately, this “cloudera project” will not work against your Apache distribution. That said, you need to update the pom.xml with the right Apache dependencies:

      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-core</artifactId>
          <version>1.2.1</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>1.2.1</version>
      </dependency>
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>0.23.9</version>
      </dependency>

  4. Hi Antoine,

    It is a nice blog. It was very useful and I was able to be ready with a simple application within a short span of time. Thank you.

    But I have problems setting up my Hadoop environment. I am using Windows XP, and I downloaded Hadoop 2.2.0. To check whether my Hadoop environment is ready I ran “hadoop [–config confdir] version” in my command prompt, but it gives me an error saying JAVA_HOME is incorrectly set. My Java is in “C:\Program Files\Java\jdk1.7.0\bin”. When I searched about this problem on the web I came to know that I need to set JAVA_HOME in the hadoop-env.sh file. I tried setting that path to “C:\Program Files\Java\jdk1.7.0\bin”, but I am getting the same problem.

    Your help will be much appreciated and it will be very useful for the beginners in hadoop.

    Thanks in advance,
    Priya.

  5. Hey,

    I followed your tutorial carefully. Running under Windows 8 with Cygwin installed and added to PATH, yet I receive the following errors:

    Missing artifact org.apache.hadoop:hadoop-core:jar:2.0.0-mr1-cdh4.0.1 pom.xml /MapReduce
    The container ‘Maven Dependencies’ references non existing library ‘C:\Users\shay\.m2\repository\org\apache\hadoop\hadoop-hdfs\2.0.0-cdh4.0.0\hadoop-hdfs-2.0.0-cdh4.0.0.jar’
    Missing artifact org.apache.hadoop:hadoop-auth:jar:2.0.0-cdh4.0.0 pom.xml /MapReduce
    Missing artifact org.apache.hadoop:hadoop-common:jar:2.0.0-cdh4.0.0 pom.xml /MapReduce
    Missing artifact org.apache.hadoop:hadoop-hdfs:jar:2.0.0-cdh4.0.0 pom.xml /MapReduce

    Any help will be appreciated.

  6. Hi Antoine,

    I have been trying to get Maven to download the libs for the past week, but this error won’t go away. I did everything as you have described, but Maven is still not able to connect to the Cloudera repo. This is Eclipse Juno on Windows 7, and I have CDH ready on a Linux machine.

    ArtifactDescriptorException: Failed to read artifact descriptor for org.apache.hadoop:hadoop-hdfs:jar:2.0.0-cdh4.0.0: ArtifactResolutionException: Failure to transfer org.apache.hadoop:hadoop-hdfs:pom:2.0.0-cdh4.0.0 from https://repository.cloudera.com.artifactory/cloudera-repos was cached in the local repository, resolution will not be reattempted until the update interval of cloudera has elapsed or updates are forced. Original error: Could not transfer artifact org.apache.hadoop:hadoop-hdfs:pom:2.0.0-cdh4.0.0 from/to cloudera (https://repository.cloudera.com.artifactory/cloudera-repos): null to https://repository.cloudera.com.artifactory/cloudera-repos/org/apache/hadoop/hadoop-hdfs/2.0.0-cdh4.0.0/hadoop-hdfs-2.0.0-cdh4.0.0.pom.

    Would appreciate hearing back if possible .

    Thanks

  7. I’m following the same instructions but I get this exception; what is wrong here?

    Exception in thread "main" java.lang.ClassNotFoundException: src.main.java.com.hadoop.bi.MapReduce.MaxTemperature
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:270)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:201)

    run :
    [root@dev MapReduce]# hadoop jar target/MapReduce-0.0.1-SNAPSHOT.jar src/main/java/com/hadoop/bi/MapReduce/MaxTemperature sample.txt /out

    and this is my main class:
    // cc MaxTemperature Application to find the maximum temperature in the weather dataset
    // vv MaxTemperature
    package com.hadoop.bi.MapReduce;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {

        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.err.println("Usage: MaxTemperature ");
                System.exit(-1);
            }
            Configuration conf = new Configuration();
            Job job = new Job(conf, "MaxTemperature");
            job.setJarByClass(MaxTemperature.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(MaxTemperatureMapper.class);
            job.setReducerClass(MaxTemperatureReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    // ^^ MaxTemperature

  8. Pingback: Setting up Eclipse IDE for Hadoop 2.5.0 | Hadoop

  9. I have installed Cloudera CDH4 on Ubuntu. Now, following these steps, can I do the MapReduce examples myself? Thanks in advance. Please reply quickly.

    Regards,
    Arindam Bose

  10. Just to add on: I used the same link as Amp above, hence I believe I got a bit of the answer. Can you guide me to a proper site for learning Hadoop from scratch (MapReduce- and admin-wise)? I will give this a shot and reply back quickly.

    Thanks Again!!!

  11. Another thing I tried but am getting some error, my pom file is below:-

    4.0.0

    com.aamend.com
    MapReduce1
    0.0.1-SNAPSHOT
    jar

    MapReduce1
    http://maven.apache.org

    UTF-8

    junit
    junit
    3.8.1
    test

    jdk.tools
    jdk.tools
    1.6

    org.apache.hadoop
    hadoop-hdfs
    2.0.0-cdh4.0.0

    org.apache.hadoop
    hadoop-auth
    2.0.0-cdh4.0.0

    org.apache.hadoop
    hadoop-common
    2.0.0-cdh4.0.0

    org.apache.hadoop
    hadoop-core
    2.0.0-mr1-cdh4.0.1

    junit
    junit-dep
    4.8.2

    org.apache.hadoop
    hadoop-hdfs

    org.apache.hadoop
    hadoop-auth

    org.apache.hadoop
    hadoop-common

    org.apache.hadoop
    hadoop-core

    junit
    junit
    4.10
    test

    org.apache.maven.plugins
    maven-compiler-plugin
    2.1

    1.6
    1.6

    The plugins tag is throwing an error :( Then I need to do right-click > Maven > Update Project (I am using Juno and Maven 3, as per your advice). Please help.

  12. Getting this error; what could be the issue?
    $ hadoop jar /tmp/MapReduce-0.0.1-SNAPSHOT.jar com.chan.hadoop.MapReduce.WordCount input output
    15/10/20 23:18:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    15/10/20 23:18:38 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    15/10/20 23:18:38 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    15/10/20 23:18:39 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
    15/10/20 23:18:39 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/tmp/hadoop-a_campsb/mapred/staging/a_campsb678466908/.staging/job_local678466908_0001
    Exception in thread "main" org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://fastdevl0543.svr.emea.jpmchase.net:9000/user/a_campsb/input
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:321)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:264)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:385)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:597)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:614)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:492)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1296)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1293)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1293)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1314)
    at com.chan.hadoop.MapReduce.WordCount.main(WordCount.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
