Thursday, May 1, 2014

How to fix hadoop NoClassDefFoundError

Have you ever faced a situation where a Hadoop program fails with an exception similar to this:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/crunch/Pipeline
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:201)
Caused by: java.lang.ClassNotFoundException: org.apache.crunch.Pipeline
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 ... 3 more
A NoClassDefFoundError is thrown when a class is not available to the JVM at runtime. The JVM expects third-party jars and classes to be present on its classpath during execution.

MapReduce jobs are executed in a node's task tracker JVM, so how do we guarantee that third-party jars are available on that JVM's classpath?

Distributed Cache and HADOOP_CLASSPATH
Hadoop requires all third-party/external jars to be available on both the client and the remote JVM classpaths. To accomplish this, two steps have to be followed.

Update HADOOP_CLASSPATH to include the location of the third-party jars so that the client JVM can use them.
export HADOOP_CLASSPATH=/path/file1.jar:/path/file2.jar:$HADOOP_CLASSPATH

Run the hadoop command with the -libjars option. This instructs Hadoop to place the third-party jars in the DistributedCache. Once the jars are in the DistributedCache, the Hadoop framework takes care of placing them on the task node's JVM classpath. However, this only works when your program implements the Tool interface (and is launched through ToolRunner) so that Hadoop can recognize the -libjars option; a minimal driver sketch follows the command below.

hadoop jar <jar-file> MainClass -libjars /path/file1.jar,/path/file2.jar arg1 arg2
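
For reference, here is a minimal driver sketch, assuming a hypothetical class name MyJob and leaving the mapper/reducer setup out. The important part is extending Configured, implementing Tool, and launching through ToolRunner, which parses generic options such as -libjars before your own arguments reach run():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration already populated by ToolRunner,
        // including the jars passed via -libjars.
        Job job = Job.getInstance(getConf(), "my-job");
        job.setJarByClass(MyJob.class);
        // Mapper/reducer configuration omitted for brevity in this sketch.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-libjars, -D, -files, ...) and
        // passes only the remaining arguments to run().
        System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
    }
}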
By running the export command first and then issuing the hadoop command with the -libjars option, your MapReduce jobs should be able to use third-party jars.

Alternatively, if the third-party jar is a build dependency, you can build an uber jar so that all dependencies are packaged along with your main jar. Here is a guide on how to build a shaded (uber) jar: http://blog.nithinasokan.com/2014/05/create-shaded-uber-jar-for-maven-multi.html