
Thursday, May 1, 2014

How to fix Hadoop NoClassDefFoundError

Ever faced a situation where a Hadoop program fails with an exception similar to this:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/crunch/Pipeline
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:201)
Caused by: java.lang.ClassNotFoundException: org.apache.crunch.Pipeline
 at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 ... 3 more
A NoClassDefFoundError is thrown when a class is not available to the JVM at runtime. The JVM expects third-party jars and classes to be present on its classpath during execution.

MapReduce tasks are executed in JVMs spawned by each node's task tracker, so how do we guarantee that third-party jars are available on those JVMs' classpaths?

Distributed Cache and HADOOP_CLASSPATH
Hadoop requires all third-party/external jars to be available on both the client and the remote JVM classpaths. To accomplish this, two steps have to be followed.

Update HADOOP_CLASSPATH to include the location of the third-party jars so that the client JVM can use them.
export HADOOP_CLASSPATH=/path/file1.jar:/path/file2.jar:$HADOOP_CLASSPATH

Run the hadoop command with the -libjars option. This instructs Hadoop to place the third-party jars in the DistributedCache. Once the jars are in the DistributedCache, the Hadoop framework takes care of placing them on the classpath of the task JVMs. However, this only works when your program implements the Tool interface (a minimal driver sketch follows below), so that Hadoop can recognize the -libjars option.

hadoop jar <jar-file> MainClass -libjars /path/file1.jar,/path/file2.jar arg1 arg2
By running the export command first and passing the -libjars option along with your hadoop command, MapReduce jobs should be able to use third-party jars.
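
For reference, here is a minimal sketch of what a Tool-based driver might look like; ToolRunner invokes GenericOptionsParser, which strips -libjars (and other generic options) out of the arguments before your own code sees them. The class name MyJobDriver and the empty job setup are placeholders, not code from this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// MyJobDriver is an illustrative name; replace it with your own driver class.
public class MyJobDriver extends Configured implements Tool {

    @Override
    public int run(final String[] args) throws Exception {
        // By this point GenericOptionsParser has already consumed -libjars and
        // merged its effect into the configuration returned by getConf().
        final Configuration conf = getConf();
        // ... set up and submit the MapReduce job here using conf ...
        return 0;
    }

    public static void main(final String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser, which understands -libjars, -D, etc.
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
    }
}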

Alternatively, if the third-party jar is a build dependency, you can build an uber jar so that all dependencies are packaged along with the main jar. Here is a guide on how to build a shaded (uber) jar: http://blog.nithinasokan.com/2014/05/create-shaded-uber-jar-for-maven-multi.html


Wednesday, April 30, 2014

Crunch program to read/write text files

In this post I will demonstrate how to add the Apache Crunch dependency to a project and write a basic MapReduce application that is capable of reading a text file from HDFS (Hadoop Distributed File System) and writing it back to HDFS.

Prerequisites
A Hadoop distribution available in distributed or pseudo-distributed mode. (If you do not have Hadoop installed, you may follow the instructions from Cloudera to install Hadoop in pseudo-distributed mode.)

Crunch maven dependency
Apache Crunch can be added to Maven projects with the following artifact information:
<dependency>
    <groupId>org.apache.crunch</groupId>
    <artifactId>crunch-core</artifactId>
    <version>${crunch.version}</version>
</dependency>
Release versions are available here. Note that there are certain constraints when choosing a Crunch version because of its dependency on a particular HBase version. Since Crunch adds improvements and features on a regular basis, it is a good idea to follow the official recommendation when choosing a version; you can find this information on the Crunch Getting Started page. For this example, I'm using 0.9.0-hadoop2 as my Crunch version.

Example program
The goal of this program is to verify that the Crunch dependency plays well with our Hadoop installation. We will read a text file and write its contents to a different path in HDFS.
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class ReadWriteExample {

    public static void main(final String[] args) throws Exception {

        if (args.length != 2) {
            System.err.println("Usage: hadoop jar <jar-file> input-path output-path");
            System.exit(1);
        }

        // Get input and output paths
        final String inputPath = args[0];
        final String outputPath = args[1];

        // Create an instance of Pipeline by providing ClassName and the job configuration.
        final Pipeline pipeline = new MRPipeline(ReadWriteExample.class, new Configuration());

        // Read a text file
        final PCollection<String> inputLines = pipeline.readTextFile(inputPath);

        // write the text file
        pipeline.writeTextFile(inputLines, outputPath);

        // Execute the pipeline by calling Pipeline#done()
        final PipelineResult result = pipeline.done();

        System.exit(result.succeeded() ? 0 : 1);
    }
}
Pipeline#readTextFile(String) expects a path that lives in HDFS. The path can denote a file name if you want the MapReduce program to read a single file. However, if the path denotes a folder, Hadoop will read all text files inside the folder and output the results.

Now, to test this program you need to create your input file in HDFS. One quick way to do this is to copy files from the local filesystem to HDFS. This can be done from the terminal/command prompt:
hadoop fs -put /path/to/local-file-or-folder/ /path/to/destination/
Folders can be created using:
hadoop fs -mkdir /path/to/folderName/
To run the program, you have to build a jar file (you may use Maven, e.g. mvn clean package) and issue this command:
hadoop jar <jar-file-location> ReadWriteExample /inputFile.txt /output-folder/

Here is the set of commands I used to run this program
> hadoop fs -mkdir readWriteExample
> hadoop fs -mkdir readWriteExample/inputs
> hadoop fs -put /home/nithin/inputFile.txt readWriteExample/inputs/
> hadoop jar crunch-examples-1.0-SNAPSHOT.jar com.crunch.tutor.examples.demo.ReadWriteExample readWriteExample/inputs/ readWriteExample/outputs/

Verify the results:
> hadoop fs -ls readWriteExample/outputs/
> hadoop fs -cat readWriteExample/outputs/part-m-00000
Source code is available on GitHub.

Make use of the comments section if you have a question.