Thursday, April 17, 2014

Introduction to Apache Crunch

What is Apache Crunch?


From the Apache Crunch website:
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Apache Crunch is a Java API that works on top of Hadoop and Apache Spark. I have been using Crunch for more than a year now, and I find it really neat and simple for writing any MapReduce program. The main advantages, in my view, are rapid development, the ease with which complex operations can be expressed, the absence of boilerplate code, and an excellent planner for pipeline execution.

Some of the features of Crunch include:
  • support for complex data operations such as joins and unions
  • support for reading/writing data via HBase
  • support for Avro, Protocol Buffers, and Thrift
  • managing pipeline execution
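
To give a flavor of the API, here is a minimal word-count sketch modeled on the getting-started example from the Crunch documentation. The class name and the use of command-line arguments for the input and output paths are my own choices:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // Create a MapReduce-backed pipeline; the class is used to locate the containing jar.
    Pipeline pipeline = new MRPipeline(WordCount.class);

    // Read a text file from HDFS into a PCollection of lines.
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words using a DoFn, Crunch's basic processing primitive.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // Count the occurrences of each word.
    PTable<String, Long> counts = words.count();

    // Write the counts back to HDFS and run the pipeline.
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}

Note that nothing executes until pipeline.done() (or run()) is called; the planner inspects the whole pipeline first and decides how to pack the operations into MapReduce jobs.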

How is it different?

A lot of people may wonder how this is different from other APIs like Cascading and Apache Pig. There is an excellent post on Quora that describes the differences.

Crunch Data Types

The Crunch API supports three data types (at the time of writing) that represent distributed data in HDFS (the Hadoop Distributed File System).

  • PCollection<T>
    • the Crunch representation of a distributed, immutable collection that can hold data of type T.
  • PTable<K,V>
    • the Crunch representation of a table: a distributed, unordered map of keys to values, in which duplicate keys and values are allowed. PTable is a sub-interface of PCollection, with the difference that a PTable holds a key (K) and a corresponding value (V).
  • PGroupedTable<K, V>
    • the Crunch representation of a grouped table: a distributed, sorted map of keys (K) to an Iterable of values (V). A PGroupedTable is created by calling PTable#groupByKey(), which triggers the sort-and-shuffle phase of a MapReduce job. To create a PGroupedTable, Crunch groups the distinct keys of the PTable and loads the values for each key into an Iterable<V> corresponding to that key. Note that the keys are in sorted order, since they pass through the sort-and-shuffle phase, and that the Iterable<V> for a key can be iterated only once; subsequent iterations will fail. A short sketch of grouping follows this list.
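
As a rough sketch of how these three types fit together (the class and method names here are my own, and the input is assumed to be a PTable<String, Long> such as the word counts above):

import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

public class GroupingExample {
  // Sums the values for each key; 'counts' may contain duplicate keys.
  public static PTable<String, Long> sumByKey(PTable<String, Long> counts) {
    // groupByKey() triggers the sort-and-shuffle phase and yields a
    // PGroupedTable with one Iterable<Long> per distinct key.
    PGroupedTable<String, Long> grouped = counts.groupByKey();

    // combineValues() folds each Iterable<Long> into a single value, here by
    // summing, producing a PTable with exactly one entry per key.
    return grouped.combineValues(Aggregators.SUM_LONGS());
  }
}
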
We will discuss Pipelines, PCollections, and PTables in more detail in the next few posts.
