Sunday, April 20, 2014

Crunch Pipelines

Pipeline

A Crunch pipeline executes a series of MapReduce jobs, created by the Crunch planner, that read, write, and process data. The planner chains the MapReduce jobs in their order of execution, identifies jobs that can run independently of one another, and launches a job as soon as its parent jobs have completed.

The Pipeline interface provides a set of methods that make it possible to read/write data, control pipeline execution, update configurations, and clean up temporary files.
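
As an illustration, here is a minimal sketch of that lifecycle using an MRPipeline (described below). The class name and input/output paths are hypothetical; the read, write, and done calls come from the Pipeline interface itself.

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;

    public class PipelineLifecycle {
      public static void main(String[] args) {
        // Any Pipeline implementation exposes the same lifecycle.
        Pipeline pipeline = new MRPipeline(PipelineLifecycle.class);

        // Read a text file into a PCollection of lines (path is hypothetical).
        PCollection<String> lines = pipeline.read(From.textFile("/input/lines.txt"));

        // Write the collection to a target and execute any outstanding jobs;
        // done() also cleans up the pipeline's temporary files.
        pipeline.write(lines, To.textFile("/output/lines"));
        pipeline.done();
      }
    }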

Types of Pipeline:

There are three different types of Pipeline (at the time of writing):
  1. MRPipeline
  2. SparkPipeline
  3. MemPipeline
MRPipeline
The most commonly used Pipeline, MRPipeline executes MapReduce jobs. It also provides an MRPipeline#plan() method that runs the planning phase of the pipeline and generates the execution plan in DOT format. The generated DOT output can be visualized with external tools to understand the execution strategy for that Pipeline.
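
A minimal sketch of obtaining the plan, assuming a trivial read/write pipeline with hypothetical paths; this relies on plan() returning a PipelineExecution whose getPlanDotFile() exposes the DOT output:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PipelineExecution;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;

    public class PlanDotExample {
      public static void main(String[] args) {
        MRPipeline pipeline = new MRPipeline(PlanDotExample.class);
        PCollection<String> lines = pipeline.read(From.textFile("/input/lines.txt"));
        pipeline.write(lines, To.textFile("/output/lines"));

        // plan() runs only the planning phase; the resulting execution
        // exposes the job graph as a DOT string for external visualization.
        PipelineExecution execution = pipeline.plan();
        System.out.println(execution.getPlanDotFile());
      }
    }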

SparkPipeline
A pipeline implementation that translates a Crunch pipeline into a series of Spark jobs using the Apache Spark API.
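
A sketch of constructing one, assuming a local in-process Spark master; the app name and paths are hypothetical, and the rest of the code is identical to the MRPipeline example above:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.spark.SparkPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;

    public class SparkExample {
      public static void main(String[] args) {
        // "local" runs Spark in-process; a cluster master URL works the same way.
        Pipeline pipeline = new SparkPipeline("local", "crunch-spark-example");
        PCollection<String> lines = pipeline.read(From.textFile("/input/lines.txt"));
        pipeline.write(lines, To.textFile("/output/lines"));
        pipeline.done();
      }
    }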

MemPipeline
An in-memory pipeline that can run on a client machine. A MemPipeline does not submit any MapReduce jobs to an actual Hadoop cluster, which makes it effective for testing user code; the typical use case is unit testing, where a user can verify pipeline logic without a cluster. Since a MemPipeline runs entirely in memory, the input dataset should be small enough to fit comfortably in memory.

Unlike MRPipeline, a MemPipeline provides a set of methods to create PCollections and PTables on the fly. Note, however, that PCollections and PTables created with a MemPipeline should not be used in an MRPipeline or SparkPipeline.
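
For example, here is a minimal sketch of building in-memory collections with MemPipeline's static factory methods; the sample values are arbitrary:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.impl.mem.MemPipeline;

    public class MemPipelineExample {
      public static void main(String[] args) {
        // Static factory methods build collections directly in memory,
        // with no input files or cluster involved.
        PCollection<String> words = MemPipeline.collectionOf("apple", "banana", "cherry");
        PTable<String, Integer> counts = MemPipeline.tableOf("apple", 1, "banana", 2);

        // Materialization is immediate because everything lives in memory.
        for (String word : words.materialize()) {
          System.out.println(word);
        }
      }
    }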

Additionally, a MemPipeline does not offer every feature the other pipelines do (for example, the DistributedCache), mainly because of its in-memory nature. The idea is to use this pipeline on smaller datasets that run faster in memory.
