Distributed Programming

  • AddThis Hydra – distributed data processing and storage system originally developed at AddThis.
  • AMPLab SIMR – run Spark on Hadoop MapReduce v1.
  • Apache Crunch – a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce.
  • Apache DataFu – collection of user-defined functions for Hadoop and Pig developed by LinkedIn.
  • Apache Flink – high-performance runtime with automatic program optimization.
  • Apache Gora – framework for in-memory data model and persistence.
  • Apache Hama – BSP (Bulk Synchronous Parallel) computing framework.
  • Apache MapReduce – programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
  • Apache Pig – high-level language to express data analysis programs for Hadoop.
  • Apache REEF – retainable evaluator execution framework to simplify and unify the lower layers of big data systems.
  • Apache S4 – framework for stream processing, an implementation of S4 (Simple Scalable Streaming System).
  • Apache Spark – framework for in-memory cluster computing.
  • Apache Spark Streaming – framework for stream processing, part of Spark.
  • Apache Storm – framework for stream processing, originally from Twitter; also runs on YARN.
  • Apache Samza – stream processing framework, based on Kafka and YARN.
  • Apache Tez – application framework for executing a complex DAG (directed acyclic graph) of tasks, built on YARN.
  • Apache Twill – abstraction over YARN that reduces the complexity of developing distributed applications.
  • Cascalog – data processing and querying library.
  • Cheetah – high-performance custom data warehouse on top of MapReduce.
  • Concurrent Cascading – framework for data management/analytics on Hadoop.
  • Damballa Parkour – MapReduce library for Clojure.
  • Datasalt Pangool – alternative MapReduce paradigm.
  • DataTorrent StrAM – real-time engine designed to enable distributed, asynchronous, real-time in-memory big-data computations in as unblocked a way as possible, with minimal overhead and impact on performance.
  • Facebook Corona – Hadoop enhancement that removes the single point of failure.
  • Facebook Peregrine – MapReduce framework.
  • Facebook Scuba – distributed in-memory datastore.
  • Google Dataflow – create data pipelines to help ingest, transform and analyze data.
  • Google MapReduce – MapReduce framework.
  • Google MillWheel – fault-tolerant stream processing framework.
  • JAQL – declarative programming language for working with structured, semi-structured and unstructured data.
  • Kite – set of libraries, tools, examples, and documentation focused on making it easier to build systems on top of the Hadoop ecosystem.
  • Metamarkets Druid – framework for real-time analysis of large datasets.
  • Netflix PigPen – MapReduce for Clojure that compiles to Apache Pig.
  • Nokia Disco – MapReduce framework developed by Nokia.
  • Pinterest Pinlater – asynchronous job execution system.
  • Pydoop – Python MapReduce and HDFS API for Hadoop.
  • Rackerlabs Blueflood – multi-tenant distributed metric processing system.
  • Stratosphere – general-purpose cluster computing framework.
  • Streamdrill – useful for counting activities in event streams over different time windows and finding the most active one.
  • Tuktu – easy-to-use platform for batch and streaming computation, built using Scala, Akka and Play!
  • Twitter Scalding – Scala library for MapReduce jobs, built on Cascading.
  • Twitter Summingbird – streaming MapReduce with Scalding and Storm.
  • Twitter TSAR – TimeSeries AggregatoR by Twitter.
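
Many of the entries above build on the MapReduce programming model. As a rough illustration only, the sketch below runs the three phases (map, shuffle, reduce) of a word count in a single Python process. The function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and not taken from any listed project; a real framework would distribute each phase across a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (done over the network in a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values collected for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over an in-memory input."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["spark storm spark", "storm flink"]))
```

Because map and reduce act on independent lines and keys, each phase can in principle be parallelized, which is the property the frameworks in this list exploit.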