System Deployment

  • Apache Ambari – operational framework for Hadoop management.
  • Apache Bigtop – system deployment framework for the Hadoop ecosystem.
  • Apache Helix – cluster management framework.
  • Apache Mesos – cluster manager.
  • Apache Slider – YARN application to deploy existing distributed applications on YARN.
  • Apache Whirr – set of libraries for running cloud services.
  • Apache YARN – cluster manager.
  • Brooklyn – library that simplifies application deployment and management.
  • Buildoop – similar to Apache Bigtop, based on the Groovy language.
  • Cloudera HUE – web application for interacting with Hadoop.
  • Facebook Prism – multi-datacenter replication system.
  • Google Borg – job scheduling and monitoring system.
  • Google Omega – job scheduling and monitoring system.
  • Hortonworks HOYA – application that can deploy HBase cluster on YARN.
  • Marathon – Mesos framework for long-running services.

Machine Learning

  • Apache Mahout – machine learning library for Hadoop.
  • brain – Neural networks in JavaScript.
  • Cloudera Oryx – real-time large-scale machine learning.
  • Concurrent Pattern – machine learning library for Cascading.
  • convnetjs – Deep Learning in JavaScript. Train Convolutional Neural Networks (or ordinary ones) in your browser.
  • Decider – Flexible and Extensible Machine Learning in Ruby.
  • ENCOG – machine learning framework that supports a variety of advanced algorithms, as well as support classes to normalize and process data.
  • etcML – text classification with machine learning.
  • Etsy Conjecture – scalable Machine Learning in Scalding.
  • Google Sibyl – System for Large Scale Machine Learning at Google.
  • GraphLab Create – A machine learning platform in Python with a broad collection of ML toolkits, data engineering, and deployment tools.
  • H2O – statistical, machine learning and math runtime for Hadoop.
  • MLbase – distributed machine learning libraries for the BDAS stack.
  • MLPNeuralNet – Fast multilayer perceptron neural network library for iOS and Mac OS X.
  • MonkeyLearn – Text mining made easy. Extract and classify data from text.
  • nupic – Numenta Platform for Intelligent Computing: a brain-inspired machine intelligence platform, and biologically accurate neural network based on cortical learning algorithms.
  • PredictionIO – machine learning server built on Hadoop, Mahout and Cascading.
  • SAMOA – distributed streaming machine learning framework.
  • scikit-learn – machine learning in Python.
  • Spark MLlib – a Spark implementation of some common machine learning (ML) functionality.
  • Vowpal Wabbit – learning system sponsored by Microsoft and Yahoo!.
  • WEKA – suite of machine learning software.
  • BIDMach – CPU and GPU-accelerated Machine Learning Library.

Scheduling

  • Apache Aurora – service scheduler that runs on top of Apache Mesos.
  • Apache Falcon – data management framework.
  • Apache Oozie – workflow job scheduler.
  • Chronos – distributed and fault-tolerant scheduler.
  • LinkedIn Azkaban – batch workflow job scheduler.
  • Schedoscope – Scala DSL for agile scheduling of Hadoop jobs.
  • Sparrow – scheduling platform.
  • Airflow – a platform to programmatically author, schedule and monitor workflows.

Service Programming

  • Akka Toolkit – runtime for distributed and fault-tolerant event-driven applications on the JVM.
  • Apache Avro – data serialization system.
  • Apache Curator – Java libraries for Apache ZooKeeper.
  • Apache Karaf – OSGi runtime that runs on top of any OSGi framework.
  • Apache Thrift – framework to build binary protocols.
  • Apache ZooKeeper – centralized service for configuration management, naming, and distributed synchronization.
  • Google Chubby – a lock service for loosely-coupled distributed systems.
  • LinkedIn Norbert – cluster manager.
  • OpenMPI – message passing framework.
  • Serf – decentralized solution for service discovery and orchestration.
  • Spotify Luigi – a Python package for building complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.
  • Spring XD – distributed and extensible system for data ingestion, real time analytics, batch processing, and data export.
  • Twitter Elephant Bird – libraries for working with LZOP-compressed data.
  • Twitter Finagle – asynchronous network stack for the JVM.

Data Ingestion

  • Amazon Kinesis – real-time processing of streaming data at massive scale.
  • Apache Chukwa – data collection system.
  • Apache Flume – service to manage large amounts of log data.
  • Apache Kafka – distributed publish-subscribe messaging system.
  • Apache Sqoop – tool to transfer data between Hadoop and a structured datastore.
  • Cloudera Morphlines – framework that helps with ETL to Solr, HBase and HDFS.
  • Facebook Scribe – streamed log data aggregator.
  • Fluentd – tool to collect events and logs.
  • Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real-time with high scalability and low latency.
  • Heka – open source stream processing software system.
  • HIHO – framework for connecting disparate data sources with Hadoop.
  • Kestrel – distributed message queue system.
  • LinkedIn Databus – stream of change capture events for a database.
  • LinkedIn Kamikaze – utility package for compressing sorted integer arrays.
  • LinkedIn White Elephant – log aggregator and dashboard.
  • Logstash – a tool for managing events and logs.
  • Netflix Suro – log aggregator like Storm and Samza, based on Chukwa.
  • Pinterest Secor – service implementing Kafka log persistence.
  • LinkedIn Gobblin – LinkedIn’s universal data ingestion framework.
  • StreamSets Data Collector – continuous big data ingest infrastructure with a simple to use IDE.

Time-Series Databases

  • Cube – uses MongoDB to store time series data.
  • Axibase Time Series Database – distributed time series database on top of HBase. Includes built-in Rule Engine, data forecasting and visualization.
  • Heroic – scalable time series database based on Cassandra and Elasticsearch.
  • InfluxDB – distributed time series database.
  • KairosDB – similar to OpenTSDB but allows for Cassandra.
  • OpenTSDB – distributed time series database on top of HBase.
  • Prometheus – a time series database and service monitoring system.
  • Newts – a time series database based on Apache Cassandra.

NewSQL Databases

  • Actian Ingres – commercially supported, open-source SQL relational database management system.
  • Amazon Redshift – data warehouse service, based on PostgreSQL.
  • BayesDB – statistics-oriented SQL database.
  • CitusDB – scales out PostgreSQL through sharding and replication.
  • Cockroach – Scalable, Geo-Replicated, Transactional Datastore.
  • Datomic – distributed database designed to enable scalable, flexible and intelligent applications.
  • FoundationDB – distributed database, inspired by F1.
  • Google F1 – distributed SQL database built on Spanner.
  • Google Spanner – globally distributed semi-relational database.
  • H-Store – experimental main-memory, parallel database management system that is optimized for on-line transaction processing (OLTP) applications.
  • Haeinsa – linearly scalable multi-row, multi-table transaction library for HBase based on Percolator.
  • HandlerSocket – NoSQL plugin for MySQL/MariaDB.
  • InfiniSQL – infinitely scalable RDBMS.
  • MemSQL – in-memory SQL database with optimized columnar storage on flash.
  • NuoDB – SQL/ACID compliant distributed database.
  • Oracle TimesTen in-Memory Database – in-memory, relational database management system with persistence and recoverability.
  • Pivotal GemFire XD – Low-latency, in-memory, distributed SQL data store. Provides SQL interface to in-memory table data, persistable in HDFS.
  • SAP HANA – in-memory, column-oriented, relational database management system.
  • SenseiDB – distributed, realtime, semi-structured database.
  • Sky – database used for flexible, high performance analysis of behavioral data.
  • SymmetricDS – open source software for both file and database synchronization.
  • Map-D – GPU in-memory database, big data analysis and visualization platform.
  • TiDB – distributed SQL database inspired by the design of Google F1.
  • VoltDB – claims to be the fastest in-memory database.