Data Ingestion

  • Amazon Kinesis – real-time processing of streaming data at massive scale.
  • Apache Chukwa – data collection system.
  • Apache Flume – service for managing large amounts of log data.
  • Apache Kafka – distributed publish-subscribe messaging system.
  • Apache Sqoop – tool to transfer data between Hadoop and a structured datastore.
  • Cloudera Morphlines – framework that helps with ETL into Solr, HBase and HDFS.
  • Facebook Scribe – streamed log data aggregator.
  • Fluentd – tool to collect events and logs.
  • Google Photon – geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency.
  • Heka – open source stream processing software system.
  • HIHO – framework for connecting disparate data sources with Hadoop.
  • Kestrel – distributed message queue system.
  • LinkedIn Databus – stream of change capture events for a database.
  • LinkedIn Kamikaze – utility package for compressing sorted integer arrays.
  • LinkedIn White Elephant – log aggregator and dashboard.
  • Logstash – a tool for managing events and logs.
  • Netflix Suro – log aggregator, like Storm and Samza, based on Chukwa.
  • Pinterest Secor – service implementing Kafka log persistence.
  • LinkedIn Gobblin – LinkedIn’s universal data ingestion framework.
  • StreamSets Data Collector – continuous big data ingest infrastructure with a simple-to-use IDE.