Awesome DataOps
A curated list of awesome DataOps tools.
Data Catalog
Tools related to data cataloging.
- Amundsen – Data discovery and metadata engine that improves productivity when working with data.
- Apache Atlas – Provides open metadata management and governance capabilities to build a data catalog.
- CKAN – Open-source DMS (data management system) for powering data hubs and data portals.
- DataHub – LinkedIn’s generalized metadata search & discovery tool.
- Magda – A federated, open-source data catalog for all your big data and small data.
- Metacat – Unified metadata exploration API service for Hive, RDS, Teradata, Redshift, S3 and Cassandra.
- OpenMetadata – A single place to discover, collaborate on, and get your data right.
Data Exploration
Tools for performing data exploration.
- Apache Zeppelin – Enables data-driven, interactive data analytics and collaborative documents.
- Jupyter Notebook – Web-based notebook environment for interactive computing.
- JupyterLab – The next-generation user interface for Project Jupyter.
- Jupytext – Jupyter Notebooks as Markdown Documents, Julia, Python or R scripts.
- Polynote – The polyglot notebook with first-class Scala support.
Data Ingestion
Tools for performing data ingestion.
- Amazon Kinesis – Easily collect, process, and analyze video and data streams in real time.
- Apache Gobblin – A framework that simplifies common aspects of big data such as data ingestion.
- Apache Kafka – Open-source distributed event streaming platform used by thousands of companies.
- Apache Pulsar – Distributed pub-sub messaging platform with a flexible messaging model and intuitive API.
- Embulk – A parallel bulk data loader that helps data transfer between various storages.
- Fluentd – Collects events from various data sources and writes them to files, databases, and other storage systems.
- Google PubSub – Ingest events for streaming into BigQuery, data lakes or operational databases.
- Nakadi – A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.
- Pravega – An open source distributed storage service implementing Streams.
- RabbitMQ – One of the most popular open source message brokers.
Data Lake
Tools related to storing data in data lakes.
- Delta Lake – An open source project that enables building a Lakehouse architecture on top of data lakes.
- LakeFS – Open source tool that transforms your object storage into a Git-like repository.
Data Workflow
Tools related to data workflows and pipelines.
- Apache Airflow – A platform to programmatically author, schedule, and monitor workflows.
- Apache Oozie – An extensible, scalable and reliable system to manage complex Hadoop workloads.
- Azkaban – Batch workflow job scheduler created at LinkedIn to run Hadoop jobs.
- Dagster – An orchestration platform for the development, production, and observation of data assets.
- Luigi – Python module that helps you build complex pipelines of batch jobs.
- Prefect – A workflow management system, designed for modern infrastructure.
Data Processing
Tools related to data processing (batch and stream).
- Apache Beam – A unified model for defining both batch and streaming data-parallel processing pipelines.
- Apache Flink – An open source stream processing framework with powerful stream- and batch-processing capabilities.
- Apache Hadoop MapReduce – A framework for writing applications which process vast amounts of data.
- Apache Hudi – Hadoop Upserts Deletes and Incrementals.
- Apache NiFi – An easy-to-use, powerful, and reliable system to process and distribute data.
- Apache Samza – A distributed stream processing framework which uses Apache Kafka and Hadoop YARN.
- Apache Spark – A unified analytics engine for large-scale data processing.
- Apache Storm – An open source distributed realtime computation system.
- Apache Tez – A generic data-processing pipeline engine envisioned as a low-level engine.
- Faust – A stream processing library, porting the ideas from Kafka Streams to Python.
Data Quality
Tools for ensuring data quality.
- Cerberus – Lightweight, extensible data validation library for Python.
- Great Expectations – A Python data validation framework that lets you test your datasets against declared expectations.
- JSON Schema – A vocabulary that allows you to annotate and validate JSON documents.
Data Serialization
Tools related to data serialization.
- Apache Avro – A data serialization system which is compact, fast and provides rich data structures.
- Apache ORC – A self-describing type-aware columnar file format designed for Hadoop workloads.
- Apache Parquet – A columnar storage format which provides efficient storage and encoding of data.
- Kryo – A fast and efficient binary object graph serialization framework for Java.
- ProtoBuf – Language-neutral, platform-neutral, extensible mechanism for serializing structured data.
Data Compression
- Pigz – A parallel implementation of gzip for modern multi-processor, multi-core machines.
- Snappy – Open source compression library that is fast, stable and robust.
Data Visualization
Tools for performing data visualization (DataViz).
- Apache Superset – A modern data exploration and data visualization platform.
- Count – SQL/drag-and-drop querying and visualisation tool based on notebooks.
- Dash – Analytical Web Apps for Python, R, Julia, and Jupyter.
- Data Studio – Reporting solution for power users who want to go beyond the data and dashboards of Google Analytics (GA).
- HUE – A mature SQL Assistant for querying Databases & Data Warehouses.
- Lux – Fast and easy data exploration by automating the visualization and data analysis process.
- Metabase – The simplest, fastest way to get business intelligence and analytics to everyone.
- Redash – Connect to any data source, easily visualize, dashboard and share your data.
- Tableau – A powerful, fast-growing data visualization tool used in the business intelligence industry.
Data Warehouse
Tools related to storing data in data warehouses (DW).
- Amazon Redshift – Accelerate your time to insights with fast, easy, and secure cloud data warehousing.
- Apache Hive – Facilitates reading, writing, and managing large datasets residing in distributed storage.
- Google BigQuery – Serverless, highly scalable, and cost-effective multicloud data warehouse.
Database
Database tools for storing data.
Columnar Database
- Apache Cassandra – Open source column based DBMS designed to handle large amounts of data.
- Apache Druid – Designed to quickly ingest massive quantities of event data, and provide low-latency queries.
- Apache HBase – An open-source, distributed, versioned, column-oriented store.
- Scylla – Designed to be compatible with Cassandra while achieving higher throughputs and lower latencies.
Document-Oriented Database
- Apache CouchDB – An open-source document-oriented NoSQL database, implemented in Erlang.
- Elasticsearch – A distributed document oriented database with a RESTful search engine.
- MongoDB – A cross-platform document database that uses JSON-like documents with optional schemas.
- RethinkDB – The first open-source scalable database built for realtime applications.
Graph Database
- ArangoDB – A scalable open-source multi-model database natively supporting graph, document and search.
- Neo4j – A high performance graph store with all the features expected of a mature and robust database.
- Titan – A highly scalable graph database optimized for storing and querying large graphs.
Key-Value Database
- Apache Accumulo – A sorted, distributed key-value store that provides robust and scalable data storage.
- etcd – Distributed reliable key-value store for the most critical data of a distributed system.
- Memcached – A high performance multithreaded event-based key/value cache store.
- Redis – An in-memory key-value database that can persist data to disk.
Relational Database
- CockroachDB – A distributed database designed to build, scale, and manage data-intensive apps.
- Crate – A distributed SQL database that makes it simple to store and analyze massive amounts of data.
- MariaDB – A replacement of MySQL with more features, new storage engines and better performance.
- MySQL – One of the most popular open source transactional databases.
- PostgreSQL – An advanced RDBMS that supports an extended subset of the SQL standard.
- RQLite – A lightweight, distributed relational database, which uses SQLite as its storage engine.
Time Series Database
- Akumuli – Can be used to capture, store and process time-series data in real-time.
- InfluxDB – Scalable datastore for metrics, events, and real-time analytics.
- QuestDB – An open source SQL database designed for fast processing of time series data.
- TimescaleDB – Open-source time-series SQL database optimized for fast ingest and complex queries.
Vector Database
- Milvus – An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy.
- Pinecone – Managed and distributed vector similarity search used with a lightweight SDK.
File System
Tools related to file system and data storage.
- Alluxio – A virtual distributed storage system.
- Amazon Simple Storage Service (S3) – Object storage built to retrieve any amount of data from anywhere.
- Apache Hadoop Distributed File System (HDFS) – A distributed file system.
- GlusterFS – A software defined distributed storage that can scale to several petabytes.
- Google Cloud Storage (GCS) – Object storage for companies of all sizes, to store any amount of data.
- LizardFS – A highly reliable, scalable and efficient distributed file system.
- MinIO – High Performance, Kubernetes Native Object Storage compatible with Amazon S3 API.
- SeaweedFS – A fast distributed storage system for blobs, objects, files, and data lakes.
- Swift – A distributed object storage system designed to scale from a single machine to thousands of servers.
Logging and Monitoring
Tools used for logging and monitoring data workflows.
- Grafana – Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, InfluxDB and more.
- Loki – A horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus.
- Prometheus – A monitoring system and time series database.
SQL Query Engine
Tools for processing SQL statements in parallel.
- Apache Drill – Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.
- Apache Impala – Lightning-fast, distributed SQL queries for petabytes of data.
- Dremio – Power high-performing BI dashboards and interactive analytics directly on the data lake.
- Presto – A distributed SQL query engine for big data.
- Trino – A fast distributed SQL query engine for big data analytics.
Resources
Where to discover new tools and discuss existing ones.
Books
- Data Mesh: Delivering Data-Driven Value at Scale (O’Reilly)
- Designing Data-Intensive Applications (O’Reilly)
- Fundamentals of Data Engineering (O’Reilly)
- Getting Started with Impala (O’Reilly)
- Learning and Operating Presto (O’Reilly)
- Learning Spark: Lightning-Fast Data Analytics (O’Reilly)
- Spark in Action (O’Reilly)
- Spark: The Definitive Guide (O’Reilly)
Other Lists
Slack
Contributing
All contributions are welcome! Please take a look at the contribution guidelines first.