Open Source Data Pipeline Frameworks
Data pipeline framework built in Python, including a scheduler, DAG definition, and a UI for visualisation.
License: Apache License 2.0
Apache NiFi was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic.
License: Apache License 2.0
Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
License: Apache License 2.0
Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves execution order through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.
License: Apache License 2.0
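Resolving execution order from job dependencies, as described for the schedulers above, can be illustrated with a small standard-library sketch. It is not Azkaban's or any listed framework's API; the job names and the `execution_order` helper are hypothetical, and the approach shown is a plain topological sort.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical jobs: each job maps to the set of jobs it depends on.
jobs = {
    "extract": set(),
    "clean": {"extract"},
    "aggregate": {"clean"},
    "report": {"aggregate", "clean"},
}

def execution_order(dependencies):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())

order = execution_order(jobs)
print(order)
```

A real scheduler would also handle cycles (`graphlib` raises `CycleError`), failures, and parallel execution of independent jobs.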
BatchFlow helps data scientists conveniently work with random or sequential batches of their data and define data processing and machine learning workflows for large datasets.
License: Apache License 2.0
ETL framework for Python 3.5+ with focus on simple atomic operations working concurrently on rows of data.
License: Apache License 2.0
More of a job scheduler for Mesos than an ETL pipeline.
License: Apache License 2.0
Unified interface for constructing and managing machine learning workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
License: Apache License 2.0
A Python library for building complex data science workflows.
License: MIT License
DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt.
License: No License
A data orchestrator for machine learning, analytics, and ETL.
License: Apache License 2.0
DBND is an agile pipeline framework that helps data engineering teams track and orchestrate their data processes.
License: Apache License 2.0
ETL tool for running transformations inside data warehouses.
License: Apache License 2.0
Lyft’s cloud-native machine learning and data processing platform (Demo).
License: Apache License 2.0
Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems.
License: Apache License 2.0
Wrapper of the data pipeline Luigi.
License: MIT License
Hamilton is a micro-orchestration framework for defining dataflows. It runs anywhere Python runs (e.g. Jupyter, FastAPI, Spark, Ray, Dask) and brings software engineering best practices without you knowing it. Use it to define feature engineering transforms, end-to-end model pipelines, and LLM workflows. It complements macro-orchestration systems (e.g. Kedro, Luigi, Airflow, dbt) because it replaces the code within those macro tasks.
License: BSD 3-Clause Clear License
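The idea of defining a dataflow from plain functions can be sketched without any dependencies. The example below is an illustration of the general pattern only, not Hamilton's API: the transform functions, the `resolve` helper, and the input values are all hypothetical. Each function name acts as an output, and each parameter name refers to an input or another function.

```python
import inspect

# Hypothetical transforms: the function name is the output it produces,
# and each parameter name is an input or an upstream function it needs.
def spend_per_signup(spend, signups):
    return spend / signups

def spend_in_thousands(spend):
    return spend / 1000

def efficiency_score(spend_per_signup, spend_in_thousands):
    return spend_in_thousands / spend_per_signup

def resolve(name, functions, inputs, cache=None):
    """Compute `name` by recursively resolving its parameters by name."""
    cache = {} if cache is None else cache
    if name in inputs:
        return inputs[name]
    if name not in cache:
        func = functions[name]
        args = {
            param: resolve(param, functions, inputs, cache)
            for param in inspect.signature(func).parameters
        }
        cache[name] = func(**args)
    return cache[name]

functions = {
    f.__name__: f
    for f in (spend_per_signup, spend_in_thousands, efficiency_score)
}
result = resolve("efficiency_score", functions, {"spend": 2000.0, "signups": 40})
print(result)
```

Because dependencies are read from function signatures, each transform stays a small, individually testable function.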
Instill VDP (Versatile Data Pipeline) aims to streamline data processing pipelines from inception to completion.
License: Unknown
Ludwig is a declarative machine learning framework that makes it easy to define machine learning pipelines using a simple and flexible data-driven configuration system.
License: Apache License 2.0
A framework for building neat pipelines, providing the right abstractions to chain your data transformation and prediction steps with data streaming, as well as doing hyperparameter searches (AutoML).
License: Apache License 2.0
Open source distributed processing framework built on Kubernetes, focused mainly on dynamic building of production machine learning pipelines (Video).
License: Apache License 2.0
Based on Kedro and MLflow. A full comparison is found here.
License: Other
The fastest way to build data pipelines. Develop iteratively, deploy anywhere.
License: Apache License 2.0
Workflow management system that makes it easy to take your data pipelines and add semantics like retries, logging, dynamic mapping, caching, failure notifications, and more.
License: Apache License 2.0
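Adding semantics such as retries and logging to an existing pipeline step can be pictured as wrapping the step in a decorator. The following is a minimal standard-library sketch of that pattern, not the API of the framework listed above; `with_retries`, `flaky_load`, and the logger name are hypothetical.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def with_retries(max_attempts=3, delay_seconds=0.0):
    """Wrap a task so it is logged and retried up to `max_attempts` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    result = func(*args, **kwargs)
                    log.info("%s succeeded on attempt %d", func.__name__, attempt)
                    return result
                except Exception as exc:
                    log.warning("%s failed on attempt %d: %s", func.__name__, attempt, exc)
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_seconds)
        return wrapper
    return decorator

# Hypothetical task that fails twice before succeeding.
calls = {"count": 0}

@with_retries(max_attempts=3)
def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "loaded"

outcome = flaky_load()
print(outcome, calls["count"])
```

The task body stays unchanged; the retry and logging behaviour lives entirely in the wrapper.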
Workflow management system for reproducible and scalable data analyses.
License: MIT License
General-purpose machine learning pipeline for generating embedding vectors using one or many ML models.
License: Apache License 2.0
Last Updated: Dec 26, 2023