32 Data Pipeline and Workflow Management Frameworks for Efficient Data Processing and Machine Learning

Explore open source data pipeline and workflow management frameworks for efficient data processing and machine learning, ensuring seamless workflows.

Open Source Data Pipeline Frameworks

  • Apache Airflow

    Data pipeline framework written in Python, with a scheduler, code-defined DAGs and a web UI for visualisation.

    License: Apache License 2.0

  • Apache NiFi

    Apache NiFi was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic.

    License: Apache License 2.0

  • Argo Workflows

    Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).

    License: Apache License 2.0

  • Azkaban

    Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows.

    License: Apache License 2.0

  • Basin

    Visual programming editor for building Spark and PySpark pipelines.

    License: Other

  • BatchFlow

    BatchFlow helps data scientists conveniently work with random or sequential batches of their data and define data processing and machine learning workflows for large datasets.

    License: Apache License 2.0

  • Bonobo

    ETL framework for Python 3.5+ with a focus on simple atomic operations working concurrently on rows of data.

    License: Apache License 2.0

  • Chronos

    A fault-tolerant job scheduler for Apache Mesos; more of a distributed cron replacement than an ETL pipeline framework.

    License: Apache License 2.0

  • Couler

    Unified interface for constructing and managing machine learning workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.

    License: Apache License 2.0

  • D6tflow

    A Python library for building complex data science workflows.

    License: MIT License

  • DALL·E Flow

    DALL·E Flow is an interactive workflow for generating high-definition images from a text prompt.

    License: No License

  • Dagster

    A data orchestrator for machine learning, analytics, and ETL.

    License: Apache License 2.0

  • DBND

    DBND is an agile pipeline framework that helps data engineering teams track and orchestrate their data processes.

    License: Apache License 2.0

  • DBT

    ETL tool for running transformations inside data warehouses.

    License: Apache License 2.0

  • Flyte

    Lyft’s cloud-native machine learning and data processing platform.

    License: Apache License 2.0

  • Genie

    Job orchestration engine for interfacing with Hadoop-based systems and triggering job execution.

    License: Apache License 2.0

  • Gokart

    A wrapper around the data pipeline library Luigi.

    License: MIT License

  • Hamilton

    Hamilton is a micro-orchestration framework for defining dataflows. It runs anywhere Python runs (e.g. Jupyter, FastAPI, Spark, Ray, Dask) and brings software engineering best practices without you knowing it. Use it to define feature engineering transforms, end-to-end model pipelines, and LLM workflows. It complements macro-orchestration systems (e.g. Kedro, Luigi, Airflow, dbt) by replacing the code within those macro tasks.

    License: BSD 3-Clause Clear License

  • Instill VDP

    Instill VDP (Versatile Data Pipeline) aims to streamline the data processing pipelines from inception to completion.

    License: Unknown

  • Kedro

    Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned. Kedro workflows can be visualised with kedro-viz.

    License: Apache License 2.0

  • Ludwig

    Ludwig is a declarative machine learning framework that makes it easy to define machine learning pipelines using a simple and flexible data-driven configuration system.

    License: Apache License 2.0

  • Luigi

    Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc.

    License: Apache License 2.0

  • Metaflow

    A framework for data scientists to easily build and manage real-life data science projects.

    License: Apache License 2.0

  • Neuraxle

    A framework for building neat pipelines, providing the right abstractions to chain your data transformation and prediction steps with data streaming, as well as doing hyperparameter searches (AutoML).

    License: Apache License 2.0

  • Oozie

    Workflow scheduler for Hadoop jobs.

    License: Apache License 2.0

  • Pachyderm

    Open source distributed processing framework built on Kubernetes, focused mainly on dynamically building production machine learning pipelines.

    License: Apache License 2.0

  • PipelineX

    Based on Kedro and MLflow; the project documentation includes a full comparison.

    License: Other

  • Ploomber

    The fastest way to build data pipelines. Develop iteratively, deploy anywhere.

    License: Apache License 2.0

  • Prefect Core

    Workflow management system that makes it easy to take your data pipelines and add semantics like retries, logging, dynamic mapping, caching, failure notifications, and more.

    License: Apache License 2.0

  • SETL

    A simple Spark-powered ETL framework that helps you structure your ETL projects, modularize your data transformation logic and speed up your development.

    License: Apache License 2.0

  • Snakemake

    Workflow management system for reproducible and scalable data analyses.

    License: MIT License

  • Towhee

    General-purpose machine learning pipeline for generating embedding vectors using one or many ML models.

    License: Apache License 2.0

Last Updated: Dec 26, 2023