ML Pipeline Orchestration: Temporal, Airflow, Kubeflow, Ray — Which Layer Does What

By Gopi Krishna Tummala

Infrastructure-First MLOps — Building the Engine of AI


📖 You are reading Module 8: ML Pipeline Orchestration — The Conductor of AI Systems

TL;DR

Temporal = durable backend orchestration with exactly-once workflow semantics. Good for long-running ML coordination.
Airflow 3 (April 2025) = DAG scheduler for data pipelines with full DAG versioning. The industry standard for ETL. Not designed for distributed compute.
Prefect 3 = Python-native Airflow alternative with dynamic workflows and hybrid execution.
Dagster = asset-centric orchestration. You define data products; Dagster figures out what to run.
Metaflow 2.19 = ML-native pipeline SDK from Netflix. Decorator-based, AWS-native, great for DS teams.
ZenML = meta-orchestrator. Runs your pipeline on any backend (Airflow, Kubeflow, Vertex AI, etc.).
Kubeflow Pipelines v2 = containerized ML DAGs on Kubernetes with compile-time type checking via IR YAML.
TFX = opinionated, full-lifecycle ML framework. Pipeline-as-code with typed component contracts.
Ray 2.x = distributed Python execution engine. OpenAI uses it for ChatGPT training. The scaling layer beneath pipelines.
Vertex AI Pipelines = managed Kubeflow on GCP at $0.03/run. Eliminates K8s ops overhead.
MLflow Recipes = removed entirely in MLflow 3 (June 2025). Do not use.
Real FAANG ML systems layer these: scheduler → pipeline definition → execution layer → distributed compute → storage.

Act 0: The Confusion That Costs You Interviews

Here is a question that trips up a surprising number of ML engineers:

“Airflow, Kubeflow, Ray, Temporal — they all orchestrate steps. Aren’t they basically the same thing?”

No. They sit at completely different layers of the stack and solve different problems. Confusing them in a system design interview signals shallow infrastructure intuition.

The way to cut through this is to ask three questions for any orchestration tool:

What triggers execution? (Events, cron, upstream task completion, manual)
What is the unit of work? (Function call, Docker container, distributed task graph)
What compute does it target? (One server, a cluster, a GPU farm)

Let’s map every major tool against these axes.

Act I: Layer 1 — Durable Backend Orchestration

Temporal is code-first workflow orchestration designed for backend services where correctness and crash recovery matter more than schedule management.

flowchart TD
    A[Client submits\nworkflow] --> B[Temporal Server\nDurable execution history]
    B --> C[Worker 1\nActivity: validate_data]
    B --> D[Worker 2\nActivity: preprocess]
    D --> E[Worker 3\nActivity: trigger_training]
    E --> F[Wait for\nexternal signal\ndays if needed]
    F --> G[Activity: evaluate_model]
    G --> H[Activity: deploy_if_pass]

    style B fill:#6366f1,stroke:#4f46e5,color:#fff
    style F fill:#f59e0b,stroke:#d97706,color:#fff

The key idea in Temporal: workflows are durable. If your server crashes mid-execution, Temporal replays the event history and picks up exactly where it left off. It provides exactly-once semantics for workflow logic and at-least-once guarantees for activities (side-effect operations like API calls).

This is categorically different from Airflow: in Airflow, a failed task restarts from scratch. In Temporal, the workflow function resumes from the exact last activity checkpoint—even if that checkpoint happened three days ago and multiple servers have since restarted.
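The replay idea can be illustrated without Temporal itself. The sketch below is a toy model and not Temporal's SDK: `DurableRunner`, its in-memory `history`, and the `crash_after` switch are all hypothetical names, and a real deployment persists the history on the Temporal server and requires workflow code to be deterministic.

```python
from typing import Any, Callable, List, Optional


class DurableRunner:
    """Toy event-history replay: recorded activity results are never recomputed."""

    def __init__(self, history: Optional[List[Any]] = None) -> None:
        # In a real system this history lives in durable storage, not in memory.
        self.history: List[Any] = history if history is not None else []
        self._cursor = 0                # position in the history during replay
        self.executed: List[str] = []   # activities actually run by this runner

    def activity(self, name: str, fn: Callable[[], Any]) -> Any:
        if self._cursor < len(self.history):
            # Replay: reuse the recorded result instead of repeating the side effect.
            result = self.history[self._cursor]
        else:
            result = fn()
            self.history.append(result)
            self.executed.append(name)
        self._cursor += 1
        return result


def workflow(runner: DurableRunner, crash_after: Optional[int] = None) -> str:
    steps = [
        ("validate_data", lambda: "validated"),
        ("preprocess", lambda: "features"),
        ("trigger_training", lambda: "model-v1"),
    ]
    last = ""
    for index, (name, fn) in enumerate(steps):
        if crash_after is not None and index == crash_after:
            raise RuntimeError("simulated worker crash")
        last = runner.activity(name, fn)
    return last


# First worker crashes before the third activity.
first = DurableRunner()
try:
    workflow(first, crash_after=2)
except RuntimeError:
    pass

# A new worker loads the persisted history and replays the same workflow function.
second = DurableRunner(history=list(first.history))
result = workflow(second)
```

In this sketch the second worker re-enters the workflow function from the top, but only `trigger_training` actually executes; the first two activities are answered from the recorded history.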

How companies use it for ML:

Orchestrating AI-driven pipelines that analyze billions of signals where individual LLM API calls can time out
Long-running data workflows that span hours or days
Multi-agent AI systems where each step involves external service calls that need reliable retry/resume
Human-approval gates in ML deployment pipelines

Temporal is not a data pipeline tool—it has no native operator ecosystem, no built-in data lineage, no concept of datasets or assets. You’d still need Airflow for ETL scheduling and Ray for distributed training. Temporal fills the coordination layer between them.

Act II: Layer 2 — Data Pipeline Schedulers

This is where Airflow lives, and it is the most widely deployed orchestration tool in the data industry.

flowchart TD
    subgraph DAG["Daily Training DAG (Airflow 3)"]
        A[Sensor:\nwait for data] --> B[Operator:\nextract from S3]
        B --> C[Operator:\nrun Spark job]
        C --> D[Operator:\nvalidate schema]
        D --> E[Operator:\ntrigger training\nKubernetesPodOperator]
        E --> F[Operator:\npush model to registry]
    end

    style DAG fill:#1e1b4b,stroke:#6366f1,color:#a5b4fc
    style A fill:#374151,stroke:#6b7280,color:#f9fafb
    style E fill:#6366f1,stroke:#4f46e5,color:#fff

Airflow 3.0 (released April 2025, with Airflow 2.x entering security-only support April 2026) brought a major architectural shift. The headline feature that took a decade to ship: full DAG versioning. In Airflow 2.x, any DAG code change was immediately applied to historical runs, making past execution records unreliable. In Airflow 3, each DAG run is pinned to the exact DAG version that created it.
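A minimal sketch of what run-level version pinning means. It uses a hypothetical in-memory `DagRegistry`, not Airflow's actual storage model: each run records the identifier of the DAG definition it started from, so later edits do not rewrite its history.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DagRegistry:
    versions: Dict[str, List[str]] = field(default_factory=dict)  # version id -> task list
    runs: Dict[str, str] = field(default_factory=dict)            # run id -> version id
    latest: str = ""

    def publish(self, tasks: List[str]) -> str:
        """Store a new DAG definition and make it the current version."""
        version_id = hashlib.sha256("|".join(tasks).encode()).hexdigest()[:12]
        self.versions[version_id] = list(tasks)
        self.latest = version_id
        return version_id

    def start_run(self, run_id: str) -> None:
        """Pin the run to whatever version is current at start time."""
        self.runs[run_id] = self.latest

    def tasks_for_run(self, run_id: str) -> List[str]:
        return self.versions[self.runs[run_id]]


registry = DagRegistry()
registry.publish(["extract", "train"])
registry.start_run("2025-01-01")
registry.publish(["extract", "validate", "train"])  # DAG code changes
registry.start_run("2025-01-02")
```

Looking up the first run still returns the two-task definition it executed with, which is the property the unversioned model lacked.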

Other significant Airflow 3 changes:

Task Execution API + Task SDK: task code is now decoupled from the Airflow runtime. Tasks can run in remote containers or edge environments without needing the Airflow metadata DB installed. This enables true edge ML workflows.
Flask replaced with FastAPI: async support, Pydantic validation, better performance
New React UI: dark mode, support for 17 languages, redesigned from scratch
Edge Executor (GA): distributed/edge-compute workflows are now first-class
Airflow 3.1 (September 2025): Human-in-the-Loop task support and Deadline Alerts

The migration cost is real. Airflow 3 breaks operators that directly access the Airflow metadata database; they now need to go through the Task Execution API. Core operators have moved to the apache-airflow-providers-standard package. For large DAG libraries, this migration can take weeks.

What Airflow is not good at:

Defining ML pipelines at a high level (no concept of model artifacts, feature stores, or experiment tracking)
Distributed training directly (you’d use KubernetesPodOperator to submit a job elsewhere)
Data-science-friendly API (writing Airflow DAGs requires infrastructure knowledge)
Dynamic workflows where the graph structure is not known at parse time

Prefect 3: The Python-Native Alternative

Prefect 3.0 (released September 3, 2024) makes a compelling architectural bet: the Prefect server handles only orchestration metadata—scheduling, state tracking, UI. Actual compute runs on workers in your own infrastructure, which poll a work pool for scheduled runs. No compute ever touches Prefect’s servers.

The developer experience is fundamentally different from Airflow:

# Prefect: decorate your existing Python
from prefect import flow, task

@task
def extract_features(dataset_path: str):
    return load_features(dataset_path)

@task(retries=3, retry_delay_seconds=60)
def train_model(features, lr: float = 1e-4):
    return run_training(features, lr=lr)

@flow
def ml_pipeline(dataset_path: str):
    features = extract_features(dataset_path)
    model = train_model(features)
    return model

No DAG class, no Operator inheritance, no XCom for passing data between tasks. Just Python.

Prefect vs Airflow comparison:

Dimension	Prefect 3	Airflow 3
Code style	Decorate existing Python	Restructure into DAG + Operators
Dynamic workflows	Native (conditionals, loops)	Dynamic Task Mapping via `.expand()` (awkward)
Infrastructure	Hybrid — your compute, Prefect orchestration	Dedicated scheduler + webserver + workers
Operator ecosystem	Smaller, write more Python glue	Massive (10+ years, every service)
Asset-awareness	None	“Data Assets” added in Airflow 3 (early)

When to choose Prefect: Python-first teams, dynamic conditional workflows, teams wanting Airflow’s scheduling power without Airflow’s infrastructure complexity.

When to choose Airflow: You need the operator ecosystem (Snowflake, BigQuery, Databricks, EMR). You’re at a company with existing Airflow infrastructure. Your pipelines are fundamentally schedule-driven.

Dagster: Asset-Centric Orchestration

Dagster’s core abstraction is the Software-Defined Asset (SDA) — you define what data products should exist, and Dagster figures out what needs to run to materialize them. Airflow asks “did the task succeed?” Dagster asks “is this dataset fresh and correct?”

This sounds subtle but it changes everything about how you think about pipelines:

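# Dagster: define what should exist, not what to run
# Note: FreshnessPolicy(maximum_lag_minutes=...) is the older freshness API;
# newer Dagster releases may spell this differently.
from dagster import FreshnessPolicy, asset
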
@asset
def raw_features(context):
    return load_from_warehouse()

@asset
def trained_model(raw_features):
    return train(raw_features)

@asset(freshness_policy=FreshnessPolicy(maximum_lag_minutes=60))
def production_predictions(trained_model, raw_features):
    return trained_model.predict(raw_features)
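To make "Dagster figures out what to run" concrete, here is a toy resolver over the same three asset names. It is a sketch of the idea and not Dagster's scheduler: `DEPS` and `plan_materialization` are hypothetical, and staleness is passed in by hand rather than derived from freshness policies.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Asset -> upstream assets, mirroring the example above.
DEPS: Dict[str, Set[str]] = {
    "raw_features": set(),
    "trained_model": {"raw_features"},
    "production_predictions": {"trained_model", "raw_features"},
}


def plan_materialization(deps: Dict[str, Set[str]], stale: Set[str]) -> List[str]:
    """Return assets to rebuild in dependency order: stale ones plus everything downstream."""
    order = list(TopologicalSorter(deps).static_order())  # upstream assets come first
    to_run: Set[str] = set()
    for asset_name in order:
        if asset_name in stale or deps[asset_name] & to_run:
            to_run.add(asset_name)
    return [name for name in order if name in to_run]


plan = plan_materialization(DEPS, stale={"trained_model"})
```

Marking only `trained_model` as stale rebuilds it and `production_predictions`, while `raw_features` is left alone. The caller states which data is out of date; the run list is derived.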

Airflow 3’s response: Airflow 3.0 introduced “Data Assets” — a direct response to Dagster’s SDA model. It’s early and less mature, but signals the industry moving toward asset-awareness.

When to choose Dagster: Your team thinks in data products. You need partition-aware backfills (GDPR reprocessing, late-arriving data). You want built-in data quality checks without bolting on Great Expectations.

Act III: Layer 3 — ML-Native Pipeline Frameworks

This is where the tools were actually designed with ML workflows in mind.

Metaflow (Netflix)

Metaflow 2.19 is the current version. Netflix open-sourced it in 2019 and continues active development alongside Maestro (their internal workflow orchestrator). Recent additions (2025):

@checkpoint decorator: save and resume long training runs mid-step
spin command: create flows interactively, useful for agentic workflows
Native uv support for dependency management
Recursive and conditional steps for agentic patterns
Cards: interactive visual outputs built directly into pipeline steps—custom observability dashboards as a first-class feature

from metaflow import FlowSpec, step, batch, Parameter, checkpoint

class TrainingFlow(FlowSpec):
    lr = Parameter("lr", default=1e-4)

    @step
    def start(self):
        self.next(self.preprocess)

    @batch(cpu=8, memory=32000)
    @step
    def preprocess(self):
        # Runs on AWS Batch, auto-scaled
        self.dataset = load_and_preprocess()
        self.next(self.train)

    @batch(gpu=8, memory=320000)
    @checkpoint  # Added in 2025: resume from last checkpoint on failure
    @step
    def train(self):
        # Runs on a GPU instance
        self.model = train_model(self.dataset, lr=self.lr)
        self.next(self.evaluate)

    @step
    def evaluate(self):
        self.metrics = evaluate(self.model)
        self.next(self.end)

    @step
    def end(self):
        print(f"Accuracy: {self.metrics['accuracy']}")

AWS Step Functions integration: Metaflow can compile any flow directly into an AWS Step Functions state machine. All decorators (@batch, @retry, @catch, @timeout) work within Step Functions. The @schedule decorator triggers via AWS EventBridge.

Metaflow vs ZenML: Metaflow has better UX for iterative research and is deeply AWS-native. ZenML has a stronger production MLOps governance story and is cloud-agnostic.

Kubeflow Pipelines v2

KFP v2 brought a fundamental architectural change: the compilation output is now a serialized PipelineSpec protobuf message in YAML form (called IR YAML — Intermediate Representation). Both individual components and full pipelines compile to this same format, eliminating the v1 confusion of two separate YAML schemas.

Type checking now happens at compile time across component boundaries:

from kfp import dsl
from kfp.dsl import Dataset, Input, Metrics, Model, Output

@dsl.component(base_image="pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime")
def train_model(
    train_data: Input[Dataset],
    learning_rate: float,
    model: Output[Model],
    metrics: Output[Metrics],
):
    # Type violations are caught at compile time, not runtime
    import torch
    result = run_training(train_data.path, lr=learning_rate)
    model.metadata["framework"] = "pytorch"
    metrics.log_metric("accuracy", result.accuracy)

@dsl.pipeline(name="training-pipeline")
def training_pipeline(lr: float = 1e-4):
    data_op = ingest_data()
    train_op = train_model(
        train_data=data_op.outputs["dataset"],
        learning_rate=lr,
    )
    evaluate_op = evaluate(model=train_op.outputs["model"])

# Compiles to portable IR YAML — runs on Kubeflow or Vertex AI
from kfp import compiler
compiler.Compiler().compile(training_pipeline, "pipeline.yaml")
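The compile-time check can be pictured as a pass over the pipeline's edges before any container starts. The sketch below is plain Python and does not use the KFP compiler: `ComponentSpec` and `compile_pipeline` are hypothetical names, and artifact types are compared as simple strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ComponentSpec:
    name: str
    inputs: Dict[str, str]    # parameter name -> artifact type name
    outputs: Dict[str, str]   # output name -> artifact type name


# (producer, output name, consumer, input parameter)
Edge = Tuple[ComponentSpec, str, ComponentSpec, str]


def compile_pipeline(edges: List[Edge]) -> List[str]:
    """Check every producer/consumer edge before anything runs; return the type errors."""
    errors = []
    for producer, output, consumer, param in edges:
        produced = producer.outputs[output]
        expected = consumer.inputs[param]
        if produced != expected:
            errors.append(
                f"{producer.name}.{output} is {produced}, "
                f"but {consumer.name}.{param} expects {expected}"
            )
    return errors


ingest = ComponentSpec("ingest_data", inputs={}, outputs={"dataset": "Dataset"})
train = ComponentSpec("train_model", inputs={"train_data": "Dataset"}, outputs={"model": "Model"})
evaluate = ComponentSpec("evaluate", inputs={"model": "Model"}, outputs={})

good_errors = compile_pipeline([
    (ingest, "dataset", train, "train_data"),
    (train, "model", evaluate, "model"),
])
# Mis-wired: a Dataset artifact fed into a parameter that expects a Model.
bad_errors = compile_pipeline([(ingest, "dataset", evaluate, "model")])
```

The mis-wired pipeline is rejected with a message naming both ends of the edge, before any pod is scheduled.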

flowchart TD
    subgraph KFP["Kubeflow Pipeline v2"]
        A[Component: ingest_data\nPod: data-ingest:v3\nOutput: Dataset artifact] --> B[Component: validate\nPod: tfx-validate:latest\nOutput: Schema artifact]
        B --> C[Component: transform\nPod: spark-transform:v2\nOutput: Dataset artifact]
        C --> D[Component: train\nPod: pytorch-trainer:v8\ngpu=8\nOutput: Model artifact]
        D --> E[Component: evaluate\nPod: evaluator:v3\nOutput: Metrics artifact]
        E --> F{Accuracy > 0.95?}
        F -->|Yes| G[Component: push_model\nArtifact Registry]
        F -->|No| H[Alert + abort]
    end

    style KFP fill:#1e1b4b,stroke:#6366f1,color:#a5b4fc
    style D fill:#6366f1,stroke:#4f46e5,color:#fff
    style F fill:#f59e0b,stroke:#d97706,color:#000
    style G fill:#10b981,stroke:#059669,color:#fff
    style H fill:#ef4444,stroke:#dc2626,color:#fff

Why containerization matters: each step is reproducible, independently versioned, and can run on any pod in the cluster. The training step requests 8 GPUs; the data validation step requests 64 CPU cores and 512GB RAM. The pipeline definition is separate from the resource requirements.

The local development problem: KFP’s biggest pain point is that testing a pipeline locally requires either mocking or a local K8s cluster. This makes iteration slow compared to Metaflow or Prefect, which run locally without infrastructure.

TFX (TensorFlow Extended)

TFX is Google’s internal ML production framework, open-sourced. It is the most opinionated option—it defines a strict contract for each component:

TFX Component	Input type	Output type
`ExampleGen`	Raw data path	`Examples`
`StatisticsGen`	`Examples`	`ExampleStatistics`
`SchemaGen`	`ExampleStatistics`	`Schema`
`ExampleValidator`	`Examples` + `Schema`	`ExampleAnomalies`
`Transform`	`Examples` + `Schema`	`TransformGraph` + `Examples`
`Trainer`	`Examples` + `Schema` + `TransformGraph`	`Model`
`Evaluator`	`Examples` + `Model`	`ModelEvaluation`
`Pusher`	`Model` + `ModelEvaluation`	Deployed model

The value of TFX: if your schema changes, the validation fails before it reaches training—not after you’ve wasted 8 hours of GPU time. The type contracts between components act as compile-time checks for your data pipeline.
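A rough sketch of that fail-before-training behavior, in plain Python and not the TFX API. `infer_schema`, `find_anomalies`, and `run_pipeline` are hypothetical stand-ins for the roles that `SchemaGen`, `ExampleValidator`, and the pipeline runner play.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def infer_schema(rows: List[Row]) -> Dict[str, type]:
    """Derive column -> type from the first row of a trusted baseline."""
    return {name: type(value) for name, value in rows[0].items()}


def find_anomalies(rows: List[Row], schema: Dict[str, type]) -> List[str]:
    """List every row/column that disagrees with the schema."""
    anomalies = []
    for index, row in enumerate(rows):
        for name, expected in schema.items():
            if name not in row:
                anomalies.append(f"row {index}: missing column '{name}'")
            elif not isinstance(row[name], expected):
                found = type(row[name]).__name__
                anomalies.append(f"row {index}: '{name}' is {found}, expected {expected.__name__}")
    return anomalies


def run_pipeline(rows: List[Row], schema: Dict[str, type], train: Callable[[List[Row]], str]) -> str:
    """Validate first; the expensive training step only runs on clean data."""
    anomalies = find_anomalies(rows, schema)
    if anomalies:
        raise ValueError("; ".join(anomalies))
    return train(rows)


training_calls = []


def fake_train(rows: List[Row]) -> str:
    training_calls.append(len(rows))
    return "model"


schema = infer_schema([{"user_id": 1, "clicks": 3}])
ok_result = run_pipeline([{"user_id": 2, "clicks": 5}], schema, fake_train)

try:
    run_pipeline([{"user_id": 3, "clicks": "7"}], schema, fake_train)
    rejected = False
except ValueError:
    rejected = True
```

The batch with a string-typed `clicks` column is rejected and `fake_train` is never called for it, which is the GPU time the typed contract saves.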

The cost: you need to write TensorFlow. If you’re a PyTorch shop, TFX is largely inaccessible.

ZenML: The Meta-Orchestrator

ZenML takes a different approach: it is not a pipeline executor—it is a meta-orchestrator. When you run a ZenML pipeline, ZenML translates it and submits it to whatever underlying orchestrator you’ve configured:

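# The same ZenML pipeline code runs on any backend
import pandas as pd
from sklearn.base import ClassifierMixin
from sklearn.ensemble import RandomForestClassifier
from zenml import pipeline, step
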
@step
def train_model(dataset: pd.DataFrame) -> ClassifierMixin:
    model = RandomForestClassifier()
    model.fit(dataset.drop("label", axis=1), dataset["label"])
    return model

@pipeline
def training_pipeline():
    data = load_data()
    model = train_model(data)
    evaluate(model, data)

Change one line in your stack configuration to switch from local execution to Kubeflow to Vertex AI to SageMaker Pipelines. ZenML handles containerization automatically—no Dockerfiles or YAML required.

What ZenML adds on top of the underlying orchestrator:

Artifact versioning and lineage tracking (Model Control Plane)
Rollback capability: ZenML snapshots exact code, Pydantic versions, and container state per step
Stack abstraction: swap orchestrator, artifact store, model registry, experiment tracker independently
Native LangGraph and LlamaIndex integration for AI agent pipelines (2025)

ZenML’s core value proposition: avoid orchestrator lock-in. The same pipeline code runs everywhere. This is especially valuable if you’re not sure whether you’ll standardize on Kubeflow, Vertex AI, or SageMaker.
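The meta-orchestrator pattern itself is small enough to sketch in plain Python. Nothing below is ZenML's API: `Backend`, `LocalBackend`, `DryRunBackend`, and `run` are hypothetical names, and a real backend would containerize the steps and submit them to a remote orchestrator.

```python
from typing import Callable, Dict, List, Protocol

Context = Dict[str, object]
Step = Callable[[Context], Context]


class Backend(Protocol):
    def submit(self, steps: List[Step], context: Context) -> Context: ...


class LocalBackend:
    """Runs the steps in-process, one after another."""

    def submit(self, steps: List[Step], context: Context) -> Context:
        for step_fn in steps:
            context = step_fn(context)
        return context


class DryRunBackend:
    """Stands in for a remote orchestrator: records the plan instead of executing it."""

    def submit(self, steps: List[Step], context: Context) -> Context:
        return {"plan": [step_fn.__name__ for step_fn in steps]}


def load_data(context: Context) -> Context:
    return {**context, "rows": [1, 2, 3, 4]}


def train_model(context: Context) -> Context:
    rows = context["rows"]
    return {**context, "model": sum(rows) / len(rows)}  # "model" is just the mean here


PIPELINE: List[Step] = [load_data, train_model]


def run(backend: Backend) -> Context:
    # The pipeline definition never changes; only the backend does.
    return backend.submit(PIPELINE, {})


local_result = run(LocalBackend())
remote_plan = run(DryRunBackend())
```

Swapping the backend object is the only change between the two runs, which mirrors the single stack-configuration change described above.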

Act IV: Layer 4 — Distributed Compute Engines

The tools above orchestrate which steps to run and when. This layer is about how to execute a single step across many machines.

The Foundation: MapReduce

Before Spark, Ray, and Kubeflow existed, Google published the MapReduce paper (2004) and it shaped how the entire industry thinks about distributed data processing. Understanding it is not optional for ML infra interviews—it is the conceptual substrate beneath almost everything else in this layer.

The core insight: break a large computation into two phases that can run across arbitrary numbers of machines.

flowchart LR
    subgraph Input["Input Splits"]
        S1[Shard 1\n10M rows]
        S2[Shard 2\n10M rows]
        S3[Shard 3\n10M rows]
    end

    subgraph Map["Map Phase\nRun in parallel on each shard"]
        M1[Map worker 1\nemit key-value pairs]
        M2[Map worker 2\nemit key-value pairs]
        M3[Map worker 3\nemit key-value pairs]
    end

    subgraph Shuffle["Shuffle + Sort\nGroup by key across all workers"]
        SH[All values for key K\nsent to same reducer]
    end

    subgraph Reduce["Reduce Phase\nAggregate per key"]
        R1[Reducer 1\nkey range A-M]
        R2[Reducer 2\nkey range N-Z]
    end

    S1 --> M1
    S2 --> M2
    S3 --> M3
    M1 --> SH
    M2 --> SH
    M3 --> SH
    SH --> R1
    SH --> R2

    style Map fill:#1e1b4b,stroke:#6366f1,color:#a5b4fc
    style Shuffle fill:#7c2d12,stroke:#ea580c,color:#fed7aa
    style Reduce fill:#14532d,stroke:#16a34a,color:#bbf7d0

The shuffle is the bottleneck. Every mapper writes its output to local disk. The framework then sorts and transfers the right keys to the right reducers across the network. At scale (thousands of nodes, terabytes of intermediate data), the shuffle phase can dominate total job time. This is why Spark’s in-memory model was a step change over Hadoop MapReduce—Spark avoids writing intermediate results to disk when memory permits.
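The three phases fit in a few lines of single-process Python. This is a sketch of the pattern only: the function names are illustrative, and a real framework would run each mapper and reducer on a different machine and move the shuffle data over the network.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(shard: List[str]) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every word in the shard."""
    return [(word.lower(), 1) for sentence in shard for word in sentence.split()]


def shuffle(mapped: List[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Group values by key across all mapper outputs (the expensive step at scale)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Aggregate all values that share a key."""
    return {key: sum(values) for key, values in groups.items()}


shards = [["the cat sat"], ["the dog sat"], ["the end"]]
counts = reduce_phase(shuffle([map_phase(shard) for shard in shards]))
```

Every value for a given word has to reach the same reducer, so in a cluster the `shuffle` step is the one that moves data between machines.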

How this maps to ML workloads:

ML task	Map	Reduce
Feature aggregation	Per-user event → (user_id, feature_vector)	Aggregate features per user
Word count (classic)	sentence → (word, 1)	Sum counts per word
Gradient computation	Per-batch loss → (layer, gradient)	Average gradients across workers (AllReduce)
Dataset statistics	Per-shard → (feature, stats)	Merge statistics globally

AllReduce is distributed gradient aggregation reframed as MapReduce. Each GPU computes gradients on its local batch (map). The AllReduce collective sums or averages gradients across all workers (reduce). The difference: AllReduce is commonly implemented as a ring with no central reducer, so no single node has to receive every worker's gradients, which makes it far more bandwidth-efficient than a classic MapReduce shuffle at gradient scale.
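A single-process simulation of a ring all-reduce shows why no central reducer is needed. It is a sketch under simplifying assumptions: each worker's gradient has exactly one element per worker (one chunk each), and the "sends" are list updates, not NCCL calls.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Sum gradients across workers using only neighbor-to-neighbor transfers."""
    n = len(grads)
    if any(len(g) != n for g in grads):
        raise ValueError("this sketch assumes one chunk (element) per worker")
    data = [list(g) for g in grads]

    # Phase 1, reduce-scatter: after n-1 steps, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        # Snapshot the outgoing values first so all sends in a step are simultaneous.
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


reduced = ring_allreduce([
    [1.0, 2.0, 3.0],
    [10.0, 20.0, 30.0],
    [100.0, 200.0, 300.0],
])
```

Each worker sends 2(n-1) chunks in total and only ever talks to its right-hand neighbor. Dividing the result by the worker count gives the averaged gradient.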

Spark is the modern implementation of the MapReduce pattern with in-memory execution, a richer API (DataFrames, SQL, MLlib), and support for iterative algorithms—which Hadoop MapReduce handled poorly because each iteration wrote to disk. For ML preprocessing at terabyte scale, Spark is the standard choice.

Ray extends the model further: instead of just map and reduce, Ray supports arbitrary directed task graphs with stateful actors. A Ray training job is essentially a generalized MapReduce where the “map” is distributed forward/backward passes and the “reduce” is gradient AllReduce—but with the flexibility to express any computation pattern.

flowchart LR
    subgraph Ray["Ray Cluster (2.x)"]
        Head[Head Node\nScheduler\nGCS Server] --> W1[Worker: GPU 0-7\nRay Train shard A]
        Head --> W2[Worker: GPU 8-15\nRay Train shard B]
        Head --> W3[Worker: GPU 16-23\nRay Train shard C]
        W1 --> PS[AllReduce\nNVLink / NCCL]
        W2 --> PS
        W3 --> PS
        PS --> Tune[Ray Tune\nHPO trials\nASHA scheduler]
    end

    style Ray fill:#1e1b4b,stroke:#6366f1,color:#a5b4fc
    style Head fill:#6366f1,stroke:#4f46e5,color:#fff
    style PS fill:#f59e0b,stroke:#d97706,color:#000
    style Tune fill:#10b981,stroke:#059669,color:#fff

Ray 2.x is the dominant distributed ML execution engine. OpenAI uses Ray to train ChatGPT—Greg Brockman stated directly: “We are using Ray to train our largest models.” At Uber, switching to a heterogeneous Ray cluster (8 GPU + 9 CPU nodes) versus 16 homogeneous GPU nodes achieved 50% higher throughput at lower capital cost by mixing workloads. Benchmark studies have demonstrated 1,657% speedups for distributed LLM batch inference with Ray versus single-node.

Ray’s ecosystem is now mature across the full ML lifecycle:

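from ray.train import CheckpointConfig, DataConfig, RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

def train_fn(config):
    # Per-worker training loop (placeholder). Wrap the model with
    # ray.train.torch.prepare_model() to get DDP/FSDP wiring.
    ...

# Ray Train launches the workers and sets up the torch process group
trainer = TorchTrainer(
    train_loop_per_worker=train_fn,
    scaling_config=ScalingConfig(
        num_workers=32,
        use_gpu=True,
        resources_per_worker={"GPU": 1},
    ),
    # Checkpoint retention is configured through RunConfig, not on the trainer directly
    run_config=RunConfig(
        checkpoint_config=CheckpointConfig(
            num_to_keep=3,
            checkpoint_score_attribute="eval_loss",
            checkpoint_score_order="min",  # lower eval_loss is better
        ),
    ),
    # Pass datasets={"train": ...} as well so there is a "train" dataset to split
    dataset_config=DataConfig(datasets_to_split=["train"]),
)
result = trainer.fit()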

Ray’s key components:

Ray Train: distributed training for PyTorch, TF, HuggingFace, XGBoost with automatic DDP/FSDP
Ray Data: streaming pipeline for CPU preprocessing + GPU training. Anyscale benchmarks (October 2025) show Ray Data outperforming alternatives on multimodal AI workloads (text + image + video) due to heterogeneous CPU/GPU pipelining
Ray Tune: HPO at scale. Uber reported 4× speedup using Ray Tune’s distributed search versus sequential approaches. Supports ASHA, HyperBand, Optuna, Ax.
Ray Serve: production model serving with autoscaling and batching

Ray is not a workflow scheduler: it does not handle scheduling, retries, or complex dependency graphs between pipeline stages. That’s Airflow’s or Kubeflow’s job. Ray does the compute; the pipeline orchestrator controls the flow.

Spark plays a similar role for data processing at scale—when your preprocessing step involves terabytes of tabular data, you submit a Spark job. Airflow has a SparkSubmitOperator for exactly this pattern.

Act V: Vertex AI Pipelines — The Managed Path

Vertex AI Pipelines is Google’s managed Kubeflow service. It uses the same KFP SDK v2 IR YAML format as self-hosted Kubeflow—pipeline code is portable between the two.

Pricing reality: Vertex AI Pipelines charges $0.03 per pipeline run for orchestration (separate from compute costs). A team running 200 pipeline runs/day pays ~$180/month just for orchestration before any GPU charges. Custom training nodes start at $21.25/hour.
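The arithmetic behind that figure, plus a break-even helper, as a small sketch. The $0.03 fee and the 200 runs/day come from the text above; the 30-day month and the $900/month self-hosting figure in the usage line are illustrative assumptions, not quoted prices.

```python
FEE_CENTS_PER_RUN = 3  # the $0.03 orchestration fee, kept in cents to avoid float drift


def monthly_orchestration_cost_usd(runs_per_day: int, days: int = 30) -> float:
    """Orchestration fee only; compute is billed separately."""
    return runs_per_day * days * FEE_CENTS_PER_RUN / 100


def breakeven_runs_per_day(self_hosted_monthly_usd: int, days: int = 30) -> int:
    """Smallest daily run count at which per-run fees reach a given self-hosting cost."""
    total_cents = self_hosted_monthly_usd * 100
    cents_per_daily_run = days * FEE_CENTS_PER_RUN
    return -(-total_cents // cents_per_daily_run)  # ceiling division on integers


cost_200 = monthly_orchestration_cost_usd(200)
breakeven = breakeven_runs_per_day(900)  # hypothetical: $900/month of cluster and ops cost
```

Under that hypothetical $900/month figure the fee only catches up at about a thousand runs per day, which is why the per-run price tends to matter only at high volume.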

vs. Self-hosted Kubeflow:

Dimension	Vertex AI Pipelines	Self-hosted Kubeflow
Orchestration cost	$0.03/run	Free (K8s cluster costs instead)
Kubernetes expertise needed	None	Significant
GCP integration	Deep (Feature Store, BigQuery ML, Model Registry)	Manual
Portability	GCP-only	Any K8s cluster
UI	Polished, managed	Functional, self-maintained
DevOps burden	Minimal	Cluster admin, upgrades, monitoring

Sweet spot: GCP-native teams where DevOps bandwidth is the constraint. Startups where the $0.03/run fee is cheaper than DevOps time spent operating Kubernetes. At very high pipeline run volumes, self-hosted Kubeflow becomes more cost-effective.

Act VI: The Full Production Stack

Real ML systems at companies like Google, Meta, and Waymo are not using a single tool. They layer these systems:

flowchart TD
    subgraph L1["Layer 1: Scheduling"]
        S[Airflow 3 DAG\nCron: daily 2am\nor Vertex AI schedule\nor Temporal coordination]
    end

    subgraph L2["Layer 2: Pipeline Definition"]
        P1[Kubeflow Pipeline v2\nor TFX\nor Metaflow\nor ZenML]
    end

    subgraph L3["Layer 3: Execution Engine"]
        E1[Kubernetes\nPod scheduling\nGPU resource quotas\nKubernetes operators]
    end

    subgraph L4["Layer 4: Distributed Compute"]
        D1[Ray cluster\nor Spark\nor PyTorch DDP / FSDP directly]
    end

    subgraph L5["Layer 5: Storage + Metadata"]
        ST1[GCS / S3\nFeature Store\nArtifact Registry\nMLMD / MLflow Tracking]
    end

    S --> P1
    P1 --> E1
    E1 --> D1
    D1 --> ST1
    ST1 --> P1

    style L1 fill:#1e1b4b,stroke:#6366f1,color:#a5b4fc
    style L2 fill:#14532d,stroke:#16a34a,color:#bbf7d0
    style L3 fill:#7c2d12,stroke:#ea580c,color:#fed7aa
    style L4 fill:#1e3a5f,stroke:#3b82f6,color:#bfdbfe
    style L5 fill:#4a1d96,stroke:#7c3aed,color:#e9d5ff

A concrete example — daily model retraining at scale:

Airflow 3 triggers daily at 2am. An asset-based dependency holds the run until yesterday’s feature store data has landed (asset-aware scheduling is native in Airflow 3).
Kubeflow Pipeline v2 is triggered via KubernetesPodOperator. Steps: validate new data (type-checked against the stored schema) → run feature transforms → launch training job → evaluate → compare to production model.
Kubernetes schedules each pipeline step as a pod. The training step requests 64 A100s via a RayCluster custom resource.
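Ray Train runs the distributed training job with automatic FSDP setup across those 64 GPUs, writing checkpoints to GCS every 10 minutes. The storage location is set through ray.train.RunConfig, retention through ray.train.CheckpointConfig, and the cadence is decided inside the training loop.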
MLMD records the lineage: which data version, which pipeline version, which model artifact. MLflow tracking logs metrics per run.
If the new model passes evaluation thresholds, Kubeflow triggers a deployment component. Otherwise, it fires a PagerDuty alert.
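The final gate in that walkthrough reduces to a small decision function. This is a sketch and not a Kubeflow component: `promotion_gate` and `GateDecision` are hypothetical names, the 0.95 floor is borrowed from the Kubeflow diagram earlier, and the production comparison is one plausible rule among many.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GateDecision:
    promote: bool
    reason: str


def promotion_gate(
    candidate: Dict[str, float],
    production: Dict[str, float],
    min_accuracy: float = 0.95,
) -> GateDecision:
    """Promote only if the candidate clears the floor and does not regress."""
    accuracy = candidate["accuracy"]
    if accuracy <= min_accuracy:
        return GateDecision(False, f"accuracy {accuracy:.3f} does not exceed {min_accuracy:.2f}")
    if accuracy < production["accuracy"]:
        return GateDecision(False, "candidate is worse than the production model")
    return GateDecision(True, "passed thresholds")


passing = promotion_gate({"accuracy": 0.97}, {"accuracy": 0.96})
below_floor = promotion_gate({"accuracy": 0.94}, {"accuracy": 0.90})
regression = promotion_gate({"accuracy": 0.96}, {"accuracy": 0.98})
```

In a pipeline, the `promote` flag would select between the deployment branch and the alerting branch, and `reason` would go into the alert payload.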

Act VII: Tool Selection Decision Tree

flowchart TD
    A{What are you building?} --> C[Long-running stateful coordination\nSurvives crashes, approval gates]
    A --> D[Scheduled data ETL\nSQL transforms / reporting]
    A --> E[Python-first workflows\nDynamic / conditional graphs]
    A --> F[Data product management\nAsset freshness / lineage]
    A --> G[ML experiment pipeline\nTrain / evaluate / iterate]
    A --> H[Production ML on Kubernetes\nStrong typing + lineage]
    A --> I[Distributed training or\nlarge-scale preprocessing]
    A --> J[Orchestrator-agnostic ML\nMulti-cloud / vendor flexibility]

    C --> C1[Temporal]
    D --> D1[Airflow 3 / dbt]
    E --> E1[Prefect 3]
    F --> F1[Dagster]
    G --> G1[Metaflow / MLflow Tracking\n+ your orchestrator]
    H --> H1[Kubeflow Pipelines v2\nor Vertex AI Pipelines\nor TFX]
    I --> I1[Ray 2.x / Spark\n+ PyTorch DDP / FSDP]
    J --> J1[ZenML]

    style A fill:#6366f1,stroke:#4f46e5,color:#fff
    style C1 fill:#374151,stroke:#6b7280,color:#f9fafb
    style D1 fill:#374151,stroke:#6b7280,color:#f9fafb
    style E1 fill:#10b981,stroke:#059669,color:#fff
    style F1 fill:#10b981,stroke:#059669,color:#fff
    style G1 fill:#10b981,stroke:#059669,color:#fff
    style H1 fill:#10b981,stroke:#059669,color:#fff
    style I1 fill:#6366f1,stroke:#4f46e5,color:#fff
    style J1 fill:#6366f1,stroke:#4f46e5,color:#fff

Act VIII: The Comparison Table

Tool	Trigger	Unit of Work	Compute Target	ML-Native?	2025-26 Status
Temporal	Code-defined	Activity (any process)	Worker fleet	No	Active, growing
Airflow	Cron / upstream	Operator	Any (via operators)	No	v3.0 April 2025
Prefect	Any	Task (Python function)	Hybrid workers	No	v3.0 Sept 2024
Dagster	Any / asset trigger	Op / Asset	Any	Partial	Active
Metaflow	Manual / schedule	Step (Python function)	Cloud instances	Yes	v2.19.x
ZenML	Any (via backend)	Step (Python function)	Any (meta-orchestrator)	Yes	Active
Kubeflow	Pipeline trigger	Component (Docker)	Kubernetes cluster	Yes	v2 / KFP SDK 2.15
TFX	Airflow / Kubeflow	TFX Component	K8s / local	Yes (TF-only)	Stable
Vertex AI Pipelines	Schedule / trigger	Component (container)	GCP managed	Yes	$0.03/run
Ray	Code	Remote task / actor	Ray cluster	Yes	v2.x, used by OpenAI
Spark	Code / Airflow	Stage	Spark cluster	Partial	Stable
MLflow Recipes	—	—	—	—	Removed in MLflow 3

MLflow Recipes removal note: MLflow Recipes (previously MLflow Pipelines) was removed entirely in MLflow 3.0 (June 2025). It was too rigid—real ML workflows don’t fit the six-step template (ingest → split → transform → train → evaluate → register). The caching was unreliable in practice. MLflow 3 refocused on tracking and model registry; for orchestration, they recommend using MLflow tracking inside whichever orchestrator you prefer.

Act IX: What to Say in Interviews

Q: “Design an ML pipeline that retrains a recommendation model daily.”

The wrong answer: “I’d use Airflow.” (Too vague—Airflow is just the scheduler, not the whole stack.)

The right answer:

“I’d layer this. Airflow 3 handles the daily schedule and dependency checks—it uses Data Assets to verify yesterday’s interaction logs have landed before triggering the pipeline. The pipeline definition itself would be in Kubeflow Pipelines v2 if we need strict containerization and artifact lineage, or Metaflow if the team is data-scientist-heavy and AWS-native. Each pipeline step runs as a Kubernetes pod so we can right-size compute per step—feature computation on Spark, distributed training on a Ray cluster using FSDP for the two-tower model. Artifacts and lineage go into MLMD or an artifact registry. The final step compares new model metrics to production baseline using shadow traffic—if it passes thresholds, it promotes. If not, it pages on-call and logs the failure with the artifact lineage so we can diff what changed.”

Q: “When would you use Ray vs Kubeflow?”

“They solve different problems and appear together in production. Ray is a distributed execution engine—use it when a single Python process isn’t enough, because your training job needs 64 GPUs or your preprocessing needs 10TB processed in parallel. OpenAI uses Ray for training ChatGPT. Kubeflow is a pipeline orchestrator—it defines the DAG of steps, manages typed artifacts between them, and handles conditionals like ‘only deploy if accuracy exceeds threshold.’ A Kubeflow pipeline step would submit a Ray training job. Ray does the compute; Kubeflow tracks the lineage and controls the flow.”

Q: “What’s the difference between Airflow and Dagster?”

“The core abstraction is different. Airflow asks ‘did the task succeed?’ Dagster asks ‘is this data asset fresh and correct?’ Dagster’s Software-Defined Assets let you declare what data products should exist and their freshness requirements—Dagster figures out what runs to materialize them. Airflow 3 added Data Assets as a response to this, but Dagster’s asset model is more mature. For schedule-driven ETL with a large operator ecosystem, Airflow wins. For teams that think in data products and need partition-aware backfills or built-in data quality checks, Dagster is a better fit.”

Q: “What is ZenML and when would you use it?”

“ZenML is a meta-orchestrator—it’s not a pipeline executor itself, it’s a layer that translates your pipeline definition and submits it to whatever backend you’ve configured: Airflow, Kubeflow, Vertex AI, SageMaker Pipelines. The value is avoiding orchestrator lock-in. You write your pipeline once in Python with ZenML decorators, and you can switch the backend by changing a stack configuration. It also adds artifact versioning, model lineage, and rollback capabilities on top of whatever backend you choose. I’d reach for it when a team doesn’t know yet which orchestrator they’ll standardize on, or when they need to run the same pipeline across multiple cloud providers.”

Act X: Your Context (AV / Zoox Background)

If you’ve worked on autonomous vehicle ML infrastructure, you’re already close to the serious end of this stack. AV companies typically run:

Internal schedulers built on Airflow or similar, with AV-specific sensors for data readiness (was yesterday’s lidar annotation batch complete? did the sync from the fleet land?)
Custom pipeline DSLs with Kubeflow/TFX-style typed component contracts (Waymo’s internal framework shares these concepts)
Large-scale distributed training with PyTorch FSDP across hundreds of GPUs for perception models—this is exactly the Ray use case
Metaflow or similar for research experiments, with promotion paths to production pipelines

In interviews, anchor your answers to this scale: “at Zoox, retraining a perception model meant coordinating 512 GPUs, validating against thousands of sensor-annotated scenarios across scenario categories—fog, night, construction zones—before any deployment. You’re looking at Kubernetes, distributed training via Ray Train or DDP, and a pipeline system with artifact contracts that prevents you from accidentally deploying a model trained on stale annotation data.”

That specificity signals you understand production ML infrastructure, not just the tooling vocabulary.

Quick Reference: Intuition Shortcuts

Tool	One-Line Mental Model
Temporal	Crash-proof backend workflows with exactly-once semantics
Airflow 3	Cron + DAGs + DAG versioning for data pipelines at scale
Prefect 3	Airflow ergonomics, Python-native, dynamic workflows
Dagster	Define what data should exist, not what code to run
Metaflow 2.19	Python decorators that move steps to cloud compute
ZenML	Write once, run on any orchestrator backend
Kubeflow v2	ML DAGs on Kubernetes with compile-time type checking
TFX	Kubeflow + strict typed component contracts (TF ecosystem)
Vertex AI Pipelines	Managed Kubeflow on GCP, $0.03/run
Ray 2.x	Make this Python code run on 1000 GPU cores
Spark	Process 10TB of tabular data in parallel
MLflow Recipes	Removed in MLflow 3 — do not use

The orchestration layer is where ML systems show their seams. Getting the abstraction boundaries right—scheduler vs pipeline vs compute engine—is the difference between a system that scales and one that becomes a maintenance nightmare at 10× load. The right answer is almost never “pick one tool.” It’s “understand which layer each tool owns and compose them correctly.”