Dagster + SLURM

Bridge high-performance computing with modern data orchestration. Run Dagster assets seamlessly across laptops, CI pipelines, and supercomputers with full observability.

About this integration

Adding SLURM to your Dagster pipelines brings a modern developer experience to high-performance computing. With dagster-slurm, you can:

- Run the same Dagster assets on your laptop, CI pipeline, or Tier-0 supercomputers without code changes

- Get full observability of SLURM jobs directly in the Dagster UI with real-time logs and metrics

- Package environments automatically using Pixi for reproducible, portable execution

- Submit batch jobs through sbatch while maintaining Dagster's lineage tracking and retries

- Choose from Bash, Ray, or custom launchers for flexible execution

This community-maintained integration uses Dagster Pipes to connect Dagster's orchestration with SLURM's powerful scheduler, delivering the ergonomics of modern data platforms on HPC systems.
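
To make the Pipes connection concrete, here is a minimal sketch of what a payload script might look like. It uses the standard dagster-pipes API (open_dagster_pipes); the training logic is a placeholder, and dagster-slurm may handle parts of the Pipes bootstrap for you, so treat this as illustrative rather than the integration's required pattern.

from dagster_pipes import open_dagster_pipes

def train() -> dict:
    # Placeholder for the actual training loop.
    return {"loss": 0.12}

if __name__ == "__main__":
    with open_dagster_pipes() as pipes:
        pipes.log.info("Training started on the allocated nodes")
        metrics = train()
        # Metadata reported here shows up on the asset materialization in the Dagster UI.
        pipes.report_asset_materialization(metadata={"loss": metrics["loss"]})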

Installation

uv add dagster-slurm

Or with pip:

pip install dagster-slurm

Example

import dagster as dg
from dagster_slurm import ComputeResource, RayLauncher

@dg.asset
def training_job(context: dg.AssetExecutionContext, compute: ComputeResource):
    # Run the payload script through the compute resource, declaring how to
    # launch it (Ray) and what resources it needs.
    completed = compute.run(
        context=context,
        payload_path="workloads/train.py",
        launcher=RayLauncher(num_gpus_per_node=2),
        resource_requirements={
            "framework": "ray",
            "cpus": 32,
            "gpus": 2,
            "memory_gb": 120,
        },
        extra_env={"EXP_NAME": context.run.run_id},
    )
    # Yield the materializations and metadata the payload reported back over Pipes.
    yield from completed.get_results()

Switch the ComputeResource from ExecutionMode.LOCAL to ExecutionMode.SLURM and the same asset submits via sbatch, runs on the cluster, and streams logs back to your Dagster UI.
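
A deployment-time switch might look like the sketch below. It assumes ExecutionMode can be imported from dagster_slurm and that ComputeResource accepts the mode as a constructor argument; the DAGSTER_ENV variable and the asset's import path are hypothetical conventions, so check the package documentation for the actual resource configuration.

import os

import dagster as dg
from dagster_slurm import ComputeResource, ExecutionMode

from my_project.assets import training_job  # hypothetical module holding the asset above

# Pick the execution mode from the deployment environment; the asset code is unchanged.
mode = ExecutionMode.SLURM if os.getenv("DAGSTER_ENV") == "hpc" else ExecutionMode.LOCAL

defs = dg.Definitions(
    assets=[training_job],
    resources={"compute": ComputeResource(mode=mode)},  # constructor arguments are illustrative
)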

About SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source workload manager and job scheduler designed for Linux clusters. It's used by many of the world's supercomputing centers and research facilities to manage compute resources, handle job queues, and coordinate parallel processing across thousands of nodes.