Dagster + Databricks

Launch a Databricks job as a Dagster op.

About this integration

Looking to use Dagster as the orchestration layer for your Databricks analytics or AI workloads? The dagster-databricks package lets you:

  • Execute an op within a Databricks context on a cluster, such that the pyspark resource uses the cluster’s Spark instance.
  • Create an op that submits a configurable job to Databricks using the ‘Run Now’ API (a minimal sketch follows this list).
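
Here is a minimal sketch of the second path using create_databricks_run_now_op, which triggers an existing Databricks job by ID. The job ID (1234), the parameter value, and the credentials are placeholders, and the "databricks" resource key is assumed to be the op's default; adapt these to your workspace:

from dagster import job
from dagster_databricks import DatabricksClientResource, create_databricks_run_now_op

# Trigger an existing Databricks job by ID (1234 is a placeholder) and
# override its Python parameters for this run.
run_sparkpi = create_databricks_run_now_op(
    databricks_job_id=1234,
    databricks_job_configuration={"python_params": ["10"]},
)

@job(
    resource_defs={
        # create_databricks_run_now_op reads the client from the
        # "databricks" resource key by default.
        "databricks": DatabricksClientResource(
            host="my.workspace.url",
            token="my.access.token",
        )
    }
)
def run_now_job():
    run_sparkpi()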

Installation

pip install dagster-databricks

Example

from dagster import job
from dagster_databricks import create_databricks_job_op, DatabricksClientResource

# Configure an op that submits a one-off Spark run to Databricks.
# The "job" config mirrors the payload of the Databricks Runs Submit API.
sparkpi = create_databricks_job_op().configured(
    {
        "job": {
            "run_name": "SparkPi Python job",
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {"python_file": "dbfs:/docs/pi.py", "parameters": ["10"]},
        }
    },
    name="sparkpi",
)

# The op expects a Databricks client under the "databricks_client" resource key.
@job(
    resource_defs={
        "databricks_client": DatabricksClientResource(
            host="my.workspace.url",
            token="my.access.token",
        )
    }
)
def do_stuff():
    sparkpi()
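
Hard-coding the host and token, as above, is for illustration only. A safer sketch reads them from environment variables with Dagster's EnvVar, assuming DATABRICKS_HOST and DATABRICKS_TOKEN are set in the deployment's environment:

from dagster import EnvVar
from dagster_databricks import DatabricksClientResource

# Pull the workspace URL and access token from the environment at runtime,
# rather than embedding secrets in source code.
databricks_client = DatabricksClientResource(
    host=EnvVar("DATABRICKS_HOST"),
    token=EnvVar("DATABRICKS_TOKEN"),
)

Because EnvVar defers resolution to runtime, the token never needs to live in source control.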

About Databricks

Founded by the creators of Apache Spark, Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.