About this integration
Looking to use Dagster as the orchestration layer for your Databricks analytics or AI workloads? The dagster-databricks
package lets you:
- Execute an op within a Databricks context on a cluster, such that the pyspark resource uses the cluster's Spark instance (see the step launcher sketch below).
- Create an op that submits an external, configurable job to Databricks using the 'Run Now' API.
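For the first case, the package provides the databricks_pyspark_step_launcher resource, which ships an op's compute off to a Databricks cluster while the pyspark resource (from the dagster-pyspark package) wraps the cluster's Spark session. The following is a minimal sketch, not a verified configuration: the cluster settings, environment variable names, and package path are placeholders, and depending on your dagster-databricks version additional fields (for example storage or secrets_to_env_variables) may be required.

from pathlib import Path

from dagster import job, op
from dagster_databricks import databricks_pyspark_step_launcher
from dagster_pyspark import pyspark_resource

# Placeholder cluster and credential settings -- replace with your own.
step_launcher = databricks_pyspark_step_launcher.configured(
    {
        "run_config": {
            "run_name": "dagster-step",
            "cluster": {
                "new": {
                    "size": {"num_workers": 1},
                    "spark_version": "7.3.x-scala2.12",
                    "nodes": {"node_types": {"node_type_id": "i3.xlarge"}},
                }
            },
        },
        "databricks_host": {"env": "DATABRICKS_HOST"},
        "databricks_token": {"env": "DATABRICKS_TOKEN"},
        # Root of the local package containing this job, staged to the cluster.
        "local_pipeline_package_path": str(Path(__file__).parent),
    }
)


@op(required_resource_keys={"pyspark", "pyspark_step_launcher"})
def count_rows(context) -> int:
    # This body executes on the Databricks cluster; the pyspark resource
    # exposes the cluster's Spark session.
    return context.resources.pyspark.spark_session.range(1000).count()


@job(
    resource_defs={
        "pyspark": pyspark_resource,
        "pyspark_step_launcher": step_launcher,
    }
)
def databricks_pyspark_job():
    count_rows()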
Installation
pip install dagster-databricks
Example
from dagster import job
from dagster_databricks import create_databricks_job_op, databricks_client

# Configure an op that submits a one-time Databricks run: a Python task
# approximating pi on a new two-worker cluster.
sparkpi = create_databricks_job_op().configured(
    {
        "job": {
            "run_name": "SparkPi Python job",
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {"python_file": "dbfs:/docs/pi.py", "parameters": ["10"]},
        }
    },
    name="sparkpi",
)


# Placeholder workspace URL and access token -- replace with your own.
@job(
    resource_defs={
        "databricks_client": databricks_client.configured(
            {"host": "my.workspace.url", "token": "my.access.token"}
        )
    }
)
def do_stuff():
    sparkpi()
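To try the example without a long-running Dagster deployment, you can execute the job in process; the host and token above are placeholders, so point them at a real workspace first.

if __name__ == "__main__":
    # Submits the run to Databricks and polls until it completes.
    result = do_stuff.execute_in_process()
    assert result.success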
About Databricks
Founded by the creators of Apache Spark, Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks.