Spark | Dagster Integrations
Trigger Spark jobs

Configure and run Spark jobs.

About this integration

Spark jobs typically execute on infrastructure that's specialized for Spark; Spark applications are usually not containerized or run on Kubernetes. Instead, running Spark code often means submitting it to a Databricks or EMR cluster.
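
As a rough illustration of that pattern, the sketch below shows a Dagster op that submits a PySpark script to an existing EMR cluster using boto3. This is not taken from the Dagster docs: the cluster ID, S3 script path, and op/job names are hypothetical placeholders, and the step configuration is just one common way to submit work to EMR.

```python
# Minimal sketch, assuming an already-running EMR cluster and a PySpark script on S3.
import boto3
from dagster import Config, job, op


class EmrStepConfig(Config):
    cluster_id: str       # e.g. "j-XXXXXXXXXXXX" (hypothetical)
    script_s3_path: str   # e.g. "s3://my-bucket/jobs/transform.py" (hypothetical)


@op
def submit_spark_step(context, config: EmrStepConfig) -> str:
    """Submit a spark-submit step to EMR and return the step ID."""
    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=config.cluster_id,
        Steps=[
            {
                "Name": "dagster-triggered-spark-job",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", "--deploy-mode", "cluster", config.script_s3_path],
                },
            }
        ],
    )
    step_id = response["StepIds"][0]
    context.log.info(f"Submitted EMR step {step_id}")
    return step_id


@job
def spark_submit_job():
    submit_spark_step()
```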

There are two approaches to writing Dagster ops that invoke Spark computations:

dagster-pyspark provides a Spark class with methods for configuring a Spark job and constructing its spark-submit command.
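
To make the spark-submit approach concrete, here is a hand-rolled sketch of an op that assembles and runs a spark-submit command with subprocess. It deliberately does not use the dagster-pyspark Spark class itself; the master URL, Spark configuration, and application path are placeholder assumptions for illustration only.

```python
# Minimal sketch of the spark-submit pattern, assuming spark-submit is on the PATH.
import subprocess
from dagster import job, op


@op
def run_spark_submit(context) -> None:
    command = [
        "spark-submit",
        "--master", "yarn",                    # where to run the application
        "--deploy-mode", "cluster",            # run the driver on the cluster
        "--conf", "spark.executor.memory=4g",  # example Spark configuration
        "s3://my-bucket/jobs/transform.py",    # application code (hypothetical path)
    ]
    context.log.info("Running: %s" % " ".join(command))
    subprocess.run(command, check=True)        # raise if spark-submit fails


@job
def spark_shell_job():
    run_spark_submit()
```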

About Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. It also provides libraries for graph computation, SQL-based structured data processing, machine learning, and data science.
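
For readers new to Spark, the short PySpark snippet below (not taken from the integration docs) illustrates the DataFrame interface mentioned above; the data and application name are arbitrary.

```python
# Minimal, self-contained PySpark example; run locally after `pip install pyspark`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Build a small DataFrame and run a distributed aggregation on it.
df = spark.createDataFrame(
    [("spark", 10), ("dagster", 5), ("spark", 3)],
    ["word", "count"],
)
df.groupBy("word").sum("count").show()

spark.stop()
```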