About this integration
Spark jobs typically execute on infrastructure that's specialized for Spark. Spark applications are typically not containerized or executed on Kubernetes. Running Spark code often requires submitting code to a Databricks or EMR cluster.
There are two approaches to writing Dagster ops that invoke Spark computations. One is to use dagster-pyspark, which provides a Spark class with methods for configuration and for constructing the spark-submit command for a Spark job.
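The sketch below shows one way an op can invoke a Spark computation. It deliberately uses only core Dagster (`@op`, `@job`) and plain PySpark rather than the dagster-pyspark helpers, so the local master URL, app name, and sample data are illustrative assumptions rather than part of the integration's API.

```python
from dagster import job, op
from pyspark.sql import SparkSession


@op
def count_people_over_30() -> int:
    # Build (or reuse) a local Spark session; in practice this would point at a cluster.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("dagster_spark_example")
        .getOrCreate()
    )

    # A small in-memory DataFrame standing in for real input data.
    people = spark.createDataFrame(
        [("Alice", 34), ("Bob", 28), ("Cara", 41)], ["name", "age"]
    )

    # The Spark computation itself: filter and count.
    return people.filter(people.age > 30).count()


@job
def spark_example_job():
    count_people_over_30()
```

Launching spark_example_job runs the op in-process; pointing the session at a real cluster changes only where the computation executes, not the structure of the op.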
About Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. It also provides libraries for SQL and structured data processing, graph computation, machine learning, and data science.
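As a minimal, standalone illustration of the DataFrame and SQL APIs described above, the following sketch registers an in-memory table and queries it with Spark SQL; the session settings, data, and query are illustrative assumptions.

```python
from pyspark.sql import SparkSession

# Start a local Spark session for demonstration purposes.
spark = (
    SparkSession.builder.master("local[*]")
    .appName("spark_sql_example")
    .getOrCreate()
)

# Register a small in-memory DataFrame as a temporary SQL view.
events = spark.createDataFrame(
    [("click", 3), ("view", 10), ("click", 7)], ["event_type", "n"]
)
events.createOrReplaceTempView("events")

# Run a SQL query against the view; Spark parallelizes the aggregation.
totals = spark.sql(
    "SELECT event_type, SUM(n) AS total FROM events GROUP BY event_type"
)
totals.show()

spark.stop()
```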