Dagster + DuckDB + PySpark

About this integration

This integration lets you store PySpark DataFrames in DuckDB. It provides a DuckDB I/O manager that writes PySpark DataFrames returned by your assets to DuckDB tables and loads those tables back as PySpark DataFrames.

Installation

pip install dagster-duckdb dagster-duckdb-pyspark

Example

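from dagster import Definitions, asset
from dagster_duckdb_pyspark import DuckDBPySparkIOManager
from pyspark.sql import DataFrame, SparkSession

@asset
def my_table() -> DataFrame:
    # Placeholder data; return your own PySpark DataFrame here.
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# The I/O manager stores the DataFrame returned by my_table in my_db.duckdb.
defs = Definitions(
    assets=[my_table],
    resources={"io_manager": DuckDBPySparkIOManager(database="my_db.duckdb")},
)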

About DuckDB and PySpark

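DuckDB is a column-oriented, embeddable OLAP database. A typical OLTP relational database like SQLite is row-oriented. In a row-oriented database, data is organised physically as consecutive tuples, so all the fields of one record sit together. A column-oriented database instead stores the values of each column together, which suits analytical queries that read a few columns across many rows.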
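
The difference between the two layouts can be sketched in plain Python. This is a toy illustration with made-up records, not a description of how DuckDB or SQLite store data internally. The names rows, columns, and the two helper functions are hypothetical.

```python
# Row-oriented: each record is one tuple, and records are stored one after another.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]

# Column-oriented: the same records, with the values of each column stored together.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [30.0, 12.5, 7.5],
}


def total_amount_row_store(records):
    # Visits every whole record just to pick out one field.
    return sum(record[2] for record in records)


def total_amount_column_store(cols):
    # Reads only the one column the query needs.
    return sum(cols["amount"])
```

Both functions compute the same total, but the column layout only touches the amount values, which is why it tends to suit aggregate-heavy analytical workloads.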

PySpark is the Python API for Apache Spark, a distributed framework and set of libraries for real-time, large-scale data processing. PySpark lets you write analyses and data pipelines in Python that scale from a single machine to a cluster.