Dagster + DuckDB + PySpark

Translate between DuckDB tables and PySpark DataFrames.

About this integration

This integration lets you store PySpark DataFrames in DuckDB by building a DuckDB I/O manager that can write PySpark DataFrames to DuckDB tables and load DuckDB tables back as PySpark DataFrames.

Installation

pip install dagster-duckdb dagster-duckdb-pyspark

Example

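from dagster import asset, with_resources
from dagster_duckdb import build_duckdb_io_manager
from dagster_duckdb_pyspark import DuckDBPySparkTypeHandler
from pyspark.sql import DataFrame, SparkSession


@asset
def my_table() -> DataFrame:
    # Placeholder data for illustration; return your own PySpark DataFrame here.
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])


# Build a DuckDB I/O manager that knows how to handle PySpark DataFrames.
duckdb_io_manager = build_duckdb_io_manager([DuckDBPySparkTypeHandler()])

# Attach the I/O manager so my_table is stored as a table in my_db.duckdb.
assets = with_resources(
    [my_table],
    {"io_manager": duckdb_io_manager.configured({"database": "my_db.duckdb"})},
)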

About DuckDB and PySpark

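DuckDB is a column-oriented, embeddable OLAP database. A typical OLTP relational database such as SQLite is row-oriented: data is organised physically as consecutive tuples, so all the fields of one record sit together. A column-oriented database instead stores the values of each column together, which suits analytical queries that read a few columns across many rows.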
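
The difference between the two layouts can be sketched in plain Python. This is only an illustration of the idea, not DuckDB's or SQLite's actual storage format; the table contents and the names rows and columns are made up for the example.

```python
# Hypothetical three-row table with columns (id, city, amount).
rows = [
    (1, "Oslo", 10.0),
    (2, "Lima", 20.5),
    (3, "Oslo", 7.25),
]

# Row-oriented layout: each record's fields sit together,
# so fetching one whole record is a single lookup.
second_record = rows[1]

# Column-oriented layout: each column's values sit together.
columns = {
    "id": [r[0] for r in rows],
    "city": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

# An aggregate such as SUM(amount) reads a single list in the
# columnar layout, but has to walk every tuple in the row layout.
total_from_columns = sum(columns["amount"])
total_from_rows = sum(r[2] for r in rows)
```

Both totals are the same; what differs is how much unrelated data has to be read to compute them, which is why column-oriented engines tend to be preferred for analytical workloads.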

PySpark is the Python API for Apache Spark, a distributed framework and set of libraries for real-time, large-scale data processing. PySpark allows you to create more scalable analyses and data pipelines.