About this integration
This integration lets you store PySpark DataFrames in DuckDB. It provides a DuckDB I/O manager that writes PySpark DataFrames to DuckDB tables and loads those tables back as PySpark DataFrames.
Installation
pip install dagster-duckdb dagster-duckdb-pyspark
Example
from dagster import Definitions, asset
from dagster_duckdb_pyspark import DuckDBPySparkIOManager

@asset
def my_table():
    # some_pyspark_dataframe is a placeholder for a PySpark DataFrame built here.
    return some_pyspark_dataframe

defs = Definitions(
    assets=[my_table],
    resources={"io_manager": DuckDBPySparkIOManager(database="my_db.duckdb")},
)
About DuckDB and PySpark
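DuckDB is a column-oriented, embeddable OLAP database. A typical OLTP relational database like SQLite is row-oriented: data is organised physically as consecutive tuples, one per row. A column-oriented database instead stores the values of each column together, which suits analytical queries that scan a few columns across many rows.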
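The difference between the two layouts can be sketched in plain Python. This is only an illustration of the idea, not how DuckDB or SQLite store data internally; the sample records and the helper names (`rows_to_columns`, `total_amount_from_rows`, `total_amount_from_columns`) are hypothetical.

```python
# Row-oriented layout: each record is kept together as one tuple.
rows = [
    (1, "alice", 30.0),
    (2, "bob", 45.5),
    (3, "carol", 12.25),
]

# Column-oriented layout: the values of each column are kept together.
columns = {
    "id": [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "amount": [30.0, 45.5, 12.25],
}


def rows_to_columns(rows, names):
    """Convert a list of row tuples into a dict of per-column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def total_amount_from_rows(rows):
    """Sum the third field; every full row is visited to reach it."""
    return sum(row[2] for row in rows)


def total_amount_from_columns(columns):
    """Sum the 'amount' column; no other column is touched."""
    return sum(columns["amount"])


print(total_amount_from_rows(rows))
print(total_amount_from_columns(columns))
```

Both functions return the same total, but the column-oriented version reads only the one list it needs, which is the access pattern analytical (OLAP) queries tend to favour.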
PySpark is the Python API for Apache Spark, a distributed framework and set of libraries for real-time, large-scale data processing. PySpark lets you write analyses and data pipelines in Python that scale across a cluster.