About this integration
This integration allows you to easily store PySpark DataFrames in DuckDB by building a DuckDB I/O manager that can store and load PySpark DataFrames.
Installation
pip install dagster-duckdb dagster-duckdb-pyspark
Example
from dagster import asset, with_resources
from dagster_duckdb import build_duckdb_io_manager
from dagster_duckdb_pyspark import DuckDBPySparkTypeHandler
from pyspark.sql import DataFrame, SparkSession

@asset
def my_table() -> DataFrame:
    # Build a small example DataFrame; the I/O manager persists it to DuckDB.
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Register the PySpark type handler so the I/O manager knows how to
# translate between PySpark DataFrames and DuckDB tables.
duckdb_io_manager = build_duckdb_io_manager([DuckDBPySparkTypeHandler()])

assets = with_resources(
    [my_table],
    {"io_manager": duckdb_io_manager.configured({"database": "my_db.duckdb"})},
)
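After materializing my_table, the DataFrame is persisted as a table in my_db.duckdb. As a rough sketch (assuming the I/O manager's default public schema and a table named after the asset), you can inspect the stored data with the duckdb client directly:

import duckdb

# Open the same database file the I/O manager was configured with.
conn = duckdb.connect("my_db.duckdb")
# Assumes the default "public" schema and a table named after the asset.
print(conn.execute("SELECT * FROM public.my_table").fetchall())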
About DuckDB and PySpark
DuckDB is a column-oriented, embeddable OLAP database. A typical OLTP relational database like SQLite is row-oriented: in a row-oriented database, data is physically organized as consecutive tuples, which favors transactional reads and writes of whole rows, whereas column-oriented storage favors analytical scans and aggregations over a few columns.
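As an illustrative sketch (the table and column names here are made up for the example), DuckDB can be used as an embedded analytical engine directly from Python:

import duckdb

# An in-memory DuckDB database; columnar storage makes aggregations
# over a handful of columns fast.
conn = duckdb.connect()
conn.execute("CREATE TABLE t AS SELECT range AS id, range % 10 AS grp FROM range(1000)")
print(conn.execute("SELECT grp, COUNT(*) AS n FROM t GROUP BY grp ORDER BY grp").fetchall())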
PySpark is the Python API for Apache Spark, a distributed framework and set of libraries for real-time, large-scale data processing. PySpark allows you to scale analyses and data pipelines across a cluster.
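As a minimal sketch of the API (the example data is made up), creating and transforming a PySpark DataFrame looks like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "value"])
# Transformations like groupBy are executed across the cluster.
df.groupBy("value").count().show()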