DuckDB + Pandas | Dagster Integrations
Dagster + DuckDB + Pandas

Translate between DuckDB tables and Pandas DataFrames.

About this integration

This library integrates Dagster with the DuckDB database and the Pandas data processing library. It provides a DuckDB I/O manager that stores Pandas DataFrames as DuckDB tables and loads those tables back as DataFrames.

Installation

pip install dagster-duckdb dagster-duckdb-pandas

Example

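import pandas as pd
from dagster import Definitions, asset
from dagster_duckdb_pandas import DuckDBPandasIOManager


@asset
def my_table() -> pd.DataFrame:
    # The returned DataFrame is stored by the I/O manager as a DuckDB table.
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


defs = Definitions(
    assets=[my_table],
    # Register the DuckDB I/O manager under the default "io_manager" key.
    resources={"io_manager": DuckDBPandasIOManager(database="my_db.duckdb")},
)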

We published a more comprehensive tutorial on using Dagster + DuckDB + Pandas in "Building an Outbound Reporting Pipeline".

About DuckDB and Pandas

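DuckDB is a column-oriented, embeddable OLAP database. A typical OLTP relational database such as SQLite is row-oriented: data is organised physically as consecutive tuples (rows). A column-oriented database instead stores the values of each column together, which suits analytical queries that scan a few columns across many rows.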
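
To make the row-oriented versus column-oriented distinction concrete, here is a toy sketch in plain Python. It only illustrates the ordering of values and is not how DuckDB or SQLite lay out data internally; the sample records are made up for illustration.

```python
# Three made-up records: (id, name, score).
records = [(1, "alice", 9.5), (2, "bob", 7.0), (3, "carol", 8.0)]

# Row-oriented: each tuple's values sit next to each other.
row_layout = [value for record in records for value in record]

# Column-oriented: all values of one column sit next to each other.
ids, names, scores = (list(column) for column in zip(*records))
column_layout = ids + names + scores

# Averaging one column from the row layout means striding past the
# other fields of every record...
avg_from_rows = sum(row_layout[2::3]) / len(records)

# ...while the column layout only needs to read one contiguous run.
avg_from_columns = sum(scores) / len(scores)
```

Both averages are the same; the difference is how much unrelated data has to be skipped to compute them, which is why analytical workloads tend to favour the column layout.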

Pandas is a very popular Python package that provides data structures designed to make working with “relational” or “labeled” data both easy and intuitive. Pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python.

DuckDB can efficiently run SQL queries directly on Pandas DataFrames.