Dagster + Polars

Use Polars eager or lazy DataFrames as inputs and outputs in your Dagster assets and ops.

About this integration

The Polars integration brings lightning-fast DataFrame processing to your Dagster pipelines. Polars is a high-performance DataFrame library written in Rust with Python bindings, offering both eager and lazy evaluation modes for optimal performance on large datasets.

With this integration, you can:

  • Use Polars DataFrames as inputs and outputs in Dagster assets and ops
  • Leverage lazy evaluation for memory-efficient processing of large datasets
  • Choose between eager and lazy modes using Python type annotations
  • Store data efficiently with multiple serialization formats (Parquet, Delta Lake, etc.)
  • Process data faster than equivalent pandas workflows in many analytical workloads, thanks to parallel execution and automatic query optimization

The integration supports multiple storage backends and filesystems, making it ideal for both local development and cloud-scale data processing.

Installation

pip install dagster-polars

For additional I/O managers and storage formats:

# For Parquet support
pip install dagster-polars[parquet]

# For Delta Lake support  
pip install dagster-polars[deltalake]

# For BigQuery support
pip install dagster-polars[bigquery]

# Install all extras
pip install dagster-polars[all]

High Performance

Polars is designed for speed, offering:

  • Rust-powered execution with memory safety and parallelism
  • Automatic query optimization for lazy operations
  • Columnar processing optimized for analytical workloads
  • Memory efficiency with lazy evaluation and streaming

Multiple Storage Formats

The integration provides I/O managers for various storage formats:

  • Parquet: Efficient columnar storage
  • Delta Lake: ACID transactions and versioning
  • BigQuery: Direct integration with Google's data warehouse
  • Local files: Development and testing
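One possible wiring, assuming the PolarsParquetIOManager resource shipped with dagster-polars (check the package's API reference for the exact class names and config fields in your version):

```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsParquetIOManager


@asset
def example_table() -> pl.DataFrame:
    return pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})


defs = Definitions(
    assets=[example_table],
    resources={
        # Assets without an explicit io_manager_key use the "io_manager" resource
        "io_manager": PolarsParquetIOManager(base_dir="/tmp/polars_storage"),
    },
)
```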

Example

import polars as pl
from dagster import asset

@asset(io_manager_key="polars_io_manager")
def time_series_data() -> pl.LazyFrame:
    """Load and prepare time series data."""
    return pl.scan_csv("sensor_data.csv").with_columns([
        pl.col("timestamp").str.strptime(pl.Datetime, format="%Y-%m-%d %H:%M:%S"),
        pl.col("value").cast(pl.Float64)
    ]).sort("timestamp")

@asset(io_manager_key="polars_io_manager")
def rolling_averages(time_series_data: pl.LazyFrame) -> pl.LazyFrame:
    """Calculate rolling averages (window sizes assume one reading per minute)."""
    return time_series_data.with_columns([
        pl.col("value").rolling_mean(window_size=5).alias("rolling_5m"),
        pl.col("value").rolling_mean(window_size=15).alias("rolling_15m"),
        pl.col("value").rolling_mean(window_size=60).alias("rolling_1h")
    ])

@asset
def anomaly_detection(rolling_averages: pl.LazyFrame) -> pl.DataFrame:
    """Flag points more than three rolling standard deviations from the hourly mean."""
    return rolling_averages.with_columns([
        (pl.col("value") - pl.col("rolling_1h")).abs().alias("deviation"),
        pl.col("value").rolling_std(window_size=60).alias("std_1h")
    ]).with_columns([
        (pl.col("deviation") > pl.col("std_1h") * 3).alias("is_anomaly")
    ]).filter(pl.col("is_anomaly")).collect()

About Polars

Polars is a blazingly fast DataFrame library implemented in Rust with Python bindings. It's designed from the ground up for performance, offering both eager and lazy evaluation modes, automatic query optimization, and excellent memory efficiency.

The library excels at analytical workloads and provides a modern API that's both powerful and intuitive. With its integration into Dagster, you can build high-performance data pipelines that scale from prototypes to production without sacrificing speed or reliability.