Dagster + Polars

Use Polars eager or lazy DataFrames as inputs and outputs in your Dagster assets and ops.

About this integration

The Polars integration brings lightning-fast DataFrame processing to your Dagster pipelines. Polars is a high-performance DataFrame library written in Rust with Python bindings, offering both eager and lazy evaluation modes for optimal performance on large datasets.

With this integration, you can:

  • Use Polars DataFrames as inputs and outputs in Dagster assets and ops
  • Leverage lazy evaluation for memory-efficient processing of large datasets
  • Choose between eager and lazy modes using Python type annotations
  • Store data efficiently with multiple serialization formats (Parquet, Delta Lake, etc.)
  • Process data faster than equivalent pandas workflows in many analytical workloads, thanks to parallel execution and automatic query optimization

The integration supports multiple storage backends and filesystems, making it ideal for both local development and cloud-scale data processing.

Installation

pip install dagster-polars

For additional I/O managers and storage formats:

# For Parquet support
pip install dagster-polars[parquet]

# For Delta Lake support  
pip install dagster-polars[deltalake]

# For BigQuery support
pip install dagster-polars[bigquery]

# Install all extras
pip install dagster-polars[all]

High Performance

Polars is designed for speed, offering:

  • Rust-powered execution with memory safety and parallelism
  • Automatic query optimization for lazy operations
  • Columnar processing optimized for analytical workloads
  • Memory efficiency with lazy evaluation and streaming

Multiple Storage Formats

The integration provides I/O managers for various storage formats:

  • Parquet: Efficient columnar storage
  • Delta Lake: ACID transactions and versioning
  • BigQuery: Direct integration with Google's data warehouse
  • Local files: Development and testing
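One possible wiring, assuming the PolarsParquetIOManager resource shipped with dagster-polars (check the package's API reference for the exact class names and config fields in your version):

```python
import polars as pl
from dagster import Definitions, asset
from dagster_polars import PolarsParquetIOManager


@asset
def example_table() -> pl.DataFrame:
    return pl.DataFrame({"id": [1, 2], "name": ["a", "b"]})


defs = Definitions(
    assets=[example_table],
    resources={
        # Assets without an explicit io_manager_key use the "io_manager" resource
        "io_manager": PolarsParquetIOManager(base_dir="/tmp/polars_storage"),
    },
)
```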

Example

import polars as pl
from dagster import asset

@asset(io_manager_key="polars_io_manager")
def time_series_data() -> pl.LazyFrame:
    """Load and prepare time series data."""
    return pl.scan_csv("sensor_data.csv").with_columns([
        pl.col("timestamp").str.strptime(pl.Datetime, format="%Y-%m-%d %H:%M:%S"),
        pl.col("value").cast(pl.Float64)
    ]).sort("timestamp")

@asset(io_manager_key="polars_io_manager")
def rolling_averages(time_series_data: pl.LazyFrame) -> pl.LazyFrame:
    """Calculate rolling averages (window sizes assume one reading per minute)."""
    return time_series_data.with_columns([
        pl.col("value").rolling_mean(window_size=5).alias("rolling_5m"),
        pl.col("value").rolling_mean(window_size=15).alias("rolling_15m"),
        pl.col("value").rolling_mean(window_size=60).alias("rolling_1h")
    ])

@asset
def anomaly_detection(rolling_averages: pl.LazyFrame) -> pl.DataFrame:
    """Flag points more than three rolling standard deviations from the hourly mean."""
    return rolling_averages.with_columns([
        (pl.col("value") - pl.col("rolling_1h")).abs().alias("deviation"),
        pl.col("value").rolling_std(window_size=60).alias("std_1h")
    ]).with_columns([
        (pl.col("deviation") > pl.col("std_1h") * 3).alias("is_anomaly")
    ]).filter(pl.col("is_anomaly")).collect()

About Polars

Polars is a blazingly fast DataFrame library implemented in Rust with Python bindings. It's designed from the ground up for performance, offering both eager and lazy evaluation modes, automatic query optimization, and excellent memory efficiency.

The library excels at analytical workloads and provides a modern API that's both powerful and intuitive. With its integration into Dagster, you can build high-performance data pipelines that scale from prototypes to production without sacrificing speed or reliability.