Blog
Building a Better Lakehouse: From Airflow to Dagster

Building a Better Lakehouse: From Airflow to Dagster

September 30, 2025
Building a Better Lakehouse: From Airflow to Dagster
Building a Better Lakehouse: From Airflow to Dagster

How I took an excellent lakehouse tutorial and made it even better with modern data orchestration

The Inspiration

I recently came across this fantastic article: Build a Lakehouse on a Laptop with dbt, Airflow, Trino, Iceberg, and MinIO by the team at Data Engineer Things. It’s an excellent tutorial that demonstrates how to build a complete lakehouse stack on your laptop using modern open-source tools.

After reading through their implementation, I thought: “This is great, but what if I could make it even better with Dagster?”

So I decided to implement the same lakehouse architecture using Dagster instead of Airflow, and the results were impressive. I highly encourage you to read the original article first, try their implementation, and then compare it with what I’ve built here.

Why This Comparison Matters

Both implementations use the same core lakehouse technologies:

  • MinIO for S3-compatible object storage
  • Trino as the distributed SQL query engine
  • Iceberg for ACID table format
  • Nessie for data catalog and versioning
  • dbt for analytics transformations

The key difference? The orchestration layer. This comparison highlights why choosing the right orchestrator can dramatically improve your data platform’s capabilities.

What I Improved Upon

1. Smart Partitioning with Time Windows

Original Article: Basic daily data processing without sophisticated partitioning.
My Enhancement: Dagster’s TimeWindowPartitionsDefinition with proper partition key management

from dagster import TimeWindowPartitionsDefinition
daily_partitions = TimeWindowPartitionsDefinition(
    cron_schedule="0 0 * * *",  # Daily boundary at midnight    
    start="2024-01-01",
    end="2027-01-01",  # Future-proof partitioning    
    fmt="%Y-%m-%d",
)

This enables:

  • Backfill capabilities — Process historical data efficiently
  • Incremental processing — Only process new partitions
  • Partition-aware sensors — Trigger only when new data arrives
  • Selective reruns — Reprocess specific date ranges

2. Event-Driven Architecture with Sensors

Original Article: Time-based scheduling only.
My Enhancement: Intelligent sensors that monitor MinIO for new data

@sensor(
    minimum_interval_seconds=60,
    job_name="daily_pipeline",
    required_resource_keys={"minio"}
)
def sales_sensor(context: SensorEvaluationContext):
    """Monitor MinIO for new sales files and trigger processing."""    
    minio = context.resources.minio
    # Check for new files in the last 7 days    
    for days_back in range(1, 8):
        date = datetime.now() - timedelta(days=days_back)
        date_str = date.strftime("%Y-%m-%d")
        prefix = f"raw/sales/dt={date_str}/"        
        file_count = minio.count_objects(prefix)
        if file_count > previous_count:
            yield RunRequest(
                partition_key=date_str,
                tags={"trigger": "new_data_detected"}
            )

Benefits:

  • React to data arrival — No waiting for scheduled times
  • Cursor-based tracking — Remember what’s been processed
  • Efficient resource usage — Only run when needed
  • Multiple sensors — Different logic for different data types

3. Pure SQL Lakehouse Pattern

Original Article: Uses Python for data loading.
My Enhancement: Pure Trino SQL with Hive to Iceberg CTAS pattern

@asset(
    required_resource_keys={"trino"},
    kinds={"trino", "hive", "iceberg"}
)
def product_category_data(context: AssetExecutionContext):
    """Create Iceberg table from MinIO using pure SQL."""    
    trino = context.resources.trino
    # Step 1: Create Hive external table    
    trino.execute_statement("""        CREATE TABLE hive.raw.product_category (            category_id BIGINT,            category_name VARCHAR,            created_date TIMESTAMP        ) WITH (            external_location = 's3a://lake/raw/product_category/',            format = 'PARQUET'        )    """)
    
    # Step 2: CTAS to Iceberg managed table    
    trino.execute_statement("""        CREATE TABLE iceberg.raw.product_category        WITH (format = 'PARQUET') AS        SELECT * FROM hive.raw.product_category    """)

Advantages:

  • Scalable — Leverage Trino’s distributed processing
  • No Python bottlenecks — Pure SQL operations
  • Better performance — Direct data transfer between catalogs
  • Industry standard — Follows lakehouse best practices

4. Comprehensive Data Quality Framework

Original Article: Basic dbt tests.
My Enhancement: Multi-layered data quality with Dagster asset checks

@asset_check(asset="product_category_data", required_resource_keys={"trino"})
def product_category_completeness(context: AssetExecutionContext):
    """Ensure product category data meets quality standards."""    
    trino = context.resources.trino
    result = trino.execute_query("SELECT COUNT(*) FROM iceberg.raw.product_category")
    count = result[0][0] if result else 0    passed = count >= 5  # Should have at least 5 categories    return AssetCheckResult(
        passed=passed,
        metadata={"category_count": count},
        description=f"Product categories: {count} (expected >= 5)"    )

Plus:

  • Freshness policies — SLA monitoring for data freshness
  • Asset checks — Programmatic data quality validation
  • dbt test integration — Leverage existing dbt tests
  • Lineage-aware quality — Quality checks follow data dependencies

5. Advanced Orchestration Features

Original Article: Basic DAG scheduling.
My Enhancement: Sophisticated job definitions with asset selection

# Flexible job definitions
setup_job = define_asset_job(
    name="setup_job",
    selection=AssetSelection.keys(
        ["raw", "product_category_data"],
        ["raw", "product_subcategory_data"],
        ["raw", "product_data"],
        ["raw", "territory_data"]
    ),
    description="Load dimension data from MinIO into Iceberg tables")
daily_pipeline = define_asset_job(
    name="daily_pipeline",
    selection=AssetSelection.keys(["raw", "daily_sales_data"]) | AssetSelection.keys(["curated", "product_dim_simple"]) | AssetSelection.keys(["marts", "sales_summary"]),
    partitions_def=daily_partitions,
    description="Daily: Sales data → dbt curated → dbt analytics")

Features:

  • Asset selection — Run specific subsets of your pipeline
  • Dynamic jobs — Jobs that adapt based on upstream changes
  • Partition-aware jobs — Handle time-series data elegantly
  • Resource optimization — Fine-tune resource usage per job

6. Modern Development Experience

Original Article: Traditional setup.
My Enhancement: Modern Dagster tooling with dg CLI

# Modern project structure
dg dev  # Start development server
# - Auto-reload on code changes
# - Rich web UI with lineage visualization# - Integrated logs and monitoring
# - Asset catalog with metadata

# Modern project structuredg dev  # Start development server# - Auto-reload on code changes# - Rich web UI with lineage visualization# - Integrated logs and monitoring# - Asset catalog with metadata

Developer benefits:

  • Auto-discovery — Assets automatically detected
  • Hot reloading — See changes instantly
  • Rich UI — Visual pipeline development
  • Integrated testing — Test assets individually

Architecture Comparison

Original Approach (Airflow)

Original Pipeline Diagram
Raw Data Python Processing Iceberg Tables dbt Analytics
Scheduled
Manual ETL
Basic DAG
Time-
based

My Enhanced Approach (Dagster)

Enhanced Pipeline Diagram
Raw Data Hive External Iceberg Managed dbt Analytics
Sensors
Event-
Driven
Pure SQL CTAS
Distributed
Processing
Asset Checks
Data Quality
Framework
Smart
Jobs
SLAs
Monitoring

The Technology Stack

Both implementations share the same robust foundation:

Infrastructure Layer

  • Docker Compose — Local development environment
  • MinIO — S3-compatible object storage
  • Trino — Distributed SQL query engine
  • Nessie — Data catalog with Git-like versioning
  • Iceberg — Open table format with ACID properties

Data Layer

  • Raw Zone — Parquet files in MinIO (s3://lake/raw/)
  • Curated Zone — Iceberg tables with proper schemas
  • Analytics Zone — dbt-transformed dimensional models

Orchestration Layer

  • Original: Apache Airflow
  • Enhanced: Dagster with modern features

Why This Matters for Your Organization

Operational Excellence

  • Reliability: Event-driven processing reduces missed data
  • Efficiency: Smart partitioning minimizes compute waste
  • Visibility: Rich monitoring and lineage tracking
  • Maintainability: Clear asset definitions and dependencies

Developer Productivity

  • Faster iteration: Hot reloading and integrated testing
  • Better debugging: Detailed logs and execution context
  • Easier onboarding: Self-documenting asset catalog
  • Modern tooling: Industry-standard development experience

Business Value

  • Faster time-to-insight: Data available when it arrives
  • Higher data quality: Comprehensive validation framework
  • Lower operational costs: Efficient resource utilization
  • Future-proof architecture: Built on modern data stack

Getting Started

I encourage you to:

  1. Read the original article — It’s an excellent introduction to lakehouse concepts
  2. Try their implementation — Get familiar with the core technologies
  3. Compare with the Dagster version — See the orchestration differences
  4. Experiment with both — Understand the trade-offs

Quick Start with My Implementation

# Clone and start the infrastructure
git clone https://github.com/eric-thomas-dagster/dagster-lakehouse
cd dagster-lakehouse
./start_lakehouse.sh
# Generate sample data
python generate_adventureworks_data.py
# Start Dagster
cd dagster_project
dg dev

Then visit:

Conclusion

The original lakehouse article provides an excellent foundation for understanding modern data architecture. By enhancing it with Dagster’s advanced orchestration capabilities, I’ve created a more robust, scalable, and maintainable data platform.

The key insight? Technology choice matters at every layer. While the core lakehouse technologies (MinIO, Trino, Iceberg, dbt) provide the foundation, the orchestration layer determines how effectively you can leverage them.

Dagster’s asset-centric approach, combined with its sensors, partitioning, and data quality features, creates a data platform that’s not just functional, but truly production-ready.

Try Both Approaches

I strongly recommend experiencing both implementations:

  • The original showcases excellent lakehouse fundamentals
  • My enhanced version demonstrates modern orchestration capabilities

The comparison will give you invaluable insights into how orchestration choices impact your data platform’s capabilities, maintainability, and developer experience.

Ready to build your own enhanced lakehouse? Check out my implementation at https://github.com/eric-thomas-dagster/dagster-lakehouse and let me know what you think!

Additional Resources

Have feedback or questions? Start a discussion in Slack or Github.

Interested in working with us? View our open roles.

Want more content like this? Follow us on LinkedIn.

Dagster Newsletter

Get updates delivered to your inbox

Latest writings

The latest news, technologies, and resources from our team.

Multi-Tenancy for Modern Data Platforms
Webinar

April 7, 2026

Multi-Tenancy for Modern Data Platforms

Learn the patterns, trade-offs, and production-tested strategies for building multi-tenant data platforms with Dagster.

Deep Dive: Building a Cross-Workspace Control Plane for Databricks
Webinar

March 24, 2026

Deep Dive: Building a Cross-Workspace Control Plane for Databricks

Learn how to build a cross-workspace control plane for Databricks using Dagster — connecting multiple workspaces, dbt, and Fivetran into a single observable asset graph with zero code changes to get started.

Dagster Running Dagster: How We Use Compass for AI Analytics
Webinar

February 17, 2026

Dagster Running Dagster: How We Use Compass for AI Analytics

In this Deep Dive, we're joined by Dagster Analytics Lead Anil Maharjan, who demonstrates how our internal team utilizes Compass to drive AI-driven analysis throughout the company.

Monorepos, the hub-and-spoke model, and Copybara
Monorepos, the hub-and-spoke model, and Copybara
Blog

April 3, 2026

Monorepos, the hub-and-spoke model, and Copybara

How we configure Copybara for bi-directional syncing to enable a hub-and-spoke model for Git repositories

Making Dagster Easier to Contribute to in an AI-Driven World
Making Dagster Easier to Contribute to in an AI-Driven World
Blog

April 1, 2026

Making Dagster Easier to Contribute to in an AI-Driven World

AI has made contributing to open source easier but reviewing contributions is still hard. At Dagster, we’re improving the contributor experience with smarter review tooling, clearer guidelines, and a focus on contributions that are easier to evaluate, merge, and maintain.

DataOps with Dagster: A Practical Guide to Building a Reliable Data Platform
DataOps with Dagster: A Practical Guide to Building a Reliable Data Platform
Blog

March 17, 2026

DataOps with Dagster: A Practical Guide to Building a Reliable Data Platform

DataOps is about building a system that provides visibility into what's happening and control over how it behaves

How Magenta Telekom Built the Unsinkable Data Platform
Case study

February 25, 2026

How Magenta Telekom Built the Unsinkable Data Platform

Magenta Telekom rebuilt its data infrastructure from the ground up with Dagster, cutting developer onboarding from months to a single day and eliminating the shadow IT and manual workflows that had long slowed the business down.

Scaling FinTech: How smava achieved zero downtime with Dagster
Case study

November 25, 2025

Scaling FinTech: How smava achieved zero downtime with Dagster

smava achieved zero downtime and automated the generation of over 1,000 dbt models by migrating to Dagster's, eliminating maintenance overhead and reducing developer onboarding from weeks to 15 minutes.

Zero Incidents, Maximum Velocity: How HIVED achieved 99.9% pipeline reliability with Dagster
Case study

November 18, 2025

Zero Incidents, Maximum Velocity: How HIVED achieved 99.9% pipeline reliability with Dagster

UK logistics company HIVED achieved 99.9% pipeline reliability with zero data incidents over three years by replacing cron-based workflows with Dagster's unified orchestration platform.

Modernize Your Data Platform for the Age of AI
Guide

January 15, 2026

Modernize Your Data Platform for the Age of AI

While 75% of enterprises experiment with AI, traditional data platforms are becoming the biggest bottleneck. Learn how to build a unified control plane that enables AI-driven development, reduces pipeline failures, and cuts complexity.

Download the eBook on how to scale data teams
Guide

November 5, 2025

Download the eBook on how to scale data teams

From a solo data practitioner to an enterprise-wide platform, learn how to build systems that scale with clarity, reliability, and confidence.

Download the e-book primer on how to build data platforms
Guide

February 21, 2025

Download the e-book primer on how to build data platforms

Learn the fundamental concepts to build a data platform in your organization; covering common design patterns for data ingestion and transformation, data modeling strategies, and data quality tips.