Building a Better Lakehouse: From Airflow to Dagster

September 30, 2025
How I took an excellent lakehouse tutorial and made it even better with modern data orchestration

The Inspiration

I recently came across this fantastic article: Build a Lakehouse on a Laptop with dbt, Airflow, Trino, Iceberg, and MinIO by the team at Data Engineer Things. It’s an excellent tutorial that demonstrates how to build a complete lakehouse stack on your laptop using modern open-source tools.

After reading through their implementation, I thought: “This is great, but what if I could make it even better with Dagster?”

So I decided to implement the same lakehouse architecture using Dagster instead of Airflow, and the results were impressive. I highly encourage you to read the original article first, try their implementation, and then compare it with what I’ve built here.

Why This Comparison Matters

Both implementations use the same core lakehouse technologies:

  • MinIO for S3-compatible object storage
  • Trino as the distributed SQL query engine
  • Iceberg for ACID table format
  • Nessie for data catalog and versioning
  • dbt for analytics transformations

The key difference? The orchestration layer. This comparison highlights why choosing the right orchestrator can dramatically improve your data platform’s capabilities.

What I Improved Upon

1. Smart Partitioning with Time Windows

Original Article: Basic daily data processing without sophisticated partitioning.
My Enhancement: Dagster’s TimeWindowPartitionsDefinition with proper partition key management

from dagster import TimeWindowPartitionsDefinition

daily_partitions = TimeWindowPartitionsDefinition(
    cron_schedule="0 0 * * *",  # Daily boundary at midnight
    start="2024-01-01",
    end="2027-01-01",  # Future-proof partitioning
    fmt="%Y-%m-%d",
)

This enables:

  • Backfill capabilities — Process historical data efficiently
  • Incremental processing — Only process new partitions
  • Partition-aware sensors — Trigger only when new data arrives
  • Selective reruns — Reprocess specific date ranges
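
To make the partition wiring concrete, here is a minimal sketch of a partitioned asset that reads its date slice from the run's partition key. The asset name matches the project's daily sales asset, but the function body is illustrative rather than taken from the project:

from dagster import AssetExecutionContext, asset

@asset(partitions_def=daily_partitions)
def daily_sales_data(context: AssetExecutionContext) -> None:
    # Each run materializes exactly one calendar day, so backfills and
    # selective reruns are just a range of partition keys in the UI or CLI.
    date_str = context.partition_key  # e.g. "2024-01-15"
    context.log.info(f"Processing sales files under raw/sales/dt={date_str}/")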

2. Event-Driven Architecture with Sensors

Original Article: Time-based scheduling only.
My Enhancement: Intelligent sensors that monitor MinIO for new data

import json
from datetime import datetime, timedelta

from dagster import RunRequest, SensorEvaluationContext, sensor

@sensor(
    minimum_interval_seconds=60,
    job_name="daily_pipeline",
    required_resource_keys={"minio"}
)
def sales_sensor(context: SensorEvaluationContext):
    """Monitor MinIO for new sales files and trigger processing."""
    minio = context.resources.minio
    # Cursor remembers the last-seen file count per partition date
    seen_counts = json.loads(context.cursor) if context.cursor else {}

    # Check for new files in the last 7 days
    for days_back in range(1, 8):
        date = datetime.now() - timedelta(days=days_back)
        date_str = date.strftime("%Y-%m-%d")
        prefix = f"raw/sales/dt={date_str}/"
        file_count = minio.count_objects(prefix)
        if file_count > seen_counts.get(date_str, 0):
            seen_counts[date_str] = file_count
            yield RunRequest(
                partition_key=date_str,
                tags={"trigger": "new_data_detected"}
            )

    context.update_cursor(json.dumps(seen_counts))

Benefits:

  • React to data arrival — No waiting for scheduled times
  • Cursor-based tracking — Remember what’s been processed
  • Efficient resource usage — Only run when needed
  • Multiple sensors — Different logic for different data types

3. Pure SQL Lakehouse Pattern

Original Article: Uses Python for data loading.
My Enhancement: Pure Trino SQL with Hive to Iceberg CTAS pattern

from dagster import AssetExecutionContext, asset

@asset(
    required_resource_keys={"trino"},
    kinds={"trino", "hive", "iceberg"}
)
def product_category_data(context: AssetExecutionContext):
    """Create Iceberg table from MinIO using pure SQL."""
    trino = context.resources.trino

    # Step 1: Create Hive external table over the raw Parquet files in MinIO
    trino.execute_statement("""
        CREATE TABLE hive.raw.product_category (
            category_id BIGINT,
            category_name VARCHAR,
            created_date TIMESTAMP
        ) WITH (
            external_location = 's3a://lake/raw/product_category/',
            format = 'PARQUET'
        )
    """)

    # Step 2: CTAS to Iceberg managed table
    trino.execute_statement("""
        CREATE TABLE iceberg.raw.product_category
        WITH (format = 'PARQUET') AS
        SELECT * FROM hive.raw.product_category
    """)

Advantages:

  • Scalable — Leverage Trino’s distributed processing
  • No Python bottlenecks — Pure SQL operations
  • Better performance — Direct data transfer between catalogs
  • Industry standard — Follows lakehouse best practices

4. Comprehensive Data Quality Framework

Original Article: Basic dbt tests.
My Enhancement: Multi-layered data quality with Dagster asset checks

from dagster import AssetCheckResult, AssetExecutionContext, asset_check

@asset_check(asset="product_category_data", required_resource_keys={"trino"})
def product_category_completeness(context: AssetExecutionContext):
    """Ensure product category data meets quality standards."""
    trino = context.resources.trino
    result = trino.execute_query("SELECT COUNT(*) FROM iceberg.raw.product_category")
    count = result[0][0] if result else 0
    passed = count >= 5  # Should have at least 5 categories
    return AssetCheckResult(
        passed=passed,
        metadata={"category_count": count},
        description=f"Product categories: {count} (expected >= 5)"
    )

Plus:

  • Freshness policies — SLA monitoring for data freshness
  • Asset checks — Programmatic data quality validation
  • dbt test integration — Leverage existing dbt tests
  • Lineage-aware quality — Quality checks follow data dependencies
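
To illustrate the freshness bullet, here is a minimal sketch using Dagster's built-in freshness checks. It assumes the daily_sales_data asset from the partitioning section, and the 24-hour window is an illustrative choice rather than the project's actual SLA:

from datetime import timedelta

from dagster import build_last_update_freshness_checks

# Flag daily_sales_data if it has not been updated within the last 24 hours.
sales_freshness_checks = build_last_update_freshness_checks(
    assets=[daily_sales_data],
    lower_bound_delta=timedelta(hours=24),
)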

5. Advanced Orchestration Features

Original Article: Basic DAG scheduling.
My Enhancement: Sophisticated job definitions with asset selection

from dagster import AssetSelection, define_asset_job

# Flexible job definitions
setup_job = define_asset_job(
    name="setup_job",
    selection=AssetSelection.keys(
        ["raw", "product_category_data"],
        ["raw", "product_subcategory_data"],
        ["raw", "product_data"],
        ["raw", "territory_data"],
    ),
    description="Load dimension data from MinIO into Iceberg tables",
)

daily_pipeline = define_asset_job(
    name="daily_pipeline",
    selection=(
        AssetSelection.keys(["raw", "daily_sales_data"])
        | AssetSelection.keys(["curated", "product_dim_simple"])
        | AssetSelection.keys(["marts", "sales_summary"])
    ),
    partitions_def=daily_partitions,
    description="Daily: Sales data → dbt curated → dbt analytics",
)

Features:

  • Asset selection — Run specific subsets of your pipeline
  • Dynamic jobs — Jobs that adapt based on upstream changes
  • Partition-aware jobs — Handle time-series data elegantly
  • Resource optimization — Fine-tune resource usage per job
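
Since daily_pipeline selects dbt-built assets (the curated and marts keys), it may help to see how dbt models typically surface in Dagster's asset graph. This is a generic dagster-dbt sketch, not the project's actual code; the manifest path and asset-function name are placeholders:

from dagster import AssetExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets

@dbt_assets(manifest="analytics/target/manifest.json")  # placeholder path
def lakehouse_dbt_assets(context: AssetExecutionContext, dbt: DbtCliResource):
    # Runs `dbt build` and streams each model/test result back as asset events,
    # so dbt models become selectable assets in jobs like daily_pipeline.
    yield from dbt.cli(["build"], context=context).stream()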

6. Modern Development Experience

Original Article: Traditional setup.
My Enhancement: Modern Dagster tooling with dg CLI

# Modern project structure
dg dev  # Start development server
# - Auto-reload on code changes
# - Rich web UI with lineage visualization
# - Integrated logs and monitoring
# - Asset catalog with metadata

Developer benefits:

  • Auto-discovery — Assets automatically detected
  • Hot reloading — See changes instantly
  • Rich UI — Visual pipeline development
  • Integrated testing — Test assets individually
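
As a sketch of the "integrated testing" point, the snippet below materializes product_category_data against a stub resource in a pytest-style test. FakeTrino is a hypothetical stand-in for the project's real Trino resource:

from dagster import materialize

class FakeTrino:
    """Hypothetical stub that records SQL instead of sending it to Trino."""
    def __init__(self):
        self.statements = []

    def execute_statement(self, sql):
        self.statements.append(sql)

def test_product_category_data():
    fake_trino = FakeTrino()
    result = materialize([product_category_data], resources={"trino": fake_trino})
    assert result.success
    assert len(fake_trino.statements) == 2  # Hive external table + Iceberg CTAS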

Architecture Comparison

Original Approach (Airflow)

[Original pipeline diagram: Raw Data → Python Processing → Iceberg Tables → dbt Analytics, orchestrated by a basic, time-based DAG with scheduled, manual ETL steps]

My Enhanced Approach (Dagster)

[Enhanced pipeline diagram: Raw Data → Hive External → Iceberg Managed → dbt Analytics, driven by event-driven sensors, pure SQL CTAS with distributed processing, asset checks in a data quality framework, smart jobs, and SLA monitoring]

The Technology Stack

Both implementations share the same robust foundation:

Infrastructure Layer

  • Docker Compose — Local development environment
  • MinIO — S3-compatible object storage
  • Trino — Distributed SQL query engine
  • Nessie — Data catalog with Git-like versioning
  • Iceberg — Open table format with ACID properties

Data Layer

  • Raw Zone — Parquet files in MinIO (s3://lake/raw/)
  • Curated Zone — Iceberg tables with proper schemas
  • Analytics Zone — dbt-transformed dimensional models

Orchestration Layer

  • Original: Apache Airflow
  • Enhanced: Dagster with modern features

Why This Matters for Your Organization

Operational Excellence

  • Reliability: Event-driven processing reduces missed data
  • Efficiency: Smart partitioning minimizes compute waste
  • Visibility: Rich monitoring and lineage tracking
  • Maintainability: Clear asset definitions and dependencies

Developer Productivity

  • Faster iteration: Hot reloading and integrated testing
  • Better debugging: Detailed logs and execution context
  • Easier onboarding: Self-documenting asset catalog
  • Modern tooling: Industry-standard development experience

Business Value

  • Faster time-to-insight: Data available when it arrives
  • Higher data quality: Comprehensive validation framework
  • Lower operational costs: Efficient resource utilization
  • Future-proof architecture: Built on modern data stack

Getting Started

I encourage you to:

  1. Read the original article — It’s an excellent introduction to lakehouse concepts
  2. Try their implementation — Get familiar with the core technologies
  3. Compare with the Dagster version — See the orchestration differences
  4. Experiment with both — Understand the trade-offs

Quick Start with My Implementation

# Clone and start the infrastructure
git clone https://github.com/eric-thomas-dagster/dagster-lakehouse
cd dagster-lakehouse
./start_lakehouse.sh
# Generate sample data
python generate_adventureworks_data.py
# Start Dagster
cd dagster_project
dg dev

Then open the Dagster UI in your browser (dg dev prints the local URL) to explore the asset graph, runs, and logs.

Conclusion

The original lakehouse article provides an excellent foundation for understanding modern data architecture. By enhancing it with Dagster’s advanced orchestration capabilities, I’ve created a more robust, scalable, and maintainable data platform.

The key insight? Technology choice matters at every layer. While the core lakehouse technologies (MinIO, Trino, Iceberg, dbt) provide the foundation, the orchestration layer determines how effectively you can leverage them.

Dagster’s asset-centric approach, combined with its sensors, partitioning, and data quality features, creates a data platform that’s not just functional, but truly production-ready.

Try Both Approaches

I strongly recommend experiencing both implementations:

  • The original showcases excellent lakehouse fundamentals
  • My enhanced version demonstrates modern orchestration capabilities

The comparison will give you invaluable insights into how orchestration choices impact your data platform’s capabilities, maintainability, and developer experience.

Ready to build your own enhanced lakehouse? Check out my implementation at https://github.com/eric-thomas-dagster/dagster-lakehouse and let me know what you think!

