Community Showcase Part 1

June 3, 2026

Some of the most interesting Dagster projects come from the community. This post highlights creative community-built applications ranging from public data exploration and infrastructure monitoring to research automation and internal tooling, along with why their creators chose Dagster and what building with it was like.

Some of our favorite Dagster use cases are the ones we never could have predicted.

People in the community are using Dagster to explore public datasets, monitor infrastructure, automate research workflows, build internal tools, and experiment with entirely new kinds of data applications. Some projects are deeply technical, some are wonderfully niche, and all of them reflect the creativity of the people building with Dagster.

This post highlights a few community-built projects that caught our attention along with the stories behind them, why their creators picked Dagster, and what the experience of building them was like.

Working on something fun with Dagster? We’d love to hear about it.

Christian Casazza

LinkedIn | GitHub

Tell us a little about yourself and what you work on

I taught myself data engineering on the job because I began to see it as a force multiplier for data value creation. Investing in the data engineering foundation of a system makes everything else downstream easier. I first began working with developers when we needed to create pipelines for e-commerce operational analytics. I enjoyed building the pipeline logic, and then just kept asking the developers questions about how everything worked. With LLMs getting “good enough” beginning with GPT-4, I started teaching myself agentic engineering with open source data primitives.

Today, I’m building data value creation systems. My main focus has been to learn by doing, building as many end-to-end pipelines as possible. NYC and NY State have some of the best-curated open government data in the world, so that data has been my initial “Avatar” as I’ve built pipelines across domains like budget, transportation, real estate, legislation, and more. Modern open data primitives combined with coding agents allow me to go deeper into the pipelines, on both the data enhancement and the analytics side, without investing a ton of capital upfront. In addition to my main NY pipelines, I’ve also created data products around sports with great providers like NFLverse and prediction markets like Polymarket.

How did you first discover Dagster?

I first came across Dagster through social channels and noticed that “high signal” people in the data space seemed to like it, so once my pipelines reached suitable complexity I gave Dagster a shot. What stood out was that Dagster handled many of the generic “logistics” of data engineering (orchestration, scheduling, asset checks, factory creation, etc.) while still feeling natively Pythonic. I liked that I could develop locally and quickly without infrastructure, so the activation energy to start a new pipeline is low and the developer cycle is fast.

What project have you been building with Dagster?

I have been building a vertical data stack using open source data primitives. You can explore my public curated datasets at QueryStation.app. Dagster allows me to build detailed ingestion, cleaning, joining, and analytics pipelines.

My data pipelines are mainly built around Arrow, Parquet, DuckDB, Polars, and DuckLake. All raw data is initially standardized at the Parquet level, where the Polars LazyFrame engine typically handles the majority of my data wrangling. DuckDB is my engine for writing clean intermediate Parquet files to DuckLake and running downstream SQL analytics pipelines. DuckLake is my open table provider, which makes it much simpler to update tables over time and for the data application layer to consume them. Apache Arrow ensures all layers can speak with each other, such as passing a dataframe to DuckDB, or the TypeScript data app layer querying a DuckLake table and receiving Arrow IPC. Arrow also makes it easy to build a SQL-based API. Users who want to query my already-curated datasets, like NYC 311 or the various NYC budget datasets, can query QueryStation directly with SQL and get back Arrow, which can be read into a dataframe and analyzed by agents.

Dagster is the foundation of the data engineering system. It acts as the bridge that turns a collection of scripts using primitives into a linked, coherent system. It standardizes the data pipeline lifecycle by automatically handling DAG management, scheduling, asset checks, metadata storage, and more. Dagster is designed to embed trust into data pipelines so consumers actually use the outputs. The design of the repo is intentionally opinionated around Dagster, so that agents can focus on context-specific data engineering instead of the implementation of the underlying data software. One core idea is reducing the marginal cost of building and expanding new sources. I’ve been developing factory-style patterns where you give an AI agent a URL or API documentation, and the agent analyzes it to generate formal Dagster assets from templates, including assets that scrape the docs site into organized markdown files. The “Dagster expert” skill has been helpful for grounding my Dagster development in core Python best practices.

What was the hardest or most interesting problem you solved?

The most interesting challenge was getting the factory functions online and designing assets in a scalable way. Rather than designing every asset individually, I wanted shared patterns where assets could inherit common behavior like checks, metadata, and documentation. I built toward an asset registry concept: a way to automatically register assets, generate documentation, and allow agents to encode knowledge across work.

    # Imports added for this excerpt. SocrataIngestConfig, SchemaContract,
    # StandardEnrichments, PipelineResult, and normalize_schema are
    # project-specific names that the excerpt does not define.
    from collections.abc import Callable

    import polars as pl
    from dagster import PartitionMapping, PartitionsDefinition


    def create_socrata_pipeline(
        name: str,
        socrata_config: SocrataIngestConfig,
        schema: SchemaContract,
        *,
        # Organization (NEW)
        domain: str | None = None,
        geographic_scope: str | None = None,
        group: str | None = None,
        # Partitioning
        partitions_def: PartitionsDefinition | None = None,
        clean_partitions_def: PartitionsDefinition | None = None,
        partition_mapping: PartitionMapping | None = None,
        # Logic
        post_transform_fn: Callable[[pl.LazyFrame], pl.LazyFrame] | None = None,
        enrichments: StandardEnrichments | None = None,
        # ... more parameters
    ) -> PipelineResult:
        """Create a 2-stage Socrata pipeline: Landing (CSV) → Clean (Parquet)."""
        # Normalize schema: accept 2-tuple or 3-tuple, extract contract
        schema_contract, is_3tuple = normalize_schema(schema)
        # ... compose landing asset, clean asset, checks
        return PipelineResult(
            landing=landing_asset,
            clean=clean_asset,
            checks=checks,
        )
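
To make the registry idea concrete, here is a minimal, illustrative sketch using only the Python standard library. It is not code from the project and does not use Dagster; every name in it (PipelineSpec, REGISTRY, create_pipeline, render_docs, and the two example pipelines) is hypothetical. It shows one way a factory could give every pipeline the same baseline check, register the result by name, and generate documentation from the registry.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

Row = dict[str, object]
Check = Callable[[list[Row]], bool]


@dataclass
class PipelineSpec:
    """Hypothetical record describing one factory-built pipeline."""

    name: str
    domain: str
    description: str
    checks: dict[str, Check] = field(default_factory=dict)


# Hypothetical in-memory registry, keyed by pipeline name.
REGISTRY: dict[str, PipelineSpec] = {}


def not_empty(rows: list[Row]) -> bool:
    """Shared check that every factory-built pipeline inherits."""
    return len(rows) > 0


def create_pipeline(
    name: str,
    domain: str,
    description: str,
    extra_checks: dict[str, Check] | None = None,
) -> PipelineSpec:
    """Build a spec with the shared checks and register it by name."""
    if name in REGISTRY:
        raise ValueError(f"pipeline {name!r} is already registered")
    checks: dict[str, Check] = {"not_empty": not_empty}
    checks.update(extra_checks or {})
    spec = PipelineSpec(name, domain, description, checks)
    REGISTRY[name] = spec
    return spec


def render_docs() -> str:
    """Generate one markdown bullet per registered pipeline, sorted by name."""
    lines = []
    for spec in sorted(REGISTRY.values(), key=lambda s: s.name):
        check_names = ", ".join(sorted(spec.checks))
        lines.append(
            f"- {spec.name} ({spec.domain}): {spec.description} "
            f"[checks: {check_names}]"
        )
    return "\n".join(lines)


# Example registrations (names and descriptions are placeholders).
create_pipeline("nyc_311", "civic", "NYC 311 service requests")
create_pipeline(
    "nyc_budget",
    "budget",
    "NYC budget line items",
    extra_checks={"has_amount": lambda rows: all("amount" in r for r in rows)},
)
print(render_docs())
```

The design choice worth noting is that shared behavior lives in the factory rather than in each pipeline, so a new source only declares what is different about it.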

The hardest part was the front-loaded setup. Some of the work predated the dg CLI, so more of the structure had to be assembled manually. There are often multiple ways to solve the same problem in Dagster, so finding the mental model that makes the most sense takes some trial and error.

I am also still figuring out the best way to handle open table maintenance jobs, such as those needed for the DuckLake table, within Dagster. One idea was a maintenance asset that bundles checkpoint or table-maintenance behavior into the graph, while allowing DuckLake configurations to be set at the Python level. However, DuckLake is still fairly new, so some of these best practices will only become clear with time.

Why was Dagster a good fit for this project?

Dagster was a strong fit because the project involves many data pipelines that are completely unrelated domain-wise but need consistent patterns around asset code, checks, tests, metadata, scheduling, etc. The AssetFactory pattern makes it easy to design stable, reusable code APIs for the pure logistics aspects of data engineering.

Using a Dagster asset factory also makes it much easier to encode best practices and prevent agent-driven code drift across pipelines. For example, Polars is significantly more performant for data wrangling when leveraging the LazyFrame API with scan and sink_parquet than when using the traditional eager API with .collect, read, and write_parquet. I’ve found that agents, when given leeway, are prone to slipping manual .collect calls into pipelines, which creates premature materializations and turns lazy pipelines eager. The factory functions provide strong guardrails that keep transformations lazy end to end, so the agent can focus on the actual data wrangling logic instead of the implementation details.
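
One lightweight way to back up that kind of guardrail is a static check that flags eager-looking calls before a transformation is accepted. The sketch below is an illustration under stated assumptions, not the project's mechanism: find_eager_calls and the EAGER_CALLS list are hypothetical, and "read" from the text is assumed to mean read_parquet. It uses only the standard library's ast module, so the transformation source is parsed but never executed and Polars does not need to be installed to run it.

```python
import ast

# Method names treated as "eager" in this sketch. The list is an assumption
# based on the calls named in the text, not a complete Polars inventory.
EAGER_CALLS = {"collect", "read_parquet", "write_parquet"}


def find_eager_calls(source: str) -> list[tuple[int, str]]:
    """Return (line number, method name) for each eager-looking call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in EAGER_CALLS
        ):
            hits.append((node.lineno, node.func.attr))
    return sorted(hits)


# A transformation that stays lazy: scan, filter, sink.
LAZY_SOURCE = '''
import polars as pl

def transform(path_in, path_out):
    lf = pl.scan_parquet(path_in).filter(pl.col("borough") == "BROOKLYN")
    lf.sink_parquet(path_out)
'''

# The same transformation with a premature materialization slipped in.
EAGER_SOURCE = '''
import polars as pl

def transform(path_in, path_out):
    df = pl.scan_parquet(path_in).collect()
    df.write_parquet(path_out)
'''

print(find_eager_calls(LAZY_SOURCE))
print(find_eager_calls(EAGER_SOURCE))
```

The check matches on method names only, so it would also flag a .collect call on a non-Polars object. A factory could run it on a submitted transformation and reject or warn on any hits.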

Another really cool aspect of Dagster is that it’s a natural fit for building an agent-friendly data pipeline with a contextual domain knowledge system. Dagster stores its metadata (runs, asset info, etc.) in SQL tables, which can live in Postgres and be queried through an API. Dagster also lets users attach custom metadata to assets, which then shows up in the Dagster UI. By building on these existing primitives, I’ve been able to add arbitrary kinds of domain knowledge to assets. Key info like sources, analyst discovery facts, data quality quirks and caveats, etc. that agents discover as they work can be encoded into the pipeline with Dagster. Encoding the info into the pipeline makes it much easier to carry context across agent sessions and providers. It also makes it easier to build on previous work and create pipelines that go deeper with layered analysis.
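
The idea of carrying domain knowledge across sessions can be sketched with a small notes table keyed by asset. This example uses sqlite3 from the standard library as a stand-in; it is not Dagster's storage schema or API, and the table, the function names, and the note contents are all hypothetical placeholders. In Dagster itself, the equivalent information would be attached as asset metadata.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical store of per-asset domain notes."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS asset_notes (
            asset_key TEXT NOT NULL,
            kind      TEXT NOT NULL,
            note      TEXT NOT NULL,
            UNIQUE (asset_key, kind, note)
        )
        """
    )
    return conn


def add_note(conn: sqlite3.Connection, asset_key: str, kind: str, note: str) -> None:
    """Record a note; re-adding the same note is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO asset_notes VALUES (?, ?, ?)",
            (asset_key, kind, note),
        )


def notes_for(conn: sqlite3.Connection, asset_key: str) -> dict[str, list[str]]:
    """Return the notes for one asset, grouped by kind."""
    rows = conn.execute(
        "SELECT kind, note FROM asset_notes "
        "WHERE asset_key = ? ORDER BY kind, note",
        (asset_key,),
    ).fetchall()
    grouped: dict[str, list[str]] = {}
    for kind, note in rows:
        grouped.setdefault(kind, []).append(note)
    return grouped


# Placeholder notes, as an agent might record them while working.
conn = open_store()
add_note(conn, "nyc_311_clean", "source", "example source note")
add_note(conn, "nyc_311_clean", "caveat", "example caveat note")
add_note(conn, "nyc_311_clean", "caveat", "example caveat note")  # ignored
print(notes_for(conn, "nyc_311_clean"))
```

Passing a file path instead of ":memory:" would let a later session, or a different agent provider, read the same notes back.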

What advice would you give someone starting with Dagster?

One of Dagster’s strongest traits, how composable it is, can also be one of its pitfalls for early users. With Dagster, there are often multiple “correct” ways to implement a data pipeline in terms of how you design your code and where the logic lives. This can lead to refactor paralysis to find the “best” way that uses “all” the features.

It's best to start with one complete end-to-end pipeline, ideally something separate from your main work code. Build it, change things, break things, and learn how the pieces fit together. Then slowly layer in complexity, adding more advanced features like scheduling, asset checks, and custom metadata later. See what breaks and what changes with edits until you have a shape you like. I would also recommend periodically running agents through your repo using the curated Dagster-Expert and Dignified Pipeline skills from the Dagster team. They ground your code in best practices and help avoid agentic drift.

However, once you get past the initial learning curve, don’t be afraid to be ambitious with Dagster. Its composability lets you build some really powerful workflows with a code-native workflow builder. The key is to start simple, build something that works, then add sophistication gradually and see what breaks.

Personally, I recommend using my repo as a way to get started with Dagster. The opinionated setup already has much of the core infrastructure and many of the design patterns in place. Anyone working with an agent can use the repo to build their own end-to-end data pipeline, using the existing code as an example base for what I believe is a very wide range of pipelines.
