By Sean Lopp (@lopp)
Declarative Scheduling is an exciting feature that was part of our 1.1 release. For any data engineer or analytics engineer looking to orchestrate dbt models, Declarative Scheduling will provide some ground-breaking improvements.
For those unfamiliar with Dagster, it is an open-source framework that lets you easily interleave and orchestrate compute wherever it runs - whether that's dbt, Python, or other modern data stack tools.
Unlike other orchestrators, Dagster is built on software-defined assets (SDAs). Using SDAs with dbt and dbt Cloud allows you to:
- Orchestrate the entire pipeline from one place: Use dbt in the context of your entire data platform, including sources upstream of the warehouse (like Airbyte or Spark) or downstream of dbt (like ML models in Python)
- Build that ‘single pane of glass’: View detailed historical metadata, logs, and lineage for each dbt model and centralize details about the model (the sql query, the schema)
- Speed up your development lifecycle and adopt development best practices: Tap into dbt profiles to flip between dev and prod targets
- Make your pipelines more robust and easier to maintain: Invoke dbt under specific situations (e.g., to run dbt when external data is ready instead of just guessing it will land at 9am). Target a specific model or subset of models. Automate the rerun of failed dbt jobs in a surgical fashion.
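To make the software-defined asset concept concrete, here is a minimal sketch of a plain Python asset (the asset name and its contents are illustrative, not part of any real project):

from dagster import asset

# A minimal software-defined asset: the function name becomes the asset key,
# and the return value is persisted by Dagster's IO manager.
@asset
def raw_orders():
    # Stand-in for pulling data from a source system upstream of the warehouse.
    return [{"order_id": 1, "amount": 42.0}]

dbt models loaded through the Dagster integration show up as assets of exactly this kind, so they can be mixed freely with Python-defined assets like this one.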
Keep it Fresh with Declarative Scheduling
Now, with the addition of Declarative Scheduling, your ability to orchestrate dbt models as part of a larger pipeline has been taken to an entirely new level.
With our 1.1 release, we have introduced the ability to use Declarative Scheduling, which allows you to specify freshness policies (i.e. how up-to-date an asset needs to be) for any data asset in your asset graph, including dbt models.
Declarative scheduling for dbt has a few key benefits:
- New models can be created and automated without worrying about wedging a model into a job or schedule
- Models can be scheduled based on stakeholder expectations
- Pipelines can be optimized to save money by reducing redundant model runs
Let’s look at each of these benefits in more detail.
Dagster and dbt
Most other orchestrators - Airflow being the most notable - model data pipelines as tasks. These tasks represent imperative operations that must be scheduled as part of an overall pipeline. To Airflow, these tasks are black boxes. Airflow (and other orchestrators) don’t know what’s going on inside that box, and often don’t know what’s going into it or coming out of it, either.
Treating dbt as one mega-task has an unfortunate side effect: most teams that orchestrate dbt with Airflow simply have one giant “dbt run” task in their DAG. Inside this task, hundreds of models might be refreshed. Because that task is a black box, the orchestrator has no insight into what’s going on inside of dbt, so it has to refresh everything everywhere all at once rather than just what actually needs to be refreshed.
Dagster does not have these problems. The Dagster dbt integration (and dbt Cloud integration) pulls in each dbt model as its own software-defined asset, allowing a fine-grained understanding of the dbt graph, its upstreams, and its downstreams. This information enables Dagster to provide scheduling guarantees that other orchestrators cannot.
All of this means that you can:
- Understand success and failure at the level of individual tables, which is likely what your stakeholders care about.
- Fix individual dbt models on failure, and then pick up the pipeline execution where you left off.
- Surgically materialize a specific dbt model and all of the upstream assets that the model depends on - and only those assets.
Challenges of scheduling dbt models
Now that we understand some of the basics of the Dagster and dbt integration, we can look deeper at scheduling. After developing a new model in dbt, scheduling it can be a challenge. There are several factors to consider when deciding how to schedule a dbt model.
First, you must determine what job(s) the model should be a part of. Your organization may have multiple jobs that perform different functions, such as reporting for marketing, updating executive KPIs, or running checks for operational alerting. These functions may all depend on a shared set of dbt models. Deciding which job(s) your model should be a part of can be confusing and cause maintenance headaches in the future.
Second, different stakeholders might have different expectations for how fresh their data should be. Coming up with a set of cron schedules that captures all of these SLAs is a lot of manual work.
Finally, in a complex DAG of dependencies, it can be challenging to prevent expensive models from running too often. If a model is run unnecessarily, it can waste resources and negatively impact performance.
Declarative scheduling of dbt with Dagster
Dagster's declarative scheduling system addresses the challenges of scheduling dbt models by allowing users to specify freshness policies for their models. Freshness policies are statements that define how often a data asset should be updated, such as "this dbt model must be no more than one hour old at all times." When a freshness policy is specified, Dagster will automatically run the dbt model only when it needs to in order to meet the policy's requirements.
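The same idea applies to non-dbt assets. As a minimal sketch (the asset name and body here are hypothetical), a freshness policy can be attached directly to a Python asset:

from dagster import FreshnessPolicy, asset

# Declare that this asset should never be more than 60 minutes out of date.
@asset(freshness_policy=FreshnessPolicy(maximum_lag_minutes=60))
def hourly_metrics():
    # Hypothetical computation; Dagster only cares about the declared policy here.
    return {"orders_per_hour": 0}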
This greatly simplifies the scheduling process for dbt models. You don't need to explicitly add your model to a job or think about cron schedules. You also don't need to worry about the SLAs of your downstream models, as Dagster will automatically ensure that all dependencies are met. And, because your models will only run as often as they need to, you can save resources and improve performance.
In addition to these benefits, using Dagster with dbt allows you to leverage all of these great capabilities with both dbt and non-dbt compute. You can easily interleave Python, dbt, and other tools in the modern data stack, and operate your entire data platform from a single pane of glass: Dagster.
Example
To see how these concepts work in practice, let’s look at an example: the fictional Hooli Data Platform project.
In this project, they’ve built a dbt model called daily_order_summary.sql. This is a critical model that could be used by numerous different teams across the entire organization, and it’s also very expensive to compute. That makes it a perfect candidate for Dagster’s declarative scheduling.
First, the Hooli data engineering team orchestrated the dbt project using dagster-dbt:
from dagster import file_relative_path
from dagster_dbt import load_assets_from_dbt_project

# Load every model in the dbt project as its own software-defined asset
dbt_assets = load_assets_from_dbt_project(
    project_dir=file_relative_path(__file__, "../dbt_project"),
    profiles_dir=file_relative_path(__file__, "../dbt_project/config"),
)
This created one Dagster software-defined asset for every dbt model in the project. They were then able to easily leverage those dbt models from other heterogeneous assets in their data platform.
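For instance, a downstream Python asset can consume a dbt model directly. Here is a rough sketch, assuming a dbt model named orders exists in the project (the names are hypothetical):

from dagster import AssetIn, asset

# A Python asset downstream of the (hypothetical) "orders" dbt model.
# Dagster's IO manager loads the model's table and passes it into the function.
@asset(ins={"orders": AssetIn()})
def order_count(orders):
    # Placeholder logic; in practice this might train an ML model or build a report.
    return len(orders)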
Next, the Hooli analytics engineering team annotated the model with a freshness policy:
{{
  config(
    dagster_freshness_policy={"maximum_lag_minutes": 60}
  )
}}

select
  order_date,
  n_orders as num_orders
from {{ ref("order_stats") }}
This freshness policy ensures that the daily_order_summary model is never more than 60 minutes out-of-date.
Next, the Hooli data engineering team created a reconciliation sensor that will detect when any asset is out of SLA and will automatically kick off a run to refresh it and any necessary dependencies. Here’s the short code snippet:
from dagster import AssetSelection, build_asset_reconciliation_sensor

# Refresh any asset (and its upstream dependencies) whenever it falls behind its freshness policy
freshness_sla_sensor = build_asset_reconciliation_sensor(
    name="freshness_sla_sensor",
    asset_selection=AssetSelection.all(),
)
As you can see, this code is fully declarative and is easy to adopt incrementally. And think of all the things we didn’t do:
- We didn’t write any schedules
- We didn’t add our assets to any jobs
- We didn’t think about the SLAs of our upstreams or downstreams
We just told Dagster our SLAs and Dagster took care of the rest! Whenever daily_order_summary or any of its dependencies gets too old, a run is automatically kicked off to execute dbt run for daily_order_summary and any of its stale dependencies. Finally, alerts can be set to notify the data team when an SLA is violated rather than on spurious job failures, decreasing alert fatigue.
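To tie the example together, the dbt assets and the sensor can be registered with Dagster alongside a configured dbt CLI resource. This is only a rough sketch, assuming the same relative paths as in the earlier snippet:

from dagster import Definitions, file_relative_path
from dagster_dbt import dbt_cli_resource

defs = Definitions(
    assets=dbt_assets,
    sensors=[freshness_sla_sensor],
    resources={
        "dbt": dbt_cli_resource.configured(
            {
                "project_dir": file_relative_path(__file__, "../dbt_project"),
                "profiles_dir": file_relative_path(__file__, "../dbt_project/config"),
            }
        )
    },
)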
Conclusion
Overall, we’ve found that declarative scheduling is a superpower for data teams that have adopted Dagster. With the latest Dagster release, this superpower has come to our dbt integration.
We hope you found this post informative and encourage you to try out the example for yourself to see the benefits of declarative scheduling in action. Check out the docs to learn more, and don’t forget to join the community and star us on GitHub!
Further reading:
If you are interested in exploring in more detail how Dagster supports the use of dbt models in your data pipelines, here are some additional resources:
- Orchestrating Python and dbt with Dagster - an overview of the concepts and benefits of managing dbt models through an orchestrator like Dagster.
- Using dbt with Dagster software-defined assets - Dagster SDA + dbt tutorial
- dbt_python_assets example - jump straight into the code
- The “Migrating off dbt Cloud™” article
- The Dagster 1.1 release notes.
- Docs on freshness policies, automatic asset reconciliation, code versions, and source asset versions.
- Ask us questions in Slack
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!