By Joe Van Drunen
With the release of dagster-airflow 1.1.17, data engineering teams have access to tooling that makes porting Airflow DAGs to Dagster much easier. Data teams looking for a radically better developer experience can now easily transition away from legacy imperative approaches and adopt a modern declarative framework that provides excellent developer ergonomics.
For reference, the dagster-airflow docs can be found here.
Data platform migration - not an easy lift
In the early days of Dagster, before we released Dagster 1.0, there was a lot of interest from organizations running Apache Airflow who were looking for a better orchestrator. The frustrations with Airflow were well known, and the need for a better framework was widely appreciated. Dagster emerged as the Apache Airflow alternative.
Since the release of Dagster 1.0 and our GA release of Dagster Cloud on August 9th 2022, many data teams have taken the plunge and switched from Airflow to Dagster. The difference in developer ergonomics, deployment, observability, testability and support for real CI/CD has had a significant impact on their ability to ship high-quality data pipelines that are maintainable, observable, and extensible. For example, one team cut the time to test a change in development from 10 minutes to under a minute.
Still, porting over dozens or hundreds of Apache Airflow DAGs to Dagster, and refactoring them to take full advantage of Dagster’s Software-defined Assets was intimidating—enough to deter many teams for whom working on Airflow was painful, but not painful enough to make the switch.
Sure, it’s possible to run old legacy pipelines on Airflow and build anything new on Dagster. And that is a good way to get started. But realistically, no team wants to manage two platforms, certainly not for long.
dagster-airflow - the easy way to move from the old to the new
Enter dagster-airflow, a new package that provides interoperability between Dagster and Airflow. The main scenario for using the dagster-airflow adapter is to do a lift-and-shift migration of all your existing Airflow DAGs into Dagster.
For some use cases, you may also want to orchestrate Dagster job runs from Airflow. While a less common use case, we will touch on it at the end of this article.
The cognitive load of switching frameworks
One often underestimated challenge in making these types of transitions is the cognitive load involved in switching frameworks. Developers familiar with Airflow will have to learn a slightly different set of concepts in Dagster. Luckily, we have provided handy guides such as an "Airflow vs Dagster concept map" and a "Learning Dagster from Airflow" tutorial.
"Transitioning to Dagster was a calculated risk that paid off. The development and testing process has become streamlined and the data quality checks have drastically reduced the time I spend troubleshooting.- Jatin Solanki, founder, Decube.io in Migrating from Apache Airflow to Dagster.
The built-in tools for managing configurations, maintaining data quality, and visualizing data lineage are a godsend. Plus, the local development mode makes it easy to test pipelines before deployment."
Option 1: Lift and Shift from Airflow to Dagster
Most organizations that migrate from Airflow to Dagster will first do a limited POC to quickly test things out and build confidence that Dagster is the superior solution. Once the team is ready to make the switch, they lift and shift their pipelines over to Dagster. This process typically takes a few days to a couple of weeks.
While lift-and-shift is ultimately the most common pattern, organizations can make this transition on their own timelines, and we provide plenty of options for incremental adoption.
As this is the most popular migration pattern, we will cover this one first.
From DAGs to Jobs
The fastest path for moving off Airflow is to port Airflow DAGs onto Dagster and execute them there as ops - the core unit of computation in Dagster.
Taking this step has several advantages right off the bat:
- You can rapidly (typically within a day) migrate DAGs from your Airflow environment onto Dagster and see them execute there.
- You can build out your CI/CD pipeline for these DAGs, including local testing and execution.
- You can rapidly demonstrate that this migration path will work and build familiarity and confidence with the Dagster framework.
In Dagster, ops are organized into jobs. A single op is a relatively simple and reusable computation step, whereas a job is a logical sequence of ops that achieve a specific goal, such as updating a table.
Jobs are the main unit of execution and monitoring in Dagster.
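For readers new to these concepts, here is a minimal, illustrative op/job pair (the names are made up for this example):

from dagster import job, op


@op
def extract_rows():
    # A trivial computation step.
    return [1, 2, 3]


@op
def load_rows(rows):
    # A second step that consumes the output of the first op.
    print(f"loaded {len(rows)} rows")


@job
def simple_etl_job():
    # The job wires the ops together into an executable graph.
    load_rows(extract_rows())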
Since an Airflow DAG maps cleanly to a Job in Dagster, this is the most commonly used approach for migrating, and it works with almost every DAG. However, it is also possible to begin moving your legacy Airflow DAGs into a more modern declarative framework for orchestration: Software-defined Assets.
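As a rough sketch of what the DAG-to-Job path looks like, the dagster-airflow package provides a helper that converts an Airflow DAG object into a Dagster job. The DAG below is purely illustrative, and you should check the migration guide for the exact API in your version:

import pendulum
from airflow.models import DAG
from airflow.operators.bash import BashOperator
from dagster import Definitions
from dagster_airflow import make_dagster_job_from_airflow_dag

# A minimal Airflow DAG standing in for one of your legacy DAGs.
with DAG(
    dag_id="legacy_etl",
    start_date=pendulum.datetime(2023, 1, 1),
    schedule_interval="@daily",
) as legacy_dag:
    BashOperator(task_id="extract", bash_command="echo extracting")

# Each Airflow task in the DAG becomes an op in the resulting Dagster job.
legacy_etl_job = make_dagster_job_from_airflow_dag(legacy_dag)

defs = Definitions(jobs=[legacy_etl_job])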
From Airflow DAGs to Dagster Software-defined Assets
In Dagster, Software-defined Assets are abstractions that power the declarative nature of the framework. With SDAs, developers can shift their focus from the execution of tasks to the assets produced - the end product of the data engineering effort.
Software-defined Assets provide much greater observability, advanced scheduling, and - combined with Dagster’s I/O Managers - make it much easier to switch your work between environments, such as testing in development, then switching pipelines to production.
Switching from an Airflow DAG to a pipeline built around Software-defined Assets is a more advanced maneuver than transitioning to Jobs, but don't let that intimidate you. The new dagster-airflow library makes this transition much smoother, and you can get to value much faster than by manually refactoring your code.
The first step in porting Airflow DAGs to Dagster Software-defined Assets is to import them using the dagster-airflow library, as described in the Dagster docs. Once imported, the Airflow DAG becomes a single (and possibly quite large) partitioned Dagster asset known as a graph-backed asset.
You will find more information on graph-backed assets in the Dagster docs here, but essentially they are Software-defined Assets that are computed using an arbitrary number of ops, all managed as one single asset.
An important thing to note here is that dagster-airflow will convert your Airflow schedule into time-based partitions. This pattern of creating partitioned assets from your Airflow DAG works well if your existing Airflow DAG was outputting data partitioned by its schedule interval.
Running this import gives you a single partitioned Dagster asset, backed by a graph of ops, that you can now execute in Dagster.
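As a minimal sketch, importing the whole DAG as one partitioned, graph-backed asset can look like this (the module path is illustrative and mirrors the sample repo referenced below):

from dagster import Definitions
from dagster_airflow import load_assets_from_airflow_dag

# Illustrative import path for the existing Airflow DAG object.
from with_airflow.airflow_migration_dag import migration_dag

# With no task-to-asset mapping supplied, the entire DAG is loaded as a single
# graph-backed asset, partitioned by the DAG's schedule interval.
migration_assets = load_assets_from_airflow_dag(migration_dag)

defs = Definitions(assets=migration_assets)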
This is great, but your Airflow DAG might have been producing multiple distinct Assets linked to specific Airflow tasks, so next we will break out the imported Tasks one-by-one into stand-alone Dagster assets.
We provide a sample repo for this process here on GitHub along with a GitPod instance for you to try it out.
In our example, here is the starting code, which imports the Airflow DAG into Dagster as assets:
from dagster import AssetKey
from dagster_airflow import load_assets_from_airflow_dag
from with_airflow.airflow_migration_dag import migration_dag

migration_assets = load_assets_from_airflow_dag(
    migration_dag,
    task_ids_by_asset_key={
        AssetKey("top_story_ids"): {"load_top_story_ids"},
        AssetKey("top_stories"): {"load_top_stories"},
    },
    upstream_dependencies_by_asset_key={
        AssetKey("top_stories"): {AssetKey("top_story_ids")}
    },
)
Starting with the first op in the DAG, we refactor it as a native Dagster asset, while the remaining tasks are still imported with load_assets_from_airflow_dag from our dagster-airflow package. Our newly rewritten asset now looks like this. Note that load_assets_from_airflow_dag() now only maps the remaining load_top_stories task:
import requests
from dagster import AssetKey, asset
from dagster_airflow import load_assets_from_airflow_dag
from with_airflow.airflow_migration_dag_int import migration_dag


@asset(
    group_name="partially_migrated"
)
def top_story_ids():
    newstories_url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    top_10_newstories = requests.get(newstories_url).json()[:10]
    return top_10_newstories


migration_assets = load_assets_from_airflow_dag(
    migration_dag,
    task_ids_by_asset_key={
        AssetKey("top_stories"): {"load_top_stories"}
    },
    upstream_dependencies_by_asset_key={
        AssetKey("top_stories"): {AssetKey("top_story_ids")}
    },
)
Since the imported DAG was also an Asset, what we have now is a Multi-Asset, which we can execute and test.
We can continue this process, testing as we go. Many users take the opportunity to improve the logic in their pipelines and simplify or deduplicate as they go.
Eventually, we have fully refactored the old imported DAG, and we can jettison our graph-backed asset, say goodbye to the old and enjoy all the benefits of an asset-centric orchestrator.
With a newly refreshed and cleaned pipeline, we can now observe execution steps, introduce new storage logic to support local development, and observe the execution of the job in more detail.
Our final code is now simply written as:
import requests
import pandas as pd
from dagster import asset


@asset
def top_story_ids():
    newstories_url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    top_10_newstories = requests.get(newstories_url).json()[:10]
    return top_10_newstories


@asset
def top_stories(top_story_ids):
    results = []
    for item_id in top_story_ids:
        item = requests.get(f"https://hacker-news.firebaseio.com/v0/item/{item_id}.json").json()
        results.append(item)
    df = pd.DataFrame(results)
    return df
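From here, the assets can be loaded into a Definitions object and materialized locally as a quick test. A minimal sketch (the module path is illustrative):

from dagster import Definitions, materialize

# Illustrative import path for the fully migrated assets shown above.
from with_airflow.assets import top_story_ids, top_stories

defs = Definitions(assets=[top_story_ids, top_stories])

if __name__ == "__main__":
    # Materialize both assets in-process, e.g. while iterating locally.
    result = materialize([top_story_ids, top_stories])
    assert result.success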
Group 1001’s story
One organization that transitioned from Airflow to Dagster using the dagster-airflow library is Group 1001, a technology-driven financial services company specializing in insurance and annuity products.
“It was phenomenal […] In about four hours of testing it out, we were able to run an Airflow 1.10.15 local development sandbox on our Macs, executing our KubernetesPodOperator DAG using the Docker image from our Google Cloud repository. We had never seen a true local Airflow execution of that before.”
Group 1001’s data team of two went on to migrate 73 Airflow DAGs over to Dagster within 2 weeks of evaluating Dagster. In just two days, they had the full GitLab CI/CD pipeline up and running.
Option 2: Trigger Dagster jobs from Airflow
While not a typical pattern, some teams may wish to work with Dagster but sit in an organization with an established Airflow implementation. In this scenario, you may want to build jobs and manage data assets in Dagster but trigger their execution from Airflow.
You can orchestrate Dagster job runs from Airflow by using the DagsterCloudOperator or DagsterOperator operators in your existing Airflow DAGs, and a quick guide is provided in the migration guide.
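As a rough sketch, a DagsterCloudOperator task in an existing Airflow DAG might look like the following. The parameter names shown here (repository location, repository name, job name) are assumptions based on the migration guide and may differ by version, so treat them as illustrative:

from datetime import datetime
from airflow.models import DAG
from dagster_airflow import DagsterCloudOperator

with DAG(
    dag_id="trigger_dagster_job",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Triggers a run of a job in Dagster Cloud as part of the Airflow DAG.
    run_dagster_job = DagsterCloudOperator(
        task_id="run_dagster_job",
        repository_location_name="example_location",  # assumed parameter names;
        repository_name="my_dagster_project",         # see the migration guide for
        job_name="all_assets_job",                    # your dagster-airflow version
    )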
This pattern will allow you to perform most of your development work within the modern development environment of Dagster and only need to touch Airflow when scheduling the execution of your new Jobs or Assets.
Summary:
To wrap up, migrating off Airflow and onto Dagster will boost your team's ability to rapidly develop, deploy, and maintain high-quality pipelines so that they can deliver more effectively against your stakeholders' expectations.
You can join the many data teams that were looking for an Airflow alternative and are now working on Dagster. Join us and rediscover how fun working with data can be!
Here is a summary of key resources for organizations looking to migrate off Airflow and onto Dagster:
- The migration guide.
- The dagster-airflow library.
- Looking for an example of an Airflow to Dagster migration? Check out the dagster-airflow migration example repo on GitHub.
- Join our growing community on Slack, and get answers straight from the Dagster Labs engineering team or other Dagster experts from the community.
- Finally, check out the Feb. 2023 Airflow → Dagster migration event, where we provide more inspiration, tips, and insights for a smooth migration.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a Github discussion. If you run into any bugs, let us know with a Github issue. And if you're interested in working with us, check out our open roles!