September 20, 2022 • 9 minute read
Dagster vs. Airflow
Dagster takes a radically different approach to data orchestration than other tools. We often get asked why a data team should choose Dagster over Apache Airflow. Here, we compare Dagster and Airflow, in five parts:
- 🌎 The 10,000 Foot View
- 🔁 Orchestration and Developer Productivity
- 📃 Orchestrating Assets, Not Just Tasks
- 📦 Container-Native Orchestration
- 🔬 Other Differences: Data Passing, Event-Driven Execution, and Backfills
We also give some pointers on how to adopt Dagster when you already have a heavyweight Apache Airflow installation that’s not going away any time soon.
The 10,000 Foot View
Data practitioners use orchestrators to build and run data pipelines: graphs of computations that consume and produce data assets, such as tables, files, and machine learning models.
Apache Airflow, which gained popularity as the first Python-based orchestrator to have a web interface, has become the most commonly used tool for executing data pipelines.
But first is not always best. Airflow dutifully executes tasks in the right order, but does a poor job of supporting the broader activity of building and running data pipelines. Airflow’s design, a product of an era when software engineering principles hadn’t yet permeated the world of data, misses out on the bigger picture of what modern data teams are trying to accomplish. It schedules tasks, but doesn’t understand that tasks are built to produce and maintain data assets. It executes pipelines in production, but makes it hard to work with them in local development, unit tests, CI, code review, and debugging.
Data teams who use Airflow, including the teams we’ve previously worked on, face a set of struggles:
- They constantly catch errors in production and find that deploying changes to data feels dangerous and irreversible.
- They struggle to understand whether data is up-to-date and to distinguish trustworthy, maintained data from one-off artifacts that went stale months ago.
- They confront lose-lose choices when dealing with environments and dependency management.
- They face an abrasive development workflow that drags down their velocity.
These aren't issues that can be fixed with a few new features. Airflow's fundamental architecture, abstractions, and assumptions make it a poor fit for the job of data orchestration and today’s modern data stack.
How Dagster compares to Apache Airflow
We built Dagster to help data practitioners build, test, and run data pipelines. We observed that there was a dramatic mismatch between the complexity of the job and the tools that existed to support it. We believed that the right tools could make data practitioners 10x more productive.
Dagster and Airflow are conceptually very different, but they’re frequently used for similar purposes, so we’re often asked to provide a comparative analysis.
At a high level, Dagster and Airflow are different in three main ways:
- Dagster is designed to make data practitioners more productive. It’s built to facilitate local development of data pipelines, unit testing, CI, code review, staging environments, and debugging. Airflow makes pipelines hard to test, develop, and review outside of production deployments.
- Dagster supports a declarative, asset-based approach to orchestration. It enables thinking in terms of the tables, files, and machine learning models that data pipelines create and maintain. Airflow puts all its emphasis on imperative tasks.
- Dagster is cloud- and container-native. Airflow makes it awkward to isolate dependencies and provision infrastructure.
In this post, we’ll dig into each of these areas in greater detail, as well as differences in data-passing, event-driven execution, and backfills. We’ll also discuss Dagster’s Airflow integration, which allows you to build pipelines in Dagster even when you’re already using Airflow heavily.
Orchestration and Developer Productivity
Your choice of orchestrator has a huge impact on how fast you can develop your data pipelines and how much of your time you spend fixing them.
If you can run your data pipeline before merging it to production, you can catch problems before they break production. If you can run it on your laptop, you can iterate on it quickly. If you can run it in a unit test, you can write a suite that expresses your expectations about how it works and continually defends against future breakages.
With Airflow, it’s difficult and frustrating to do any of these. With Dagster, almost all users do them. We'll explore how this works in the next couple of sections.
“Local dev environment is an incredible boon for productivity. This in turn, enables proper unit testing, both locally and in CI/CD. With smart use of the resources concept you can have end-to-end test coverage of your pipelines without having to deploy to a test server. Branch deployments are amazing for integration testing - I'm not aware of anything even close for Airflow” - Zachary Romer, Data Infrastructure Engineer at Empirico Tx
Running pipelines in different environments
Data pipelines written for Airflow are typically bound to a particular environment. To avoid dependency hell, most guides recommend defining Airflow tasks with operators like the KubernetesPodOperator, which dictates that the task gets executed in Kubernetes. When a DAG is written in this way, it’s nearly impossible to run it locally or as part of CI. It also requires opting out of all of the integrations that come out-of-the-box with Airflow.
On the other hand, Dagster makes it easy to write data pipelines without coupling them to any particular environment. This makes it possible to run them in a variety of contexts.
- In Dagster, execution is pluggable at a fine-grained level, which means that you can take a job that runs distributed across a Kubernetes cluster in production, and run it inside a single process within a unit test.
- Dagster makes it easy to alter the behavior of your pipelines based on their environment. A pipeline can interact with a mock version of Slack when running in a unit test and interact with the production Slack when running in production.
- Dagster makes it easy to separate business logic from I/O. A pipeline can write DataFrames to memory when executing within a unit test and store them in Snowflake when executing in production.
- Dagster Cloud’s branch deployments, which provide a lightweight staging environment per branch, take this even further. When you’ve created a pull request or pushed up a git branch, you get a deployment that includes the changes on that branch. The visual representation of your pipelines becomes part of code review, and you can run your jobs before merging to make sure that they work as intended.
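To make the mock-versus-production Slack idea concrete, here is a minimal sketch in plain Python. It is not Dagster's resource API: `ProdSlack`, `MockSlack`, `RESOURCES_BY_ENV`, and `run_pipeline` are hypothetical names invented for illustration. The point it shows is that the business logic receives its Slack client from the outside, so a unit test can swap in a fake without touching the pipeline code.

```python
from dataclasses import dataclass, field


class ProdSlack:
    """Stand-in for a real Slack client (hypothetical; no network call is made here)."""

    def send(self, channel: str, text: str) -> None:
        raise RuntimeError("would call the real Slack API in production")


@dataclass
class MockSlack:
    """Records messages in memory so a unit test can inspect them."""

    sent: list = field(default_factory=list)

    def send(self, channel: str, text: str) -> None:
        self.sent.append((channel, text))


def notify_on_row_count(rows: list, slack) -> int:
    """Business logic: count rows and report. It never decides which Slack to use."""
    count = len(rows)
    slack.send("#data-alerts", f"loaded {count} rows")
    return count


# Each environment supplies its own resources; the pipeline code stays unchanged.
RESOURCES_BY_ENV = {
    "test": lambda: {"slack": MockSlack()},
    "prod": lambda: {"slack": ProdSlack()},
}


def run_pipeline(env: str, rows: list):
    """Build the resources for the given environment, then run the business logic."""
    resources = RESOURCES_BY_ENV[env]()
    count = notify_on_row_count(rows, resources["slack"])
    return count, resources
```

In a unit test you would call `run_pipeline("test", rows)` and assert on `resources["slack"].sent`, which never leaves the process.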
Lightweight ad-hoc execution
When developing locally or running a unit test, often you want to be able to launch a run immediately, in a lightweight way.
With Airflow, this is very difficult - all executions go through its scheduler loop. To launch any run, you need to have a long-running scheduler process that’s monitoring a database. After launching a run, you need to wait for the scheduler to see it, which creates a laggy experience that doesn’t give the quick feedback of a local development loop.
On the other hand, Dagster’s default execution is extremely lightweight. You can execute a job within a Python process by calling a function, without any persistent service up and running. If you want access to Dagster’s UI, you can spin it up with a single command and don’t need to run a database or connect to any services in the cloud. From Dagster’s UI, you can go to any job and click a button to immediately launch a run.
In addition, Dagster’s run configuration system makes it easy to supply parameters to runs when you’re launching them. Before launching a run, you can enter config in Dagster’s launchpad that the code inside your run will have access to.
Orchestrating Assets, Not Just Tasks
“Airflow was built to string tasks together, not provide an overview of all the ways data is flowing or what’s causing issues.” - Going with the Airflow - David Jayatillake
The main reason that people build data pipelines is to create and maintain data assets, like tables, files, and ML models. An orchestrator that includes assets as a central part of its data model enables more straightforward debugging, more ergonomic APIs, integrated lineage and observability, and deeper, more powerful integrations with modern data stack tools like dbt, Fivetran, and Airbyte.
Thinking in data assets
Apache Airflow is task-focused rather than asset-focused: you define tasks and tell it to run them at particular times, rather than declaring data assets that you want to keep up-to-date. This means that you end up needing to constantly translate between the data assets you care about and the tasks that update them.
If you notice that a table has weird data and want to run the task that updates it or see what’s in the logs, you need to fish around in Airflow to find that task. This might be especially difficult for someone who relies on the table but isn’t intimately familiar with the data pipeline that creates it.
Dagster has an asset layer that’s deeply integrated into its core, which allows you to think in assets when you’re building and operating your data pipelines. This makes it easy for Dagster to directly answer questions like:
- Is this asset up-to-date?
- What do I need to run to refresh this asset?
- When will this asset be updated next?
- What code and data were used to generate this asset?
- After pushing a change, what assets need to be updated?
If you try to pose these questions to Airflow, you’ll find that it’s just not designed to answer them, because assets are not a core part of its data model.
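As a rough illustration of why an asset graph makes these questions answerable, here is a toy sketch in plain Python. It is not how Dagster implements lineage or staleness: the `UPSTREAM` mapping, the asset names, and the "stale if any direct upstream asset is newer" rule are all assumptions made for the example.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "raw_events": [],
    "cleaned_events": ["raw_events"],
    "daily_summary": ["cleaned_events"],
    "churn_model": ["daily_summary"],
    "raw_users": [],
    "user_report": ["raw_users", "daily_summary"],
}


def downstream_of(changed: str, upstream: dict) -> list:
    """Return every asset that transitively depends on `changed`, breadth-first."""
    # Invert the mapping: asset -> assets that read from it.
    children = {name: [] for name in upstream}
    for name, deps in upstream.items():
        for dep in deps:
            children[dep].append(name)

    seen, order, queue = set(), [], deque([changed])
    while queue:
        current = queue.popleft()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def is_up_to_date(asset: str, upstream: dict, last_updated: dict) -> bool:
    """Simplistic rule: an asset is stale if a direct upstream asset is newer than it."""
    return all(last_updated[dep] <= last_updated[asset] for dep in upstream[asset])
```

With a structure like this, "after pushing a change to `cleaned_events`, what needs to be updated?" becomes a single call to `downstream_of("cleaned_events", UPSTREAM)`. A task-only model has no such mapping to traverse.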
Note that Dagster assets are built on an underlying mature, imperative orchestration layer. The imperative layer can be used as well in situations where assets aren’t the appropriate abstraction.
Assets and code ergonomics
Thinking in assets allows you to express your intentions more directly, which means less code boilerplate. Compare an Airflow DAG with Dagster’s software-defined asset API for expressing a simple data pipeline with two assets:
The Airflow DAG follows the recommended practice of using the KubernetesPodOperator to avoid issues with dependency isolation. It also specifies every dependency twice: once when constructing the DAG, and once inside the task when reading the upstream data.
Dagster software-defined assets know which other assets they depend on, so there's no need to define a DAG object, and there's no need to redundantly specify task dependencies and data dependencies.
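To show the idea of a dependency being stated exactly once, here is a toy sketch in plain Python. It is a stand-in written for this comparison, not Dagster's implementation: the `asset` decorator, the `ASSETS` registry, and the `materialize` helper below are hypothetical. Each function's parameter names double as its list of upstream assets, so no separate DAG object is needed.

```python
import inspect

ASSETS = {}  # asset name -> function that computes it


def asset(fn):
    """Toy decorator (not Dagster's): register a function as an asset under its own name."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 35}]


@asset
def order_total(raw_orders):
    # The parameter name declares the dependency and receives the upstream data.
    return sum(row["amount"] for row in raw_orders)


def dependencies(name: str) -> list:
    """Read an asset's upstream assets from its function signature."""
    return list(inspect.signature(ASSETS[name]).parameters)


def materialize(name: str, cache=None):
    """Compute an asset after recursively computing whatever it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        inputs = {dep: materialize(dep, cache) for dep in dependencies(name)}
        cache[name] = ASSETS[name](**inputs)
    return cache[name]
```

Calling `materialize("order_total")` computes `raw_orders` first because the signature says so. The task dependency and the data dependency are the same declaration.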
The gap widens further as the size of your DAG increases, because Airflow has poor support for large DAGs, poor support for cross-DAG dependencies, and thinks in terms of execution dependencies, not data dependencies.
Asset lineage graphs span entire organizations. It’s common for a machine learning feature table to depend on a core dataset that’s maintained by an upstream team, which depends on a data source which is ingested by a different upstream team. Debugging a problem or understanding the downstream implications of a change often requires crossing these boundaries.
If you have a large network of tasks that produce data assets, you’re stuck with two sub-par options in Airflow:
- Split them up into smaller DAGs, but resort to awkward constructs like ExternalTaskSensors or Deferrable Operators for tracking dependencies between them.
- Keep them all in one huge DAG, which will be slow to load and couples the deployment of every asset with every other asset.
In Dagster, each software-defined asset knows which assets it depends on, so you don’t need a monolithic object to connect them all.
Asset sensors make it easy to carve up your asset graph into jobs connected by loose dependencies.
Execution vs. data dependencies
Airflow tracks execution dependencies - “run X after Y finishes running” - not data dependencies. This means you lose the trail in cases where the data for X depends on the data for Y, but they’re updated in different ways. For example, you might only re-train your ML model weekly, even though it uses data that’s updated hourly. When debugging, or reviewing a change, you’d still want to understand that dependency, and when backfilling your data, you might want to launch a run that includes refreshing the model too.
Also, as discussed above, Airflow has awkward constructs for modeling cross-DAG dependencies, so you might end up not capturing an execution dependency between DAGs because it’s difficult to express it.
Dagster exposes a single straightforward way of tracking dependencies between software-defined assets - both within jobs and across them. When debugging, reviewing changes, or backfilling, you have deep, asset-level lineage that goes all the way to the source.
You can learn more about asset-based debugging here.
dbt and other modern data stack tools
The tools that define the modern data stack, like dbt, Fivetran, Airbyte, and Meltano, think in data assets, not tasks. For example, a dbt model describes a table that’s meant to exist, not a task that’s meant to run.
Dagster’s asset focus allows it to have much deeper integrations with these tools than a task-based orchestrator like Airflow can.
For example, when using Airflow with dbt, you’re stuck with two sub-par options:
- Have an Airflow task that refers to an entire dbt project (or chunk of it). This gives limited insight into the dbt DAG and little ability to run subsets of it when you need to.
- Have an Airflow task for every dbt model in the project. This incurs heavy overhead for running every model.
Dagster’s dbt integration understands each dbt model as its own asset, but can run a set of models together within a single dbt invocation. This allows full understanding of the dbt DAG, without incurring overhead.
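The trade-off between the two Airflow options can be sketched in a few lines of plain Python. This is an illustration of the batching idea only, not the code of Dagster's dbt integration: the `ASSET_TO_MODEL` mapping and both helper functions are hypothetical. The only real interface assumed is dbt's `run` command with a `--select` list of model names.

```python
# Hypothetical mapping from orchestrator asset names to dbt model names.
ASSET_TO_MODEL = {
    "analytics/stg_orders": "stg_orders",
    "analytics/stg_customers": "stg_customers",
    "analytics/customer_ltv": "customer_ltv",
}


def per_model_commands(selected_assets: list, asset_to_model: dict) -> list:
    """One dbt process per model: full granularity, but startup cost is paid every time."""
    return [["dbt", "run", "--select", asset_to_model[a]] for a in selected_assets]


def batched_command(selected_assets: list, asset_to_model: dict) -> list:
    """One dbt process for all selected models, while each stays a separate asset."""
    models = sorted({asset_to_model[a] for a in selected_assets})
    if not models:
        raise ValueError("no dbt models selected")
    return ["dbt", "run", "--select", *models]
```

The orchestrator still tracks every model individually through the mapping, but choosing a subset of assets produces a single command line rather than one per model.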
Dagster can trace asset dependencies in a granular way across tools. It can understand that a particular dbt model depends on the data produced by a particular Airbyte stream. You can say “sync this table using Airbyte, run all the dbt models that are downstream of it, and the machine learning model that’s downstream of that.”
Container-Native Orchestration
Dagster and Airflow have different approaches to interfacing with user code, which have critical implications for dependency isolation and deployment management.
You tell Airflow about your Airflow DAGs by pointing it to a directory of Python files, which Airflow’s scheduler then evaluates directly and repeatedly. In contrast, you can tell Dagster about your Dagster jobs and assets by pointing it to containers, and Dagster loads their definitions over a stable gRPC interface.
Because the Airflow scheduler directly evaluates Python files, all the code that defines Airflow DAGs within a single Airflow installation must be able to run with the same set of Python library dependencies, including the Python dependencies of the Airflow scheduler. For example, you can’t have one of your DAGs depend on TensorFlow 1.x and another DAG depend on TensorFlow 2.x.
Most mature Airflow installations try to sidestep this issue by isolating task code from DAG code, with operators like the KubernetesPodOperator. This approach avoids the dependency issues, but at a big cost:
- It deeply exacerbates the difficulties with using Airflow DAGs in local development and testing, as covered in the section on running pipelines in different environments.
- It requires opting out of all of the integrations that come out-of-the-box with Airflow - if every task is a KubernetesPodOperator, then you’re not using the
- It makes TaskFlow and XCom even more difficult to use.
- Parameterizing tasks becomes very inconvenient. The task needs to accept a command line argument for every parameter, and the operator needs to pass the parameter through. We’ve heard users describe this as “trading dependency hell for CLI hell”.
- It requires coordinating separate deployed artifacts: the Python file that contains the DAG definition and the container that contains the task. It’s easy for these to get out of sync and cause errors.
In contrast, Dagster provides the tools to isolate different pipelines from each other. The typical way of deploying Dagster in multi-tenant environments involves containers. Each team or project packages its Dagster job and asset definitions into a container. The Dagster scheduler and web server can interact with multiple of these containers. This allows dependencies to be completely isolated between these teams or projects, within a single Dagster installation.
Upgrading the orchestrator
Because the Airflow scheduler directly evaluates the Python files that contain DAG definitions, upgrading an Airflow installation means upgrading the Airflow version of all DAGs, in lockstep. This can be disruptive - every line of code in every DAG file will run using the new Airflow library version, and changes to Airflow’s Python APIs will immediately break these DAGs. There’s no option to upgrade incrementally. This means, in practice, at-scale Airflow installations are extremely challenging to upgrade.
In contrast, the fact that Dagster job and asset definitions are packaged in containers means that it’s possible to upgrade the Dagster scheduler and web server without upgrading user code. The gRPC interface between the Dagster scheduler + web server and user containers is small and hasn’t broken even across major version bumps and API shifts. A single Dagster installation can host jobs that are defined in 1.0.7 and 0.14.0 side-by-side.
This also means that hosted versions of Dagster like Dagster Cloud can continually ship improvements without impacting user code.
Other Differences: Data Passing, Event-Driven Execution, and Backfills
So far, we’ve talked about the main themes for why Dagster is better than Airflow. But the small things can matter as much as the big ones. In this section, we compare Dagster’s and Airflow’s approaches to data passing, event-driven execution, and backfills.
XCom, Airflow’s scheme for passing data between tasks, is only designed for passing small messages, even when using a custom backend. This makes it a poor fit for the kinds of data in most data pipelines. The Dagster equivalent, I/O managers, is fully pluggable at the level of individual assets and ops. Your op can return a huge Spark dataframe, your I/O manager can store it in Snowflake with no extra overhead, and then your I/O manager can load it as input when running downstream ops.
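Here is a minimal sketch of the pluggable-storage idea in plain Python. The `handle_output` and `load_input` names echo Dagster's I/O manager concept, but the signatures are simplified and both classes are hypothetical stand-ins. A JSON file on disk plays the role that a warehouse like Snowflake would play in production.

```python
import json
from pathlib import Path


class InMemoryIOManager:
    """Keeps step outputs in a dict; convenient inside unit tests."""

    def __init__(self):
        self.values = {}

    def handle_output(self, name: str, obj) -> None:
        self.values[name] = obj

    def load_input(self, name: str):
        return self.values[name]


class JsonFileIOManager:
    """Persists step outputs as JSON files (a stand-in for real warehouse storage)."""

    def __init__(self, base_dir):
        self.base_dir = Path(base_dir)

    def _path(self, name: str) -> Path:
        return self.base_dir / f"{name}.json"

    def handle_output(self, name: str, obj) -> None:
        self._path(name).write_text(json.dumps(obj))

    def load_input(self, name: str):
        return json.loads(self._path(name).read_text())


def extract():
    return [3, 1, 2]


def transform(numbers):
    return sorted(numbers)


def run(io_manager):
    """Steps contain only business logic; all storage goes through the I/O manager."""
    io_manager.handle_output("extract", extract())
    upstream = io_manager.load_input("extract")
    io_manager.handle_output("transform", transform(upstream))
    return io_manager.load_input("transform")
```

`extract` and `transform` never mention where data lives, so `run(InMemoryIOManager())` and `run(JsonFileIOManager(some_dir))` execute identical business logic against different storage.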
It’s common to want to run a DAG whenever an event happens - e.g. when data arrives in your warehouse or when another DAG completes.
The best solutions that Airflow has for this – Sensors and Deferrable Operators – get it backwards: they expect you to start the DAG and then wait for something to happen. Sensors also take up resources on your cluster, and deferrable operators are poorly documented for event-driven execution. Both are brittle polling tasks, shoehorned into an architecture never meant to support event-driven execution.
Dagster sensors offer event-driven execution. If you can express something as “run X when Y is true”, you can write it as a sensor. Built-in sensors handle common cases: it takes just a few lines of code to launch a run whenever a run of another job completes or when a particular asset is materialized.
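The "run X when Y is true" shape can be sketched as a function that is evaluated on every tick and returns the runs to launch. The code below is a toy model in plain Python, not Dagster's sensor API: this `RunRequest` dataclass, the `load_file_job` name, and the assumption that file names sort chronologically are all made up for the example. The cursor is what keeps a file from being processed twice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRequest:
    """Toy description of a run to launch; run_key identifies the triggering event."""

    job_name: str
    run_key: str


def new_file_sensor(files_present: list, cursor: str):
    """Evaluate one tick: request a run for every file that sorts after the cursor.

    Returns (run_requests, new_cursor). Assumes file names embed a sortable date.
    """
    fresh = sorted(f for f in files_present if f > cursor)
    requests = [RunRequest("load_file_job", run_key=f) for f in fresh]
    new_cursor = fresh[-1] if fresh else cursor
    return requests, new_cursor
```

Nothing runs while the sensor waits. A tick that finds no new files returns an empty list and leaves the cursor unchanged, and a run exists only once there is an event to handle.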
If you want to run a backfill in Airflow, you need to run it from the CLI. This only works if you have terminal access to the machine running Airflow. After you’ve done so, there’s no record that you tried to backfill X tasks at Y time.
Dagster has full support for backfills in its UI. You can click a button and select a range of partitions to launch a backfill over, and Dagster will kick off a run for each partition. This is fault tolerant - if your scheduler dies while Dagster is submitting runs, it will pick up where it left off when it comes back up.
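The "one run per partition, resume after a crash" behavior can be sketched as follows. This is a simplified model in plain Python and not Dagster's backfill implementation: `daily_partitions`, `submit_backfill`, and the `submitted` set (standing in for durable state that survives a restart) are hypothetical, and `max_submissions` exists only to simulate the scheduler dying partway through.

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date) -> list:
    """Partition keys for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def submit_backfill(partitions: list, submitted: set, launch_run, max_submissions=None) -> int:
    """Launch one run per partition, skipping partitions already recorded as submitted.

    Returns the number of runs launched by this call.
    """
    launched = 0
    for key in partitions:
        if key in submitted:
            continue  # launched before the interruption; do not launch twice
        if max_submissions is not None and launched >= max_submissions:
            break  # simulated crash: stop partway through the backfill
        launch_run(key)
        submitted.add(key)
        launched += 1
    return launched
```

Because the record of what was submitted outlives the process, calling `submit_backfill` again with the same `submitted` set launches only the remaining partitions.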
After you've kicked off a backfill, you can use Dagster's UI to monitor it. You can see all the runs that correspond to your backfill and track how many of them are submitted / started / completed / failed.
Adopting Dagster when you're stuck with Airflow
Some of Dagster’s biggest users are organizations that also have large Airflow installations. If you have a large Airflow installation in your organization, it’s likely not going away any time soon. That doesn’t mean you can’t use Dagster.
In some cases, Dagster and Airflow sit side-by-side, independent of each other.
Dagster also includes an Airflow integration that enables using Dagster and Airflow together. This lets you take advantage of Dagster’s asset-based programming model and use Dagster’s lightweight local development flow, while deploying production DAGs to Airflow. You can develop new pipelines using Dagster, without overhauling the pipelines you’ve already built with Airflow and without fragmenting your production infrastructure.
That about wraps it up. Airflow is often used to build and run data pipelines, but it wasn't designed with a holistic view of what it takes to do so. Dagster was designed to help data teams build and run data pipelines: to develop data assets and keep them up-to-date. The impact on development velocity and production reliability is enormous.
The majority of Dagster users used to be Airflow users, and nearly every Dagster user considered Airflow at the time they chose Dagster.
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!