February 17, 2022 • 5 minute read
Rebundling the Data Platform
A fascinating post made the rounds yesterday in the data community entitled The Unbundling of Airflow. It is provocative and deserves the high level of interest it earned.
The article describes how the Airflow DAG is being “eaten” from the inside out. Across ingestion tools (Airbyte, Fivetran, Meltano), transformation tools (dbt), reverse ETL tools (Census, Hightouch), metrics layers (Transform), ML-focused systems (Continual), and others, computations are outsourced to specialized tools that have their own concepts and self-manage some degree of orchestration in their own domain. For example, a dbt project with hundreds of models, which would once have corresponded to hundreds of manually created Airflow tasks, collapses into a single Airflow task that outsources execution to the dbt runner.
What you are left with is the image in the Unbundling Airflow article. A constellation of tools, each with their own internal logical model of the assets or entities they manage.
Unbundling creates a whole new set of problems.
I don’t think anyone believes that this is an ideal end state, and the post itself notes that consolidation will be a necessity. Having this many tools without a coherent, centralized control plane is lunacy, and a terrible end state for data practitioners and their stakeholders. It results in an operationally fragile data platform that leaves everyone in a constant state of confusion about what ran, what's supposed to run, and whether things ran in the right order. And yet, this is the reality we are slouching toward in this “unbundled” world.
Some teams eschew orchestration altogether and instead use the barebones, cron-based scheduling systems that come with these products. Without any centralized orchestration, you are left with overlapping cron schedules and a “hope and pray” approach to operational robustness. One tool starts at midnight, the next, blindly, at 2 am, and if something goes wrong or is unexpectedly delayed, you are left scrambling to debug stale or broken assets within disconnected tools. If you need any computation in a programming language like Python, you are left with very little, if any, support.
Even when an orchestrator is used, it knows little about the inner workings of the data platform and the assets it manages. You are left with a tool that has less operational richness and does not provide a unified “single pane of glass” for your platform. Now you have operational silos, and you are forced into costly, invasive integrations with tools just to get the most basic information about the assets in your data platform and their status.
Ananth Packkildurai, the author of the Data Engineering Weekly newsletter, summarizes this state of affairs well:
"MDS is a set of vendor tools that solve niche data problems (lineage, orchestration, quality) with the side effect of creating a disjointed data workflow that makes data folks lives more complicated." Ananth Packkildurai (@ananthdurai), January 6, 2022
From imperative tasks to declarative assets
I received a number of messages from others in the ecosystem saying things like “shots fired!” specifically because of this quote in the article:
"It’s hard to make predictions in the ever evolving data landscape, but I am not sure if we need a better Airflow. Building a better Airflow feels like trying to optimize writing code that shouldn’t have been written in the first place."
The reality is that I couldn’t agree more. Earlier in the year the Dagster core team arrived at a similar conclusion. While there was a lot of value in building a better task-based orchestrator, it wasn’t in line with where we saw the ecosystem evolving. We saw more and more adjacent tools adopting a declarative, asset-based philosophy. We decided that an orchestrator that embraced that philosophy was needed to tame the huge array of problems that we bucket under Big Complexity. Every company has a data platform, and every company needs a cohesive data management solution and a control plane in that platform. Orchestration is a part of that story, but not the whole story.
Rebundling is here: Software-Defined Assets
Today we released Dagster 0.14.0, and it contains a new concept: software-defined assets. A proper introduction will be published next week, but tl;dr: It is a fundamentally new approach to orchestration that orients around assets rather than tasks. This enables not just orchestration, but also a fully integrated understanding of data assets declared in other tools. This "rebundles" the platform by ingesting these assets into a single surface that provides a base layer of lineage, observability, and data quality monitoring for your entire platform. Dagster users can use it directly with our built-in tools, and more specialized ecosystem tools can build on top of and integrate with this base layer as well.
Finally Python feels like a first-class citizen in the modern data stack.
First, it enables the authorship of software-defined assets directly in Python. Users no longer create a centralized DAG. Instead, they author Python functions that each represent a particular asset, and declare their upstream dependencies right inline using an elegant Pythonic API. This API is a huge step forward for practitioners building production ETL or model-training pipelines in Python.
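To make the idea of "functions as assets" concrete, here is a minimal standard-library sketch. It is not Dagster's API or implementation: the `asset` decorator, the `ASSETS` registry, the example asset names, and the convention that a parameter name refers to an upstream asset are all assumptions made for illustration.

```python
import inspect
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical registry: asset name -> function that computes the asset.
ASSETS = {}


def asset(fn):
    """Register fn as an asset named after the function (toy decorator)."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}]


@asset
def order_total(raw_orders):
    # In this sketch, the parameter name declares the upstream dependency.
    return sum(row["amount"] for row in raw_orders)


@asset
def report(order_total):
    return f"total={order_total}"


def upstream_names(fn):
    """Names of the assets fn depends on, read from its signature."""
    return list(inspect.signature(fn).parameters)


def materialize_all():
    """Compute every registered asset, upstream assets first."""
    graph = {name: upstream_names(fn) for name, fn in ASSETS.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn = ASSETS[name]
        results[name] = fn(*(results[dep] for dep in upstream_names(fn)))
    return results
```

Calling `materialize_all()` computes `raw_orders`, then `order_total`, then `report`, without anyone writing a central DAG definition; the ordering falls out of the function signatures.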
Python is essential in the modern data ecosystem, used by millions of data practitioners of all stripes. It matters and it is not going anywhere. With software-defined assets in Dagster, Python finally feels like a first-class citizen in the modern data stack.
All your assets are now directly viewable and operable in a single, unified fabric.
Second, it enables the explicit modeling and operation of entities such as dbt models and tables in Airbyte destinations natively in Dagster. This means the user is still defining these concepts in the tool of their choice, but Dagster pulls in this information via an integration which understands both its logical model and how to execute it.
Whether it is an asset defined in another tool, or defined directly in Dagster as Python, it ends up part of a unified fabric (dare we call it mesh?!) of assets that serves as the system of record for all the data assets in your platform, and a control plane to operate the computations that produce those assets.
Here is a more tangible example: say you use Airbyte to ingest data, dbt to transform it, and then a custom Python transformation for ML. Prior to software-defined assets, this would have appeared in Dagster as it would in other task-based orchestrators, as a three-node graph: one node for Airbyte, one for dbt, and one for the Python transformation. In Airflow or any other task-based orchestration system, it would look very similar.
Software-defined assets completely change the mental model. Dagster ingests the Airbyte sources and dbt models natively, interleaves them directly with Python data assets, and models them all in Dagster as software-defined assets.
The code to construct this is surprisingly minimal. We are leveraging the work already done in other tools to construct this unified view and then interleaving it with the Python asset. The system constructs the DAG on your behalf, connecting the assets via their names.
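The name-based wiring described above can be sketched in a few lines of plain Python. This is an illustration only, not the Dagster integration code: the three dictionaries stand in for asset metadata that integrations would load from Airbyte, dbt, and Python, and every asset name and helper function here is hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical metadata: asset name -> names of its upstream assets.
airbyte_assets = {"orders": [], "users": []}
dbt_assets = {
    "orders_cleaned": ["orders"],
    "users_cleaned": ["users"],
    "daily_order_summary": ["orders_cleaned", "users_cleaned"],
}
python_assets = {"order_forecast_model": ["daily_order_summary"]}


def build_asset_graph(*sources):
    """Merge per-tool asset definitions into one graph keyed by asset name."""
    graph = {}
    for source in sources:
        for name, deps in source.items():
            if name in graph:
                raise ValueError(f"duplicate asset name: {name}")
            graph[name] = set(deps)
    # Every referenced upstream name must be defined by some tool.
    referenced = {dep for deps in graph.values() for dep in deps}
    missing = referenced - set(graph)
    if missing:
        raise ValueError(f"unknown upstream assets: {sorted(missing)}")
    return graph


def lineage(graph, name):
    """Return all transitive upstream assets of the given asset."""
    seen = set()
    stack = list(graph[name])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


graph = build_asset_graph(airbyte_assets, dbt_assets, python_assets)
order = list(TopologicalSorter(graph).static_order())
```

Because the assets are linked purely by name, no tool needs to know about the others: the dbt metadata refers to `orders`, and the merged graph resolves that to the asset of the same name contributed by the Airbyte metadata.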
The graph you have been waiting for.
From this single graph you can view, operate, and observe all the data assets in your platform. You can see their lineage, metadata, their status in the warehouse, navigate to the last run that produced them, and more. It, in effect, "rebundles" the data platform by natively understanding the asset-based approach embraced by modern data tools. It is a massive leap forward for data management.
Note that these two graphs co-exist. The graph of ops (Dagster's task-esque abstraction) is an execution graph that is able to materialize the graph of assets. This means that many assets can be materialized by a single op, as is the case with Airbyte and dbt in the above example. This also means that software-defined assets are built on Dagster’s underlying, battle-tested orchestration framework and infrastructure. And you can also code directly against those abstractions for pure workflow use cases, which will continue to exist.
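One way to picture how the two graphs relate is to collapse a graph of assets into a graph of the steps that produce them. The sketch below is a toy model under stated assumptions, not Dagster's internals: the `ASSETS` table, the op names (`airbyte_sync`, `dbt_run`, `train_forecast`), and both helper functions are hypothetical.

```python
# Hypothetical table: asset name -> (upstream asset names, op that materializes it).
ASSETS = {
    "orders": ([], "airbyte_sync"),
    "users": ([], "airbyte_sync"),
    "orders_cleaned": (["orders"], "dbt_run"),
    "daily_order_summary": (["orders_cleaned", "users"], "dbt_run"),
    "order_forecast_model": (["daily_order_summary"], "train_forecast"),
}


def assets_by_op(assets):
    """Group asset names by the op that materializes them."""
    grouped = {}
    for name, (_, op) in assets.items():
        grouped.setdefault(op, []).append(name)
    return grouped


def op_graph(assets):
    """Derive op -> upstream ops from the asset-level dependencies."""
    ops = {}
    for _, (deps, op) in assets.items():
        upstream_ops = ops.setdefault(op, set())
        for dep in deps:
            dep_op = assets[dep][1]
            if dep_op != op:  # dependencies inside one op add no edge
                upstream_ops.add(dep_op)
    return ops
```

In this toy example, five assets collapse into a three-step execution graph: `airbyte_sync` materializes two assets, `dbt_run` materializes two more, and `train_forecast` materializes one.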
0.14.0 is an incredibly exciting release for us, and we’ll be rolling out tons of content in the next few weeks. Software-defined assets are a paradigm shift, and we’ll be writing a lot about them. We also have tons of API improvements, new integrations with Airbyte and Pandera, and a completely new operational home page for Dagit we’ve dubbed the “Factory Floor.”
We're always happy to hear your feedback, so please reach out to us! If you have any questions, ask them in the Dagster community Slack (join here!) or start a GitHub discussion. If you run into any bugs, let us know with a GitHub issue. And if you're interested in working with us, check out our open roles!