How Dagster Balances Open Source and SaaS

The relationship between Dagster, the open-source project, and Dagster Cloud, our hosted SaaS platform.

With the announcement of Dagster 1.0 and Dagster Cloud, we thought it was important to articulate the relationship between Dagster, the open-source project, and Dagster Cloud, our hosted SaaS platform. We’ll get into the details (and the thinking behind them) in what follows, but the TL;DR is:

We are committed to making Dagster a ubiquitous open standard.
A successful, Dagster-based commercial product is essential for the success of the technology, and is a win-win for all stakeholders. Simply put, many organizations will not adopt Dagster without the option of a commercially hosted solution; and the project must be backed by a company with a viable business model. We want to build an awesome project and an awesome business.
We will decide what we provide exclusively in the commercial product based on clear, consistent criteria.

Why we built Dagster Cloud on top of the open-source project

In July 2019 the Elementl team introduced the first release of Dagster. Dagster was a new approach to building highly maintainable data pipelines with high velocity. In particular, Dagster was designed to apply software engineering and DevOps best practices to the discipline of data orchestration, a mentality absent in other orchestration solutions.

Our goal then, as it is today, was to make Dagster a ubiquitous, open standard. We want users to build tools on top of Dagster that we would never have anticipated and that greatly extend the system’s capabilities. This supports both our mission, the user community, and our ambitions for growth.

Dagster Cloud is one example of what you can build on top of OSS Dagster if you’re trying to solve for enterprise data needs. Just like our open-source tooling, the commercial product is built on top of Dagster’s open-source APIs and integration points. But it’s not the only way you can extend the system.

Some users of Dagster have told us they could never adopt a hosted commercial solution like Dagster Cloud, because of their own preferences and requirements. This is fine! There is no single commercial solution or strategy that can address everyone’s needs.

Driving adoption of Dagster with Dagster Cloud

As our initial users have told us time and time again that adopting Dagster helps them structure their data pipelines in ways that make more sense, are much faster to develop, and are easier to change when needed.

“Dagster has really helped with the team productivity and really increasing our development velocity... We can really confidently push changes to production and get features out the door as quickly as we can, as quickly as a developer can write them and get them tested.” - Zachary Romer, Data Infrastructure Engineer at Empirico Inc.

Users recognized early the core value of our approach and overcame the early barriers to adoption that come with any new technology. But their situation is not unique or special: adopting Dagster will be beneficial to all data teams no matter their level of resources, scale or sophistication. That means Dagster needs to be accessible to everyone.

We see Dagster Cloud doing just that: making Dagster more accessible and therefore driving open-source adoption. Dagster Cloud users run the gamut: from small organizations without ops capability to run their own infrastructure, to the largest organizations with intricate needs around access control, collaboration, and security. For all of those organizations, Dagster Cloud makes Dagster more accessible and easy to adopt. This serves our goal of maximizing adoption of the core framework and encouraging teams to extend their data platforms in new creative ways.

Our long-term goals are around adoption and scale: To fully support our open-source community while also maximizing the number of users who find Cloud to be an attractive option, with a fair, sustainable pricing model. We want the smallest orgs—the data teams of one—to participate in the win-win proposition offered by centralized management of operational complexity. But we also need to fairly price the system, for players who reap enormous value from the management of operational and enterprise complexity at scale.

Drawing the line: Three types of complexity

So, what capabilities will be open-source versus proprietary?

Without any structure around this question, customers of open-core companies are left to the whims of those companies’ product leaders, who may shift positions based on competitive dynamics, or pressure to drive revenue. The question we want to ask instead is: what belongs in a hosted product like Dagster Cloud?

To answer this, we think in terms of three different classes of complexity addressed by the systems we’re building: application complexity, operational complexity, and enterprise complexity.

Application complexity is the work users do to structure their code, and will always be in the province of the open-source Dagster Framework.
Enterprise complexity are features that help you manage Dagster within larger organizations––such as RBAC––and is squarely in the domain of Dagster Cloud.
Operational complexity is the work to run Dagster code in production. We optimize for the sustainability and the speed of development. That means capabilities that require complex product ontologies and stateful distributed systems will be in Dagster Cloud, as they benefit from economies of scale and continuous feedback. As Dagster Cloud is composed of the open-source infrastructure and UI components, benefits will naturally flow to open source.

Application Complexity

Application complexity refers to all the work that data practitioners do to structure their code and write their business logic. Framework features that make Dagster projects so expressive, so productive to work in, and so easy to maintain, fall into this class of complexity, as do integrations that allow Dagster to work with other technologies. We view application complexity as firmly in the domain of the Dagster framework, which lives alongside user code in the same process and directly invokes that code. This in-process framework, its core abstractions, the APIs to interact with it will forever and always be open source.

Composability is a powerful force in software. Composable software can be used in more contexts, embedded directly into other applications and systems, and is more likely to be ubiquitous.

Composability combined with transparency is particularly powerful, making it possible for people to repurpose the software in interesting ways––ways that the core team could never have foreseen––and making them feel more empowered to do so.

Members of our community have built their own WYSIWYG editors and novel DSLs on top of our core programming model for their own internal data platforms. They’ve built sophisticated, custom operational tools by building against our database directly. We learn from our community and incorporate the best ideas into our roadmap and introduce new APIs and pluggability points. This virtuous cycle is impossible in closed, proprietary software.

In addition to composability and transparency, we also focus on layering. Every tool we create is built on open standards and APIs that are accessible to users, and free of licensing constraints. Any capability exposed in our UI is implemented with a GraphQL API so that users can build their own custom tooling or integrations on top of it. Our infrastructure tools and services interact with user-defined code over a gRPC interface, providing another layer where users can build their own infrastructure components.

By committing to composability, transparency, and layering, Dagster becomes more than a static framework that includes some fixed set of centrally provisioned tools: It is a piece of software that can be repurposed, extended, and integrated into other software in novel ways. And it is a platform for tool building.

Application complexity will always be the concern of Dagster’s open core.

Operational Complexity

Operational complexity is about how that code runs. Open-source Dagster powers data platforms at many companies today. We want that adoption to continue and for open-source Dagster to continue to be a trusted solution for mission-critical workloads for individuals and teams who want to manage their own infrastructure. The core execution engine will remain in the open-source domain.

However, most organizations do not want to self-host software as stateful as an orchestrator. In order to do so successfully, it is necessary to devote valuable engineering resources to deploy and maintain that software. And those engineers do not know the system as well as the core team.

Beyond the lack of desire to self-host, there are also massive economies of scale with a centralized service: the core team has complete visibility, can redeploy changes at will, immediately see the results, and react accordingly. This is orders of magnitude faster than waiting for an open-source community to upgrade their services and then report back results.

There are a whole set of services and capabilities that are more efficient to develop centrally, because of product complexity and economies of scale. They typically involve lots of state, interconnected distributed components, and complex product ontologies.

What are capabilities that are not subject to economies of scale? An example is illustrative. We want users to be able to self-host Dagster deployments on Kubernetes, as it is the most popular container orchestration framework. If we found a bug in our open-source integration as a result of our own Cloud development, we would immediately ship that to our open-source users. Why? First of all, not doing so is ridiculous and would be contrary to our values. But there are also no economies of scale from which we’d benefit by not fixing this issue.

Our goal is to maximize the speed of innovation and development and so sustainable execution speed is our primary decision-making criteria when navigating the open-source/proprietary divide in the operational domain. Architecturally Dagster Cloud is composed of open-source Dagster components, and therefore work to improve Cloud naturally accrues to those components.

Enterprise Complexity

Finally, enterprise complexity is about how an operational data platform is embedded in a larger organization. This kind of complexity includes the collaboration issues that stem from managing large teams and multiple personas with access to the data platform, but it also includes issues of compliance, security, and communication with nontechnical stakeholders. Federated identity management, audit logs, and complex networking requirements are some examples of features that handle this type of complexity, which we view as squarely in the domain of Dagster Cloud.

Organizations who want these capabilities want to pay for them. They want the contract and all the legal guarantees and business norms that surround commercial contracts. They need the formal arrangement between institutions that can outlive the relationship between any particular stakeholder.

These capabilities also make the product ontology much more complex. The data team of one, spinning Dagit up locally for the first time to turn a messy directory of Jupyter notebooks into a repeatable set of data pipelines, shouldn’t be confronted with the complex hierarchies that are essential for a large data platform team at a regulated enterprise.

Conclusion

So, what’s next?

Elementl is out to empower every organization to build a productive and scalable data platform. This is critical to ensure data teams have the tools to deliver on the inherent value found in data (but often left untapped).

With our upcoming launch of Dagster Cloud to the general public, our objective is that players of all sizes can participate in the win-win proposition of adopting the technology. We do this by adopting a fair, pay-as-you-go model where it makes sense for everyone from the solo practitioner to the largest enterprise to use the technology.

We hope this article has helped make it clear how we’re approaching the challenges of building an open-source project and an open-core business. It’s going to be an incredible decade for this project, and we encourage you to get involved: in Slack, in our docs, in our codebase, or in evangelism.

Have feedback or questions? Start a discussion in Slack or Github.

Interested in working with us? View our open roles.

Want more content like this? Follow us on LinkedIn.